* [PATCH v4 0/4] Fix request completion holes
@ 2017-11-20 11:30 ` Sagi Grimberg
  0 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2017-11-20 11:30 UTC (permalink / raw)
  To: linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org
  Cc: Christoph Hellwig

We have two holes in nvme-rdma when completing requests.

1. We never wait for the send work request to complete before completing
a request. It is possible that the HCA retries a send operation (due
to a dropped ack) after the nvme cqe has already arrived back at the host.
If we unmap the host buffer upon reception of the cqe, the HCA might
get iommu errors when attempting to access an unmapped host buffer.
We must also wait for the send completion before completing a request;
most of the time it will arrive before the nvme cqe does, so we only pay
for the extra cq entry processing.

2. We don't wait for the request memory region to be fully invalidated
when the target didn't invalidate it remotely. We must wait for the local
invalidation to complete before completing the request.

Note that we might face two concurrent completion processing contexts for
a single request: one is the ib_cq irq-poll context and the other is
blk_mq_poll, which is invoked for IOCB_HIPRI requests. Thus the
send/receive completion updates need to be atomic. A per-request refcount
(replacing the per-request lock used in earlier versions of this series)
guarantees that the request is completed exactly once.

Thanks to Christoph for suggesting request refcounts instead of a private
lock.
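
In a nutshell, the scheme used by patches 2-4 looks roughly like this
(a simplified sketch, not the actual driver code; "remotely_invalidated"
stands for the IB_WC_WITH_INVALIDATE/rkey check, everything else uses
the helpers the patches actually use):

	/* at submission: one reference per completion source */
	refcount_set(&req->ref, 2);	/* send completion + nvme cqe */

	/* SEND completion handler */
	if (refcount_dec_and_test(&req->ref))
		nvme_end_request(rq, req->status, req->result);

	/* nvme cqe (recv) handler */
	if (remotely_invalidated) {
		req->mr->need_inval = false;
	} else if (req->mr->need_inval) {
		/* queue a signaled LOCAL_INV; its completion will drop
		 * the last reference (see below), so don't drop it here */
		nvme_rdma_inv_rkey(queue, req);
		return;
	}
	if (refcount_dec_and_test(&req->ref))
		nvme_end_request(rq, req->status, req->result);

	/* LOCAL_INV completion handler */
	if (refcount_dec_and_test(&req->ref))
		nvme_end_request(rq, req->status, req->result);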


Changes from v3:
- Added a patch for detecting bogus remote invalidation (while we're in the area)
- Saved a third atomic op for local invalidate (Christoph)
- Pulled the status and result from the nvme cqe into nvme_rdma_request

Changes from v2:
- Fixed send completion signalling in patch 1 (still signal conditionally
  as we still want to suppress it for async events)
- Replaced req->lock with req->ref as a micro-optimization (Christoph)

Changes from v1:
- Added atomic send/resp_completed updates (via a per-request lock)

Sagi Grimberg (4):
  nvme-rdma: don't suppress send completions
  nvme-rdma: don't complete requests before a send work request has
    completed
  nvme-rdma: wait for local invalidation before completing a request
  nvme-rdma: Check remotely invalidated rkey matches our expected rkey

 drivers/nvme/host/rdma.c | 118 +++++++++++++++++++++++++----------------------
 1 file changed, 62 insertions(+), 56 deletions(-)

-- 
2.14.1


* [PATCH v4 1/4] nvme-rdma: don't suppress send completions
  2017-11-20 11:30 ` Sagi Grimberg
@ 2017-11-20 11:30     ` Sagi Grimberg
  -1 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2017-11-20 11:30 UTC (permalink / raw)
  To: linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org
  Cc: Christoph Hellwig

The entire completion suppression mechanism is currently
broken because the HCA might retry a send operation
(due to a dropped ack) after the nvme transaction has completed.

In order to handle this, we signal all send completions (besides
the async event, which does not race with anything).

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 46 +++++++++-------------------------------------
 1 file changed, 9 insertions(+), 37 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index fdd6659a09a0..85c98589a5e0 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -77,7 +77,6 @@ enum nvme_rdma_queue_flags {
 
 struct nvme_rdma_queue {
 	struct nvme_rdma_qe	*rsp_ring;
-	atomic_t		sig_count;
 	int			queue_size;
 	size_t			cmnd_capsule_len;
 	struct nvme_rdma_ctrl	*ctrl;
@@ -510,7 +509,6 @@ static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
 		queue->cmnd_capsule_len = sizeof(struct nvme_command);
 
 	queue->queue_size = queue_size;
-	atomic_set(&queue->sig_count, 0);
 
 	queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
 			RDMA_PS_TCP, IB_QPT_RC);
@@ -1204,21 +1202,9 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
-/*
- * We want to signal completion at least every queue depth/2.  This returns the
- * largest power of two that is not above half of (queue size + 1) to optimize
- * (avoid divisions).
- */
-static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
-{
-	int limit = 1 << ilog2((queue->queue_size + 1) / 2);
-
-	return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
-}
-
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
-		struct ib_send_wr *first, bool flush)
+		struct ib_send_wr *first, bool signal)
 {
 	struct ib_send_wr wr, *bad_wr;
 	int ret;
@@ -1234,24 +1220,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	wr.sg_list    = sge;
 	wr.num_sge    = num_sge;
 	wr.opcode     = IB_WR_SEND;
-	wr.send_flags = 0;
-
-	/*
-	 * Unsignalled send completions are another giant desaster in the
-	 * IB Verbs spec:  If we don't regularly post signalled sends
-	 * the send queue will fill up and only a QP reset will rescue us.
-	 * Would have been way to obvious to handle this in hardware or
-	 * at least the RDMA stack..
-	 *
-	 * Always signal the flushes. The magic request used for the flush
-	 * sequencer is not allocated in our driver's tagset and it's
-	 * triggered to be freed by blk_cleanup_queue(). So we need to
-	 * always mark it as signaled to ensure that the "wr_cqe", which is
-	 * embedded in request's payload, is not freed when __ib_process_cq()
-	 * calls wr_cqe->done().
-	 */
-	if (nvme_rdma_queue_sig_limit(queue) || flush)
-		wr.send_flags |= IB_SEND_SIGNALED;
+	wr.send_flags = signal ? IB_SEND_SIGNALED : 0;
 
 	if (first)
 		first->next = &wr;
@@ -1322,6 +1291,12 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
 	ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd),
 			DMA_TO_DEVICE);
 
+	/*
+	 * async events do not race with reply completions and don't
+	 * contain inline data, thus are safe to suppress. so in order
+	 * to avoid the complexity of detecting async event send completions
+	 * in the hot path we simply suppress their send completions.
+	 */
 	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL, false);
 	WARN_ON_ONCE(ret);
 }
@@ -1607,7 +1582,6 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_rdma_qe *sqe = &req->sqe;
 	struct nvme_command *c = sqe->data;
-	bool flush = false;
 	struct ib_device *dev;
 	blk_status_t ret;
 	int err;
@@ -1639,10 +1613,8 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ib_dma_sync_single_for_device(dev, sqe->dma,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 
-	if (req_op(rq) == REQ_OP_FLUSH)
-		flush = true;
 	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
-			req->mr->need_inval ? &req->reg_wr.wr : NULL, flush);
+			req->mr->need_inval ? &req->reg_wr.wr : NULL, true);
 	if (unlikely(err)) {
 		nvme_rdma_unmap_data(queue, rq);
 		goto err;
-- 
2.14.1


* [PATCH v4 2/4] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20 11:30 ` Sagi Grimberg
@ 2017-11-20 11:30     ` Sagi Grimberg
  -1 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2017-11-20 11:30 UTC (permalink / raw)
  To: linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org
  Cc: Christoph Hellwig

In order to guarantee that the HCA will never get an access violation
(either from an invalidated rkey or from the iommu) when retrying a send
operation, we must complete a request only when both the send completion
and the nvme cqe have arrived. We need to update the send/recv completion
state atomically because more than a single context might access the
request concurrently (one is the cq irq-poll context and the other is
the user polling used for IOCB_HIPRI).

Only then is it safe to invalidate the rkey (if needed), unmap
the host buffers, and complete the I/O.
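
In a nutshell (a sketch of the idea; the actual changes are in the diff
below): each request starts with two references, one per completion
source, and whoever drops the last reference completes the request, so
neither context can complete it while the other may still touch it:

	refcount_set(&req->ref, 2);	/* SEND completion + nvme cqe */

	/* in both the send and the recv completion handlers */
	if (refcount_dec_and_test(&req->ref))
		nvme_end_request(rq, req->status, req->result);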

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 85c98589a5e0..9202cfa9300b 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -59,6 +59,9 @@ struct nvme_rdma_request {
 	struct nvme_request	req;
 	struct ib_mr		*mr;
 	struct nvme_rdma_qe	sqe;
+	union nvme_result	result;
+	__le16			status;
+	refcount_t		ref;
 	struct ib_sge		sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
 	u32			num_sge;
 	int			nents;
@@ -1162,6 +1165,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 	req->num_sge = 1;
 	req->inline_data = false;
 	req->mr->need_inval = false;
+	refcount_set(&req->ref, 2); /* send and recv completions */
 
 	c->common.flags |= NVME_CMD_SGL_METABUF;
 
@@ -1198,8 +1202,19 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 
 static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	if (unlikely(wc->status != IB_WC_SUCCESS))
+	struct nvme_rdma_qe *qe =
+		container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
+	struct nvme_rdma_request *req =
+		container_of(qe, struct nvme_rdma_request, sqe);
+	struct request *rq = blk_mq_rq_from_pdu(req);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		nvme_rdma_wr_error(cq, wc, "SEND");
+		return;
+	}
+
+	if (refcount_dec_and_test(&req->ref))
+		nvme_end_request(rq, req->status, req->result);
 }
 
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
@@ -1318,14 +1333,19 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	}
 	req = blk_mq_rq_to_pdu(rq);
 
-	if (rq->tag == tag)
-		ret = 1;
+	req->status = cqe->status;
+	req->result = cqe->result;
 
 	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
 	    wc->ex.invalidate_rkey == req->mr->rkey)
 		req->mr->need_inval = false;
 
-	nvme_end_request(rq, cqe->status, cqe->result);
+	if (refcount_dec_and_test(&req->ref)) {
+		if (rq->tag == tag)
+			ret = 1;
+		nvme_end_request(rq, req->status, req->result);
+	}
+
 	return ret;
 }
 
-- 
2.14.1


* [PATCH v4 3/4] nvme-rdma: wait for local invalidation before completing a request
  2017-11-20 11:30 ` Sagi Grimberg
@ 2017-11-20 11:31     ` Sagi Grimberg
  -1 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2017-11-20 11:31 UTC (permalink / raw)
  To: linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org
  Cc: Christoph Hellwig

We must not complete a request before the host memory region is
invalidated. Luckily we have send-with-invalidate protocol support,
so we usually don't need to issue a local invalidation ourselves, but
in case the target did not invalidate the memory region for us, we
must wait for the invalidation to complete before unmapping host
memory and completing the I/O.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 9202cfa9300b..363dd1ca1a82 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1019,8 +1019,18 @@ static void nvme_rdma_memreg_done(struct ib_cq *cq, struct ib_wc *wc)
 
 static void nvme_rdma_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	if (unlikely(wc->status != IB_WC_SUCCESS))
+	struct nvme_rdma_request *req =
+		container_of(wc->wr_cqe, struct nvme_rdma_request, reg_cqe);
+	struct request *rq = blk_mq_rq_from_pdu(req);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		nvme_rdma_wr_error(cq, wc, "LOCAL_INV");
+		return;
+	}
+
+	if (refcount_dec_and_test(&req->ref))
+		nvme_end_request(rq, req->status, req->result);
+
 }
 
 static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue,
@@ -1031,7 +1041,7 @@ static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue,
 		.opcode		    = IB_WR_LOCAL_INV,
 		.next		    = NULL,
 		.num_sge	    = 0,
-		.send_flags	    = 0,
+		.send_flags	    = IB_SEND_SIGNALED,
 		.ex.invalidate_rkey = req->mr->rkey,
 	};
 
@@ -1045,24 +1055,12 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
 		struct request *rq)
 {
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
-	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
 	struct nvme_rdma_device *dev = queue->device;
 	struct ib_device *ibdev = dev->dev;
-	int res;
 
 	if (!blk_rq_bytes(rq))
 		return;
 
-	if (req->mr->need_inval && test_bit(NVME_RDMA_Q_LIVE, &req->queue->flags)) {
-		res = nvme_rdma_inv_rkey(queue, req);
-		if (unlikely(res < 0)) {
-			dev_err(ctrl->ctrl.device,
-				"Queueing INV WR for rkey %#x failed (%d)\n",
-				req->mr->rkey, res);
-			nvme_rdma_error_recovery(queue->ctrl);
-		}
-	}
-
 	ib_dma_unmap_sg(ibdev, req->sg_table.sgl,
 			req->nents, rq_data_dir(rq) ==
 				    WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
@@ -1337,8 +1335,19 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	req->result = cqe->result;
 
 	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
-	    wc->ex.invalidate_rkey == req->mr->rkey)
+	    wc->ex.invalidate_rkey == req->mr->rkey) {
 		req->mr->need_inval = false;
+	} else if (req->mr->need_inval) {
+		ret = nvme_rdma_inv_rkey(queue, req);
+		if (unlikely(ret < 0)) {
+			dev_err(queue->ctrl->ctrl.device,
+				"Queueing INV WR for rkey %#x failed (%d)\n",
+				req->mr->rkey, ret);
+			nvme_rdma_error_recovery(queue->ctrl);
+		}
+		/* the local invalidation completion will end the request */
+		return 0;
+	}
 
 	if (refcount_dec_and_test(&req->ref)) {
 		if (rq->tag == tag)
-- 
2.14.1


* [PATCH v4 4/4] nvme-rdma: Check remotely invalidated rkey matches our expected rkey
  2017-11-20 11:30 ` Sagi Grimberg
@ 2017-11-20 11:31     ` Sagi Grimberg
  -1 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2017-11-20 11:31 UTC (permalink / raw)
  To: linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org
  Cc: Christoph Hellwig

If we get a remote invalidation for a bogus rkey, this is a protocol
error. Trigger controller error recovery if it happens.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 363dd1ca1a82..4ba58939bdc6 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1334,8 +1334,13 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	req->status = cqe->status;
 	req->result = cqe->result;
 
-	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
-	    wc->ex.invalidate_rkey == req->mr->rkey) {
+	if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
+		if (unlikely(wc->ex.invalidate_rkey != req->mr->rkey)) {
+			dev_err(queue->ctrl->ctrl.device,
+				"Bogus remote invalidation for rkey %#x\n",
+				req->mr->rkey);
+			nvme_rdma_error_recovery(queue->ctrl);
+		}
 		req->mr->need_inval = false;
 	} else if (req->mr->need_inval) {
 		ret = nvme_rdma_inv_rkey(queue, req);
-- 
2.14.1


* Re: [PATCH v4 0/4] Fix request completion holes
  2017-11-20 11:30 ` Sagi Grimberg
@ 2017-11-21 13:18     ` Christoph Hellwig
  -1 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2017-11-21 13:18 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, Christoph Hellwig

Thanks Sagi,

the series looks fine to me.  But I'd really like to get another
review before applying it.

* Re: [PATCH v4 0/4] Fix request completion holes
  2017-11-21 13:18     ` Christoph Hellwig
@ 2017-11-22 11:28         ` Sagi Grimberg
  -1 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2017-11-22 11:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org

> Thanks Sagi,
> 
> the series looks fine to me.  But I'd really like to get another
> review before applying it.

Let's hope someone steps up... I'd say give it another
day or two and if no feedback we can pull it in.

* Re: [PATCH v4 0/4] Fix request completion holes
  2017-11-22 11:28         ` Sagi Grimberg
@ 2017-11-22 11:37             ` Max Gurtovoy
  -1 siblings, 0 replies; 22+ messages in thread
From: Max Gurtovoy @ 2017-11-22 11:37 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org



On 11/22/2017 1:28 PM, Sagi Grimberg wrote:
>> Thanks Sagi,
>>
>> the series looks fine to me.  But I'd really like to get another
>> review before applying it.
> 
> Let's hope someone steps up... I'd say give it another
> day or two and if no feedback we can pull it in.

I'll review it today.


* Re: [PATCH v4 1/4] nvme-rdma: don't suppress send completions
  2017-11-20 11:30     ` Sagi Grimberg
@ 2017-11-22 15:37         ` Max Gurtovoy
  -1 siblings, 0 replies; 22+ messages in thread
From: Max Gurtovoy @ 2017-11-22 15:37 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org
  Cc: Christoph Hellwig



On 11/20/2017 1:30 PM, Sagi Grimberg wrote:
> The entire completions suppress mechanism is currently
> broken because the HCA might retry a send operation
> (due to dropped ack) after the nvme transaction has completed.
> 
> In order to handle this, we signal all send completions (besides
> async event which is not racing anything).
> 
> Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> ---
>   drivers/nvme/host/rdma.c | 46 +++++++++-------------------------------------
>   1 file changed, 9 insertions(+), 37 deletions(-)
> 

Looks good,

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>

* Re: [PATCH v4 2/4] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20 11:30     ` Sagi Grimberg
@ 2017-11-22 16:06         ` Max Gurtovoy
  -1 siblings, 0 replies; 22+ messages in thread
From: Max Gurtovoy @ 2017-11-22 16:06 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org
  Cc: Christoph Hellwig



On 11/20/2017 1:30 PM, Sagi Grimberg wrote:
> In order to guarantee that the HCA will never get an access violation
> (either from invalidated rkey or from iommu) when retrying a send
> operation we must complete a request only when both send completion
> and the nvme cqe has arrived. We need to set the send/recv completions
> flags atomically because we might have more than a single context
> accessing the request concurrently (one is cq irq-poll context and
> the other is user-polling used in IOCB_HIPRI).
> 
> Only then we are safe to invalidate the rkey (if needed), unmap
> the host buffers, and complete the IO.
> 
> Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> ---
>   drivers/nvme/host/rdma.c | 28 ++++++++++++++++++++++++----
>   1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 85c98589a5e0..9202cfa9300b 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -59,6 +59,9 @@ struct nvme_rdma_request {
>   	struct nvme_request	req;
>   	struct ib_mr		*mr;
>   	struct nvme_rdma_qe	sqe;
> +	union nvme_result	result;
> +	__le16			status;
> +	refcount_t		ref;
>   	struct ib_sge		sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
>   	u32			num_sge;
>   	int			nents;
> @@ -1162,6 +1165,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
>   	req->num_sge = 1;
>   	req->inline_data = false;
>   	req->mr->need_inval = false;
> +	refcount_set(&req->ref, 2); /* send and recv completions */
>   
>   	c->common.flags |= NVME_CMD_SGL_METABUF;
>   
> @@ -1198,8 +1202,19 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
>   
>   static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
>   {
> -	if (unlikely(wc->status != IB_WC_SUCCESS))
> +	struct nvme_rdma_qe *qe =
> +		container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
> +	struct nvme_rdma_request *req =
> +		container_of(qe, struct nvme_rdma_request, sqe);

What will happen if we get here from a qe that belongs to an async_event
post_send request (a completion with error)?
The container_of will be wrong...


> +	struct request *rq = blk_mq_rq_from_pdu(req);
> +
> +	if (unlikely(wc->status != IB_WC_SUCCESS)) {
>   		nvme_rdma_wr_error(cq, wc, "SEND");
> +		return;
> +	}
> +
> +	if (refcount_dec_and_test(&req->ref))
> +		nvme_end_request(rq, req->status, req->result);


* Re: [PATCH v4 3/4] nvme-rdma: wait for local invalidation before completing a request
  2017-11-20 11:31     ` Sagi Grimberg
@ 2017-11-22 16:45         ` Max Gurtovoy
  -1 siblings, 0 replies; 22+ messages in thread
From: Max Gurtovoy @ 2017-11-22 16:45 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org
  Cc: Christoph Hellwig


> @@ -1337,8 +1335,19 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
>   	req->result = cqe->result;
>   
>   	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
> -	    wc->ex.invalidate_rkey == req->mr->rkey)
> +	    wc->ex.invalidate_rkey == req->mr->rkey) {
>   		req->mr->need_inval = false;
> +	} else if (req->mr->need_inval) {
> +		ret = nvme_rdma_inv_rkey(queue, req);
> +		if (unlikely(ret < 0)) {
> +			dev_err(queue->ctrl->ctrl.device,
> +				"Queueing INV WR for rkey %#x failed (%d)\n",
> +				req->mr->rkey, ret);
> +			nvme_rdma_error_recovery(queue->ctrl);
> +		}
> +		/* the local invalidation completion will end the request */

Either the INV completion or nvme_rdma_error_recovery will end the request...

> +		return 0;
> +	}
>   
>   	if (refcount_dec_and_test(&req->ref)) {
>   		if (rq->tag == tag)
> 
