* [PATCH v2 0/3] Fix request completion holes
@ 2017-11-08 10:06 ` Sagi Grimberg
  0 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-08 10:06 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Christoph Hellwig, Max Gurtuvoy

We have two holes in nvme-rdma when completing a request.

1. We never wait for the send work request to complete before completing
a request. It is possible that the HCA retries a send operation (due
to a dropped ack) after the nvme cqe has already arrived back at the host.
If we unmap the host buffer upon reception of the cqe, the HCA might
get iommu errors when attempting to access the now-unmapped host buffer.
We must also wait for the send completion before completing a request;
most of the time it will arrive before the nvme cqe, so we only pay
for the extra cq entry processing.

2. We don't wait for the request memory region to be fully invalidated
in case the target didn't invalidate remotely. We must wait for the local
invalidation to complete before completing the request.

Note that we might face two concurrent completion processing contexts for
a single request. One is the ib_cq irq-poll context and the second is
blk_mq_poll, which is invoked from IOCB_HIPRI requests. Thus we need the
completion flag updates (send/receive) to be atomic. A new per-request
lock is introduced to guarantee mutual exclusion of the completion
flag updates.

Note that we could have used a per-queue lock for these updates (which
would mean fewer locks, as we have fewer queues than requests), but given
that we access the request in the completion handlers anyway, we might
benefit from keeping the lock local to the request. I'm open to
suggestions though.
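
For clarity, the gating rule boils down to something like the sketch
below (illustration only; the helper name is made up and the actual
checks are open-coded in the patches):

/*
 * Sketch: a request is ended only once the send completion has been
 * seen, the nvme cqe has arrived, and the MR no longer needs
 * invalidation, all checked under the per-request lock.
 */
static void nvme_rdma_maybe_end_request(struct nvme_rdma_request *req)
{
	struct request *rq = blk_mq_rq_from_pdu(req);
	unsigned long flags;
	bool end;

	spin_lock_irqsave(&req->lock, flags);
	end = req->send_completed && req->resp_completed &&
	      !req->mr->need_inval;
	spin_unlock_irqrestore(&req->lock, flags);

	if (end)
		nvme_end_request(rq, req->cqe.status, req->cqe.result);
}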

Changes from v1:
- Added atomic send/resp_completed updates (via a per-request lock)

Sagi Grimberg (3):
  nvme-rdma: don't suppress send completions
  nvme-rdma: don't complete requests before a send work request has
    completed
  nvme-rdma: wait for local invalidation before completing a request

 drivers/nvme/host/rdma.c | 125 ++++++++++++++++++++++++++---------------------
 1 file changed, 70 insertions(+), 55 deletions(-)

-- 
2.14.1

* [PATCH v2 1/3] nvme-rdma: don't suppress send completions
  2017-11-08 10:06 ` Sagi Grimberg
@ 2017-11-08 10:06     ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-08 10:06 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Christoph Hellwig, Max Gurtuvoy

The entire completion suppression mechanism is currently
broken because the HCA might retry a send operation
(due to a dropped ack) after the nvme transaction has completed.

In order to handle this, we signal all send completions (besides the
async event command, which does not race with anything).

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
---
 drivers/nvme/host/rdma.c | 42 ++++--------------------------------------
 1 file changed, 4 insertions(+), 38 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ef7d27b63088..c765f1d20141 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -85,7 +85,6 @@ enum nvme_rdma_queue_flags {
 
 struct nvme_rdma_queue {
 	struct nvme_rdma_qe	*rsp_ring;
-	atomic_t		sig_count;
 	int			queue_size;
 	size_t			cmnd_capsule_len;
 	struct nvme_rdma_ctrl	*ctrl;
@@ -510,7 +509,6 @@ static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
 		queue->cmnd_capsule_len = sizeof(struct nvme_command);
 
 	queue->queue_size = queue_size;
-	atomic_set(&queue->sig_count, 0);
 
 	queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
 			RDMA_PS_TCP, IB_QPT_RC);
@@ -1196,21 +1194,9 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
-/*
- * We want to signal completion at least every queue depth/2.  This returns the
- * largest power of two that is not above half of (queue size + 1) to optimize
- * (avoid divisions).
- */
-static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
-{
-	int limit = 1 << ilog2((queue->queue_size + 1) / 2);
-
-	return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
-}
-
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
-		struct ib_send_wr *first, bool flush)
+		struct ib_send_wr *first)
 {
 	struct ib_send_wr wr, *bad_wr;
 	int ret;
@@ -1226,24 +1212,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	wr.sg_list    = sge;
 	wr.num_sge    = num_sge;
 	wr.opcode     = IB_WR_SEND;
-	wr.send_flags = 0;
-
-	/*
-	 * Unsignalled send completions are another giant desaster in the
-	 * IB Verbs spec:  If we don't regularly post signalled sends
-	 * the send queue will fill up and only a QP reset will rescue us.
-	 * Would have been way to obvious to handle this in hardware or
-	 * at least the RDMA stack..
-	 *
-	 * Always signal the flushes. The magic request used for the flush
-	 * sequencer is not allocated in our driver's tagset and it's
-	 * triggered to be freed by blk_cleanup_queue(). So we need to
-	 * always mark it as signaled to ensure that the "wr_cqe", which is
-	 * embedded in request's payload, is not freed when __ib_process_cq()
-	 * calls wr_cqe->done().
-	 */
-	if (nvme_rdma_queue_sig_limit(queue) || flush)
-		wr.send_flags |= IB_SEND_SIGNALED;
+	wr.send_flags = IB_SEND_SIGNALED;
 
 	if (first)
 		first->next = &wr;
@@ -1317,7 +1286,7 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
 	ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd),
 			DMA_TO_DEVICE);
 
-	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL, false);
+	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL);
 	WARN_ON_ONCE(ret);
 }
 
@@ -1602,7 +1571,6 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_rdma_qe *sqe = &req->sqe;
 	struct nvme_command *c = sqe->data;
-	bool flush = false;
 	struct ib_device *dev;
 	blk_status_t ret;
 	int err;
@@ -1634,10 +1602,8 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ib_dma_sync_single_for_device(dev, sqe->dma,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 
-	if (req_op(rq) == REQ_OP_FLUSH)
-		flush = true;
 	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
-			req->mr->need_inval ? &req->reg_wr.wr : NULL, flush);
+			req->mr->need_inval ? &req->reg_wr.wr : NULL);
 	if (unlikely(err)) {
 		nvme_rdma_unmap_data(queue, rq);
 		goto err;
-- 
2.14.1

* [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-08 10:06 ` Sagi Grimberg
@ 2017-11-08 10:06     ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-08 10:06 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Christoph Hellwig, Max Gurtuvoy

In order to guarantee that the HCA will never get an access violation
(either from an invalidated rkey or from the iommu) when retrying a send
operation, we must complete a request only when both the send completion
and the nvme cqe have arrived. We need to set the send/recv completion
flags atomically because we might have more than a single context
accessing the request concurrently (one is the cq irq-poll context and
the other is the user polling used for IOCB_HIPRI).

Only then is it safe to invalidate the rkey (if needed), unmap
the host buffers, and complete the I/O.

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
---
 drivers/nvme/host/rdma.c | 52 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 44 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index c765f1d20141..998f7cb445d9 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -67,6 +67,10 @@ struct nvme_rdma_request {
 	struct nvme_request	req;
 	struct ib_mr		*mr;
 	struct nvme_rdma_qe	sqe;
+	struct nvme_completion	cqe;
+	spinlock_t		lock;
+	bool			send_completed;
+	bool			resp_completed;
 	struct ib_sge		sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
 	u32			num_sge;
 	int			nents;
@@ -332,6 +336,8 @@ static int nvme_rdma_init_request(struct blk_mq_tag_set *set,
 	struct ib_device *ibdev = dev->dev;
 	int ret;
 
+	spin_lock_init(&req->lock);
+
 	ret = nvme_rdma_alloc_qe(ibdev, &req->sqe, sizeof(struct nvme_command),
 			DMA_TO_DEVICE);
 	if (ret)
@@ -1154,6 +1160,8 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 	req->num_sge = 1;
 	req->inline_data = false;
 	req->mr->need_inval = false;
+	req->send_completed = false;
+	req->resp_completed = false;
 
 	c->common.flags |= NVME_CMD_SGL_METABUF;
 
@@ -1190,13 +1198,30 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 
 static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	if (unlikely(wc->status != IB_WC_SUCCESS))
+	struct nvme_rdma_qe *qe =
+		container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
+	struct nvme_rdma_request *req =
+		container_of(qe, struct nvme_rdma_request, sqe);
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	unsigned long flags;
+	bool end;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		nvme_rdma_wr_error(cq, wc, "SEND");
+		return;
+	}
+
+	spin_lock_irqsave(&req->lock, flags);
+	req->send_completed = true;
+	end = req->resp_completed;
+	spin_unlock_irqrestore(&req->lock, flags);
+	if (end)
+		nvme_end_request(rq, req->cqe.status, req->cqe.result);
 }
 
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
-		struct ib_send_wr *first)
+		struct ib_send_wr *first, bool signal)
 {
 	struct ib_send_wr wr, *bad_wr;
 	int ret;
@@ -1212,7 +1237,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	wr.sg_list    = sge;
 	wr.num_sge    = num_sge;
 	wr.opcode     = IB_WR_SEND;
-	wr.send_flags = IB_SEND_SIGNALED;
+	wr.send_flags = signal ? IB_SEND_SIGNALED : 0;
 
 	if (first)
 		first->next = &wr;
@@ -1286,7 +1311,7 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
 	ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd),
 			DMA_TO_DEVICE);
 
-	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL);
+	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL, false);
 	WARN_ON_ONCE(ret);
 }
 
@@ -1296,6 +1321,8 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	struct request *rq;
 	struct nvme_rdma_request *req;
 	int ret = 0;
+	unsigned long flags;
+	bool end;
 
 	rq = blk_mq_tag_to_rq(nvme_rdma_tagset(queue), cqe->command_id);
 	if (!rq) {
@@ -1307,14 +1334,23 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	}
 	req = blk_mq_rq_to_pdu(rq);
 
-	if (rq->tag == tag)
-		ret = 1;
+	req->cqe.status = cqe->status;
+	req->cqe.result = cqe->result;
 
 	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
 	    wc->ex.invalidate_rkey == req->mr->rkey)
 		req->mr->need_inval = false;
 
-	nvme_end_request(rq, cqe->status, cqe->result);
+	spin_lock_irqsave(&req->lock, flags);
+	req->resp_completed = true;
+	end = req->send_completed;
+	spin_unlock_irqrestore(&req->lock, flags);
+	if (end) {
+		if (rq->tag == tag)
+			ret = 1;
+		nvme_end_request(rq, req->cqe.status, req->cqe.result);
+	}
+
 	return ret;
 }
 
@@ -1603,7 +1639,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 
 	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
-			req->mr->need_inval ? &req->reg_wr.wr : NULL);
+			req->mr->need_inval ? &req->reg_wr.wr : NULL, true);
 	if (unlikely(err)) {
 		nvme_rdma_unmap_data(queue, rq);
 		goto err;
-- 
2.14.1

* [PATCH v2 3/3] nvme-rdma: wait for local invalidation before completing a request
  2017-11-08 10:06 ` Sagi Grimberg
@ 2017-11-08 10:06     ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-08 10:06 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Christoph Hellwig, Max Gurtuvoy

We must not complete a request before the host memory region is
invalidated. Luckily we have send-with-invalidate protocol support,
so we usually don't need to issue a local invalidation ourselves, but
in case the target did not invalidate the memory region for us, we
must wait for the invalidation to complete before unmapping host
memory and completing the I/O.

Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
---
 drivers/nvme/host/rdma.c | 47 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 998f7cb445d9..6ddaa7964657 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1014,8 +1014,24 @@ static void nvme_rdma_memreg_done(struct ib_cq *cq, struct ib_wc *wc)
 
 static void nvme_rdma_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	if (unlikely(wc->status != IB_WC_SUCCESS))
+	struct nvme_rdma_request *req =
+		container_of(wc->wr_cqe, struct nvme_rdma_request, reg_cqe);
+	struct request *rq = blk_mq_rq_from_pdu(req);
+	unsigned long flags;
+	bool end;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		nvme_rdma_wr_error(cq, wc, "LOCAL_INV");
+		return;
+	}
+
+	spin_lock_irqsave(&req->lock, flags);
+	req->mr->need_inval = false;
+	end = (req->resp_completed && req->send_completed);
+	spin_unlock_irqrestore(&req->lock, flags);
+	if (end)
+		nvme_end_request(rq, req->cqe.status, req->cqe.result);
+
 }
 
 static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue,
@@ -1026,7 +1042,7 @@ static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue,
 		.opcode		    = IB_WR_LOCAL_INV,
 		.next		    = NULL,
 		.num_sge	    = 0,
-		.send_flags	    = 0,
+		.send_flags	    = IB_SEND_SIGNALED,
 		.ex.invalidate_rkey = req->mr->rkey,
 	};
 
@@ -1040,24 +1056,12 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
 		struct request *rq)
 {
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
-	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
 	struct nvme_rdma_device *dev = queue->device;
 	struct ib_device *ibdev = dev->dev;
-	int res;
 
 	if (!blk_rq_bytes(rq))
 		return;
 
-	if (req->mr->need_inval && test_bit(NVME_RDMA_Q_LIVE, &req->queue->flags)) {
-		res = nvme_rdma_inv_rkey(queue, req);
-		if (unlikely(res < 0)) {
-			dev_err(ctrl->ctrl.device,
-				"Queueing INV WR for rkey %#x failed (%d)\n",
-				req->mr->rkey, res);
-			nvme_rdma_error_recovery(queue->ctrl);
-		}
-	}
-
 	ib_dma_unmap_sg(ibdev, req->sg_table.sgl,
 			req->nents, rq_data_dir(rq) ==
 				    WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
@@ -1213,7 +1217,7 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 
 	spin_lock_irqsave(&req->lock, flags);
 	req->send_completed = true;
-	end = req->resp_completed;
+	end = req->resp_completed && !req->mr->need_inval;
 	spin_unlock_irqrestore(&req->lock, flags);
 	if (end)
 		nvme_end_request(rq, req->cqe.status, req->cqe.result);
@@ -1338,12 +1342,21 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	req->cqe.result = cqe->result;
 
 	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
-	    wc->ex.invalidate_rkey == req->mr->rkey)
+	    wc->ex.invalidate_rkey == req->mr->rkey) {
 		req->mr->need_inval = false;
+	} else if (req->mr->need_inval) {
+		ret = nvme_rdma_inv_rkey(queue, req);
+		if (unlikely(ret < 0)) {
+			dev_err(queue->ctrl->ctrl.device,
+				"Queueing INV WR for rkey %#x failed (%d)\n",
+				req->mr->rkey, ret);
+			nvme_rdma_error_recovery(queue->ctrl);
+		}
+	}
 
 	spin_lock_irqsave(&req->lock, flags);
 	req->resp_completed = true;
-	end = req->send_completed;
+	end = req->send_completed && !req->mr->need_inval;
 	spin_unlock_irqrestore(&req->lock, flags);
 	if (end) {
 		if (rq->tag == tag)
-- 
2.14.1

* Re: [PATCH v2 1/3] nvme-rdma: don't suppress send completions
  2017-11-08 10:06     ` Sagi Grimberg
@ 2017-11-09  9:18         ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-09  9:18 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

On Wed, Nov 08, 2017 at 12:06:14PM +0200, Sagi Grimberg wrote:
> The entire completions suppress mechanism is currently
> broken because the HCA might retry a send operation
> (due to dropped ack) after the nvme transaction has completed.
> 
> In order to handle this, we signal all send completions (besides
> async event which is not racing anything).

Oh well, so much for all these unsignalled completion optimizations..

So in which cases do unsignalled completions work at all?  Seems like
we need to fix up a lot of other ULPs as well.

Looks fine:

Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>

> -	 */
> -	if (nvme_rdma_queue_sig_limit(queue) || flush)
> -		wr.send_flags |= IB_SEND_SIGNALED;
> +	wr.send_flags = IB_SEND_SIGNALED;

But..  Is there any benefit in just setting IB_SIGNAL_ALL_WR on the QP?
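
i.e. something along these lines at QP creation time (sketch only;
assumes the QP is currently created with IB_SIGNAL_REQ_WR):

	struct ib_qp_init_attr init_attr;

	memset(&init_attr, 0, sizeof(init_attr));
	/* ... other fields set up as today ... */
	init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; /* CQE for every send WR */

	ret = rdma_create_qp(queue->cm_id, dev->pd, &init_attr);

That would make the per-WR IB_SEND_SIGNALED handling unnecessary.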

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-08 10:06     ` Sagi Grimberg
@ 2017-11-09  9:21         ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-09  9:21 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

> +	}
> +
> +	spin_lock_irqsave(&req->lock, flags);
> +	req->send_completed = true;
> +	end = req->resp_completed;
> +	spin_unlock_irqrestore(&req->lock, flags);

Wouldn't it be better to use atomic bitops (test_and_set_bit and co)
for these flags instead of needing an irqsave lock?

>  	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
> -			req->mr->need_inval ? &req->reg_wr.wr : NULL);
> +			req->mr->need_inval ? &req->reg_wr.wr : NULL, true);

Looks like the unsignalled completions just removed in the last patch
are coming back here.

* Re: [PATCH v2 3/3] nvme-rdma: wait for local invalidation before completing a request
  2017-11-08 10:06     ` Sagi Grimberg
@ 2017-11-09  9:39         ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-09  9:39 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Max Gurtuvoy

On Wed, Nov 08, 2017 at 12:06:16PM +0200, Sagi Grimberg wrote:
> We must not complete a request before the host memory region is
> invalidated. Luckily we have send with invalidate protocol support
> so we usually don't need to execute it, but in case the target
> did not invalidate a memory region for us, we must wait for
> the invalidation to complete before unmapping host memory and
> completing the I/O.

Looks good, but a change to atomic bitops would require a respin
here as well.

* Re: [PATCH v2 1/3] nvme-rdma: don't suppress send completions
  2017-11-09  9:18         ` Christoph Hellwig
@ 2017-11-09 11:08             ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-09 11:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy


>> The entire completions suppress mechanism is currently
>> broken because the HCA might retry a send operation
>> (due to dropped ack) after the nvme transaction has completed.
>>
>> In order to handle this, we signal all send completions (besides
>> async event which is not racing anything).
> 
> Oh well, so much for all these unsignalled completion optimizations..
> 
> So in which cases do unsignalled completions work at all?

Probably in cases where no in-capsule data is used; there we're fine
because the command buffer mappings are long-lived. The same goes for
non-RPC-like applications.

> Seems like we need to fix up a lot of other ULPs as well.

Probably only those that support in-capsule data (which I hear
SRP does too these days).

>> -	 */
>> -	if (nvme_rdma_queue_sig_limit(queue) || flush)
>> -		wr.send_flags |= IB_SEND_SIGNALED;
>> +	wr.send_flags = IB_SEND_SIGNALED;
> 
> But..  Is there any benefit in just setting IB_SIGNAL_ALL_WR on the QP?

There is one case where I still don't want to signal the send
completion: the AEN request. It doesn't have a request structure, I'd
prefer not to check for it on every send completion, and it's not racing
anything (as mentioned above).

I can change that though if there is a strong desire.

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-09  9:21         ` Christoph Hellwig
@ 2017-11-09 11:14             ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-09 11:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy



On 11/09/2017 11:21 AM, Christoph Hellwig wrote:
>> +	}
>> +
>> +	spin_lock_irqsave(&req->lock, flags);
>> +	req->send_completed = true;
>> +	end = req->resp_completed;
>> +	spin_unlock_irqrestore(&req->lock, flags);
> 
> Wouldn't it be better to use atomic bitops (test_and_set_bit and co)
> for these flags instead of needing a irqsave lock?

Here it would be better, but in the next patch, where we also need to
check for MR invalidation atomically, I'm not sure it helps.
The logic is:
1. on RECV: complete if (send completed and MR invalidated)
2. on SEND: complete if (recv completed and MR invalidated)
3. on INV: complete if (recv completed and send completed)

We need the conditions to be atomic because the CQ can be processed
from two contexts (and trust me, it falls apart fast if we don't protect
it). Not sure I understand how I can achieve this with atomic bitops.

We could relax the spinlock to _bh I think as we are only racing with
irq-poll.
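
(A per-request refcount could also express this without the lock;
sketch only, the req->ref field does not exist in this series:)

/*
 * Each request would start with two references: one for the send
 * completion and one for the response path.  If the response path has
 * to post a LOCAL_INV, it keeps its reference and the LOCAL_INV
 * completion drops it instead, so the last event always ends the
 * request.
 */
static void nvme_rdma_put_request(struct nvme_rdma_request *req)
{
	struct request *rq = blk_mq_rq_from_pdu(req);

	if (refcount_dec_and_test(&req->ref))
		nvme_end_request(rq, req->cqe.status, req->cqe.result);
}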

>>   	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
>> -			req->mr->need_inval ? &req->reg_wr.wr : NULL);
>> +			req->mr->need_inval ? &req->reg_wr.wr : NULL, true);
> 
> Looks like the unsignalled completions just removed in the last patch
> are coming back here.

That's because of the async event send; not sure why it's here though,
I'll check.

* Re: [PATCH v2 0/3] Fix request completion holes
  2017-11-08 10:06 ` Sagi Grimberg
@ 2017-11-13 22:10     ` Doug Ledford
  -1 siblings, 0 replies; 50+ messages in thread
From: Doug Ledford @ 2017-11-13 22:10 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Christoph Hellwig, Max Gurtuvoy

On Wed, 2017-11-08 at 12:06 +0200, Sagi Grimberg wrote:
> We have two holes in nvme-rdma when completing request.
> 
> 1. We never wait for send work request to complete before completing
> a request. It is possible that the HCA retries a send operation (due
> to dropped ack) after the nvme cqe has already arrived back to the host.
> If we unmap the host buffer upon reception of the cqe, the HCA might
> get iommu errors when attempting to access an unmapped host buffer.
> We must wait also for the send completion before completing a request,
> most of the time it will be before the nvme cqe has arrived back so
> we pay only for the extra cq entry processing.
> 
> 2. We don't wait for the request memory region to be fully invalidated
> in case the target didn't invalidate remotely. We must wait for the local
> invalidation to complete before completing the request.
> 
> Note that we might face two concurrent completion processing contexts for
> a single request. One is the ib_cq irq-poll context and the second is
> blk_mq_poll which is invoked from IOCB_HIPRI requests. Thus we need the
> completion flags updates (send/receive) to be atomic. A new request
> lock is introduced to guarantee the mutual exclusion of the completion
> flags updates.
> 
> Note that we could have used a per-queue lock for these updates (which
> would have generated less locks as we have less queues), but given that
> we access the request in the completion handlers we might benefit by having
> the lock local in the request. I'm open to suggestions though.
> 
> Changes from v1:
> - Added atomic send/resp_completed updated (via per-request lock)
> 
> Sagi Grimberg (3):
>   nvme-rdma: don't suppress send completions
>   nvme-rdma: don't complete requests before a send work request has
>     completed
>   nvme-rdma: wait for local invalidation before completing a request
> 
>  drivers/nvme/host/rdma.c | 125 ++++++++++++++++++++++++++---------------------
>  1 file changed, 70 insertions(+), 55 deletions(-)
> 

Sagi, are you ready for me to take this series in?  It seemed like there
was a question as to whether you might want to try atomics instead of
spin locks, or do you want to stick with spinlocks?

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

* Re: [PATCH v2 0/3] Fix request completion holes
  2017-11-13 22:10     ` Doug Ledford
@ 2017-11-16 15:39         ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-16 15:39 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Christoph Hellwig, Max Gurtuvoy


> Sagi, are you ready for me to take this series in?  It seemed like there
> was a question as to whether you might want to try atomics instead of
> spin locks, or do you want to stick with spinlocks?

I still need a review for this, and I think we'll get it in through the
nvme tree (via the block tree). Thanks.

* Re: [PATCH v2 0/3] Fix request completion holes
  2017-11-16 15:39         ` Sagi Grimberg
@ 2017-11-16 15:58             ` Doug Ledford
  -1 siblings, 0 replies; 50+ messages in thread
From: Doug Ledford @ 2017-11-16 15:58 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Christoph Hellwig, Max Gurtuvoy

On Thu, 2017-11-16 at 17:39 +0200, Sagi Grimberg wrote:
> > Sagi, are you ready for me to take this series in?  It seemed like there
> > was a question as to whether you might want to try atomics instead of
> > spin locks, or do you want to stick with spinlocks?
> 
> I still need a review for this, and I think we'll get it in through the
> nvme tree (via the block tree). Thanks.

Sorry, I didn't pay close enough attention to the files patched in each
one.  I just saw rdma in each subject line and assumed it was an
rdma-destined patch series.  Looking closer, I see it was really just
Cc:ed to linux-rdma for review.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

* Re: [PATCH v2 0/3] Fix request completion holes
  2017-11-16 15:39         ` Sagi Grimberg
@ 2017-11-20  7:37             ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-20  7:37 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy,
	Christoph Hellwig

On Thu, Nov 16, 2017 at 05:39:55PM +0200, Sagi Grimberg wrote:
> 
> > Sagi, are you ready for me to take this series in?  It seemed like there
> > was a question as to whether you might want to try atomics instead of
> > spin locks, or do you want to stick with spinlocks?
> 
> I still need a review for this, and I think we'll get it in through the
> nvme tree (via the block tree). Thanks.

I've done an initial review pass, but I'd really like to see someone
who has more RDMA experience take a look as well.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 1/3] nvme-rdma: don't suppress send completions
  2017-11-09 11:08             ` Sagi Grimberg
@ 2017-11-20  8:18                 ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-20  8:18 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy

On Thu, Nov 09, 2017 at 01:08:04PM +0200, Sagi Grimberg wrote:
> Probably only those that support in-capsule data (which I hear
> SRP does too these days).

SRP-2 adds inline data, and Bart's out-of-tree code has been supporting
that for a long time.

> There is one case where I still don't want to signal send completion: the
> AEN request. It doesn't have a request structure, and I preferred not
> to check for it on every send completion, and it's not racing anything
> (mentioned this above).
>
> I can change that though if there is a strong desire.

I don't really like having a special case just for this slow path
special case.  So if we can avoid it without too much overhead let's
do it, otherwise we can keep it as-is.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-09 11:14             ` Sagi Grimberg
@ 2017-11-20  8:31                 ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-20  8:31 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy

On Thu, Nov 09, 2017 at 01:14:23PM +0200, Sagi Grimberg wrote:
>> Wouldn't it be better to use atomic bitops (test_and_set_bit and co)
>> for these flags instead of needing a irqsave lock?
>
> Here it will be better, but in the next patch, where we need to check
> also for MR invalidation atomically I don't know.
>
> The logic is:
> 1. on RECV: complete if (send completed and MR invalidated)
> 2. on SEND: complete if (recv completed and MR invalidated)
> 3. on INV: complete if (recv completed and send completed)
>
> We need the conditions to be atomic because the CQ can be processed
> from two contexts (and trust me, it falls apart fast if we don't protect
> it). Not sure I understand how I can achieve this with atomic bitops.
>
> We could relax the spinlock to _bh I think as we are only racing with
> irq-poll.

Both send and recv completed only ever go from not set to set on a live
request, so that's easy.  need_inval only goes from set to cleared
so I'm not too worried about it either, but due to the lack of atomic
bitops there it will need better memory barriers, or just a switch
to bitops.

But looking at this in a little more detail I wonder if we just want
a refcount on the nvme_rdma_request; untested patch below.  With a little
more work we can probably also remove the need_inval flag, but I think
I want to wait for the mr pool patch from Israel for that.

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ef3373e56602..5627d81735d2 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -60,9 +60,7 @@ struct nvme_rdma_request {
 	struct ib_mr		*mr;
 	struct nvme_rdma_qe	sqe;
 	struct nvme_completion	cqe;
-	spinlock_t		lock;
-	bool			send_completed;
-	bool			resp_completed;
+	atomic_t		ref;
 	struct ib_sge		sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
 	u32			num_sge;
 	int			nents;
@@ -315,8 +313,6 @@ static int nvme_rdma_init_request(struct blk_mq_tag_set *set,
 	struct ib_device *ibdev = dev->dev;
 	int ret;
 
-	spin_lock_init(&req->lock);
-
 	ret = nvme_rdma_alloc_qe(ibdev, &req->sqe, sizeof(struct nvme_command),
 			DMA_TO_DEVICE);
 	if (ret)
@@ -1025,19 +1021,13 @@ static void nvme_rdma_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
 	struct nvme_rdma_request *req =
 		container_of(wc->wr_cqe, struct nvme_rdma_request, reg_cqe);
 	struct request *rq = blk_mq_rq_from_pdu(req);
-	unsigned long flags;
-	bool end;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		nvme_rdma_wr_error(cq, wc, "LOCAL_INV");
 		return;
 	}
 
-	spin_lock_irqsave(&req->lock, flags);
-	req->mr->need_inval = false;
-	end = (req->resp_completed && req->send_completed);
-	spin_unlock_irqrestore(&req->lock, flags);
-	if (end)
+	if (atomic_dec_and_test(&req->ref))
 		nvme_end_request(rq, req->cqe.status, req->cqe.result);
 
 }
@@ -1151,6 +1141,7 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
 			     IB_ACCESS_REMOTE_WRITE;
 
 	req->mr->need_inval = true;
+	atomic_inc(&req->ref);
 
 	sg->addr = cpu_to_le64(req->mr->iova);
 	put_unaligned_le24(req->mr->length, sg->length);
@@ -1172,8 +1163,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 	req->num_sge = 1;
 	req->inline_data = false;
 	req->mr->need_inval = false;
-	req->send_completed = false;
-	req->resp_completed = false;
+	atomic_set(&req->ref, 2);
 
 	c->common.flags |= NVME_CMD_SGL_METABUF;
 
@@ -1215,19 +1205,13 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 	struct nvme_rdma_request *req =
 		container_of(qe, struct nvme_rdma_request, sqe);
 	struct request *rq = blk_mq_rq_from_pdu(req);
-	unsigned long flags;
-	bool end;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		nvme_rdma_wr_error(cq, wc, "SEND");
 		return;
 	}
 
-	spin_lock_irqsave(&req->lock, flags);
-	req->send_completed = true;
-	end = req->resp_completed && !req->mr->need_inval;
-	spin_unlock_irqrestore(&req->lock, flags);
-	if (end)
+	if (atomic_dec_and_test(&req->ref))
 		nvme_end_request(rq, req->cqe.status, req->cqe.result);
 }
 
@@ -1330,8 +1314,6 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	struct request *rq;
 	struct nvme_rdma_request *req;
 	int ret = 0;
-	unsigned long flags;
-	bool end;
 
 	rq = blk_mq_tag_to_rq(nvme_rdma_tagset(queue), cqe->command_id);
 	if (!rq) {
@@ -1348,7 +1330,7 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 
 	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
 	    wc->ex.invalidate_rkey == req->mr->rkey) {
-		req->mr->need_inval = false;
+		atomic_dec(&req->ref);
 	} else if (req->mr->need_inval) {
 		ret = nvme_rdma_inv_rkey(queue, req);
 		if (unlikely(ret < 0)) {
@@ -1359,11 +1341,7 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 		}
 	}
 
-	spin_lock_irqsave(&req->lock, flags);
-	req->resp_completed = true;
-	end = req->send_completed && !req->mr->need_inval;
-	spin_unlock_irqrestore(&req->lock, flags);
-	if (end) {
+	if (atomic_dec_and_test(&req->ref)) {
 		if (rq->tag == tag)
 			ret = 1;
 		nvme_end_request(rq, req->cqe.status, req->cqe.result);
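
To spell out the counting in that sketch (untested, so take it as the
intent rather than the final form):

	/*
	 * nvme_rdma_map_data()    atomic_set(&req->ref, 2)  send + recv refs
	 * nvme_rdma_map_sg_fr()   atomic_inc(&req->ref)     MR invalidation ref
	 *
	 * nvme_rdma_send_done()         drops the send ref
	 * nvme_rdma_process_nvme_rsp()  drops the recv ref, plus the
	 *                               invalidation ref when the target
	 *                               invalidated remotely
	 * nvme_rdma_inv_rkey_done()     drops the invalidation ref after a
	 *                               local invalidation
	 *
	 * Whoever drops the last ref calls nvme_end_request().
	 */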

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 1/3] nvme-rdma: don't suppress send completions
  2017-11-20  8:18                 ` Christoph Hellwig
@ 2017-11-20  8:33                     ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-20  8:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy


>> There is one case where I still don't want to signal send completion: the
>> AEN request. It doesn't have a request structure, and I preferred not
>> to check for it on every send completion, and it's not racing anything
>> (mentioned this above).
>>
>> I can change that though if there is a strong desire.
> 
> I don't really like having a special case just for this slow path
> special case.  So if we can avoid it without too much overhead let's
> do it, otherwise we can keep it as-is.

Saving the state of the request completions adds complication in
general, and we don't even have a request for AENs, so it would mean
keeping that state in the queue; and we don't really race anything there
because AENs don't carry inline data. So I think it's simpler to keep it
as is.

Sending a v3 that relaxes the req->lock to spin_lock_bh.
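
For illustration, the relaxation in the send completion path would look
roughly like this (a sketch against the v2 code quoted in Christoph's
diff, assuming the per-request lock is kept rather than the refcount
discussed in the other subthread):

	spin_lock_bh(&req->lock);
	req->send_completed = true;
	end = req->resp_completed && !req->mr->need_inval;
	spin_unlock_bh(&req->lock);
	if (end)
		nvme_end_request(rq, req->cqe.status, req->cqe.result);

The response and local invalidation handlers would be relaxed the same
way, since they only race with the irq-poll softirq context, not with
hard irq context.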

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 0/3] Fix request completion holes
  2017-11-20  7:37             ` Christoph Hellwig
@ 2017-11-20  8:33                 ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-20  8:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy,
	Christoph Hellwig


>>> Sagi, are you ready for me to take this series in?  It seemed like there
>>> was a question as to whether you might want to try atomics instead of
>>> spin locks, or do you want to stick with spinlocks?
>>
>> I still need a review for this, and I think we'll get it in through the
>> nvme tree (via the block tree). Thanks.
> 
> I've done an initial review pass, but I'd really like to see someone
> who has more RDMA experience take a look as well.

Perhaps Max can have a look?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20  8:31                 ` Christoph Hellwig
@ 2017-11-20  8:37                     ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-20  8:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy


> Both send and recv completed only ever go from not set to set on a live
> request, so that's easy.  need_inval only goes from set to cleared
> so I'm not too worried about it either, but due to the lack of atomic
> bitops there it will need better memory barriers, or just a switch
> to bitops.
>
> But looking at this in a little more detail I wonder if we just want
> a refcount on the nvme_rdma_request; untested patch below.

I'll give it a shot.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20  8:37                     ` Sagi Grimberg
@ 2017-11-20  8:41                         ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-20  8:41 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy

On Mon, Nov 20, 2017 at 10:37:54AM +0200, Sagi Grimberg wrote:
>
>> Both send and recv completed only ever go from not set to set on a live
>> request, so that's easy.  need_inval only goes from set to cleared
>> so I'm not too worried about it either, but due to the lack of atomic
>> bitops there it will need better memory barriers, or just a switch
>> to bitops.
>>
>> But looking at this in a little more detail I wonder if we just want
>> a refcount on the nvme_rdma_request; untested patch below.
>
> I'll give it a shot.

Btw, in our brave new world that should probably be a refcount_t instead
of an atomic_t, but that's a trivial search and replace.
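
Roughly, the search and replace would be (assuming the field becomes a
refcount_t named ref, and nothing else changes):

	atomic_t ref;                   ->  refcount_t ref;
	atomic_set(&req->ref, 2);       ->  refcount_set(&req->ref, 2);
	atomic_inc(&req->ref);          ->  refcount_inc(&req->ref);
	atomic_dec(&req->ref);          ->  refcount_dec(&req->ref);
	atomic_dec_and_test(&req->ref)  ->  refcount_dec_and_test(&req->ref)

plus an include of <linux/refcount.h>, with the benefit of the over- and
underflow checking that plain atomic_t doesn't have.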

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20  8:41                         ` Christoph Hellwig
@ 2017-11-20  9:04                             ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-20  9:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy


> Btw, in our brave new world that should probably be a refcount_t instead
> of an atomic_t, but that's a trivial search and replace.

Yea, will change that.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20  8:41                         ` Christoph Hellwig
@ 2017-11-20  9:28                             ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-20  9:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy

>>> Both send and recv completed only ever go from not set to set on a live
>>> request, so that's easy.  need_inval only goes from set to cleared
>>> so I'm not too worried about it either, but due to the lack of atomic
>>> bitops there it will need better memory barriers, or just a switch
>>> to bitops.
>>>
>>> But looking at this in a little more detail I wonder if we just want
>>> a refcount on the nvme_rdma_request; untested patch below.
>>
>> I'll give it a shot.
> 
> Btw, in our brave new world that should probably be a refcount_t instead
> of an atomic_t, but that's a trivial search and replace.

Works like a charm :)

Sending a newer v3 soon.
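
For reference, the converted send completion handler ends up looking
roughly like this (a sketch of the direction, not necessarily the exact
v3 code):

	static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
	{
		struct nvme_rdma_qe *qe =
			container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
		struct nvme_rdma_request *req =
			container_of(qe, struct nvme_rdma_request, sqe);
		struct request *rq = blk_mq_rq_from_pdu(req);

		if (unlikely(wc->status != IB_WC_SUCCESS)) {
			nvme_rdma_wr_error(cq, wc, "SEND");
			return;
		}

		/* drop the send reference; the last one ends the request */
		if (refcount_dec_and_test(&req->ref))
			nvme_end_request(rq, req->cqe.status, req->cqe.result);
	}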

Thanks Christoph!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 1/3] nvme-rdma: don't suppress send completions
  2017-11-20  8:33                     ` Sagi Grimberg
@ 2017-11-20  9:32                         ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-20  9:32 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy

On Mon, Nov 20, 2017 at 10:33:02AM +0200, Sagi Grimberg wrote:
>> I don't really like having a special case just for this slow path
>> special case.  So if we can avoid it without too much overhead let's
>> do it, otherwise we can keep it as-is.
>
> Saving the state of the request completions adds complication in
> general, and we don't even have a request for AENs so it would mean to
> keep it under the queue, and we don't really race anything because we
> don't have inline data there. So I think its simpler to keep it as is.

Ok, let's keep it.  But please add a comment explaining why the
non-signalled completions are fine for the AER but not for anything else.
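
Something along these lines, presumably next to the unsignalled AEN send
(exact wording and placement are just a suggestion):

	/*
	 * AENs are not backed by a struct request and carry no inline
	 * data, so there is nothing to unmap or invalidate when the nvme
	 * completion arrives.  Leaving the AEN send unsignalled is
	 * therefore safe, unlike normal requests, which must wait for
	 * their send completion before being ended.
	 */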

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20  9:28                             ` Sagi Grimberg
@ 2017-11-20 10:49                                 ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-20 10:49 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy

Btw, I think we can avoid 2 atomic ops for the remote invalidation
path with a simple update like the one below:


diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5627d81735d2..2032cd8ad431 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1141,7 +1141,6 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
 			     IB_ACCESS_REMOTE_WRITE;
 
 	req->mr->need_inval = true;
-	atomic_inc(&req->ref);
 
 	sg->addr = cpu_to_le64(req->mr->iova);
 	put_unaligned_le24(req->mr->length, sg->length);
@@ -1328,10 +1327,9 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	req->cqe.status = cqe->status;
 	req->cqe.result = cqe->result;
 
-	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
-	    wc->ex.invalidate_rkey == req->mr->rkey) {
-		atomic_dec(&req->ref);
-	} else if (req->mr->need_inval) {
+	if (req->mr->need_inval &&
+	    (!(wc->wc_flags & IB_WC_WITH_INVALIDATE) ||
+	     wc->ex.invalidate_rkey != req->mr->rkey)) {
 		ret = nvme_rdma_inv_rkey(queue, req);
 		if (unlikely(ret < 0)) {
 			dev_err(queue->ctrl->ctrl.device,
@@ -1339,12 +1337,12 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 				req->mr->rkey, ret);
 			nvme_rdma_error_recovery(queue->ctrl);
 		}
-	}
-
-	if (atomic_dec_and_test(&req->ref)) {
-		if (rq->tag == tag)
-			ret = 1;
-		nvme_end_request(rq, req->cqe.status, req->cqe.result);
+	} else {
+		if (atomic_dec_and_test(&req->ref)) {
+			if (rq->tag == tag)
+				ret = 1;
+			nvme_end_request(rq, req->cqe.status, req->cqe.result);
+		}
 	}
 
 	return ret;

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20 10:49                                 ` Christoph Hellwig
@ 2017-11-20 11:12                                     ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-11-20 11:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy


> -	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
> -	    wc->ex.invalidate_rkey == req->mr->rkey) {
> -		atomic_dec(&req->ref);
> -	} else if (req->mr->need_inval) {
> +	if (req->mr->need_inval &&
> +	    (!(wc->wc_flags & IB_WC_WITH_INVALIDATE) ||
> +	     wc->ex.invalidate_rkey != req->mr->rkey)) {

I've been meaning to change this path because getting a remote
invalidation on a bogus MR is a protocol error and we want to
detect it regardless of req->mr->need_inval.

However, I do agree that we can save an atomic if we bail out
after we queue a local invalidate:

How about this (should be split into two patches):
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index a3802de0f248..d0a5a397ad66 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1148,8 +1148,6 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
         sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
                         NVME_SGL_FMT_INVALIDATE;

-       refcount_inc(&req->ref);
-
         return 0;
  }

@@ -1335,10 +1333,14 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
         req->cqe.status = cqe->status;
         req->cqe.result = cqe->result;

-       if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
-           wc->ex.invalidate_rkey == req->mr->rkey) {
+       if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
+               if (unlikely(wc->ex.invalidate_rkey == req->mr->rkey)) {
+                       dev_err(queue->ctrl->ctrl.device,
+                               "Bogus remote invalidation for rkey %#x\n",
+                               req->mr->rkey);
+                       nvme_rdma_error_recovery(queue->ctrl);
+               }
                 req->mr->need_inval = false;
-               refcount_dec(&req->ref);
         } else if (req->mr->need_inval) {
                 ret = nvme_rdma_inv_rkey(queue, req);
                 if (unlikely(ret < 0)) {
@@ -1347,6 +1349,8 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
                                 req->mr->rkey, ret);
                         nvme_rdma_error_recovery(queue->ctrl);
                 }
+               /* return here as we can't really end the request */
+               return 0;
         }

         if (refcount_dec_and_test(&req->ref)) {
--

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed
  2017-11-20 11:12                                     ` Sagi Grimberg
@ 2017-11-20 11:16                                         ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2017-11-20 11:16 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Max Gurtuvoy

> +       if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
> +               if (unlikely(wc->ex.invalidate_rkey == req->mr->rkey)) {

The == here should be a != I think.

> +                       dev_err(queue->ctrl->ctrl.device,
> +                               "Bogus remote invalidation for rkey %#x\n",
> +                               req->mr->rkey);
> +                       nvme_rdma_error_recovery(queue->ctrl);
> +               }
>                 req->mr->need_inval = false;
> -               refcount_dec(&req->ref);
>         } else if (req->mr->need_inval) {
>                 ret = nvme_rdma_inv_rkey(queue, req);
>                 if (unlikely(ret < 0)) {
> @@ -1347,6 +1349,8 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
>                                 req->mr->rkey, ret);
>                         nvme_rdma_error_recovery(queue->ctrl);
>                 }
> +               /* return here as we can't really end the request */
> +               return 0;

Maybe replace the comment with

		/* the local invalidation completion will end the request */

Otherwise this looks fine to me.
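
Folding both of those comments in, the end state of that hunk would look
roughly like this (a sketch only, with the error logging in the local
invalidation branch trimmed; the actual v3 may differ):

	if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
		if (unlikely(wc->ex.invalidate_rkey != req->mr->rkey)) {
			dev_err(queue->ctrl->ctrl.device,
				"Bogus remote invalidation for rkey %#x\n",
				req->mr->rkey);
			nvme_rdma_error_recovery(queue->ctrl);
		}
		req->mr->need_inval = false;
	} else if (req->mr->need_inval) {
		if (unlikely(nvme_rdma_inv_rkey(queue, req) < 0))
			nvme_rdma_error_recovery(queue->ctrl);
		/* the local invalidation completion will end the request */
		return 0;
	}

	if (refcount_dec_and_test(&req->ref)) {
		if (rq->tag == tag)
			ret = 1;
		nvme_end_request(rq, req->cqe.status, req->cqe.result);
	}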

^ permalink raw reply	[flat|nested] 50+ messages in thread

Thread overview:
2017-11-08 10:06 [PATCH v2 0/3] Fix request completion holes Sagi Grimberg
2017-11-08 10:06 ` [PATCH v2 1/3] nvme-rdma: don't suppress send completions Sagi Grimberg
2017-11-09  9:18   ` Christoph Hellwig
2017-11-09 11:08     ` Sagi Grimberg
2017-11-20  8:18       ` Christoph Hellwig
2017-11-20  8:33         ` Sagi Grimberg
2017-11-20  9:32           ` Christoph Hellwig
2017-11-08 10:06 ` [PATCH v2 2/3] nvme-rdma: don't complete requests before a send work request has completed Sagi Grimberg
2017-11-09  9:21   ` Christoph Hellwig
2017-11-09 11:14     ` Sagi Grimberg
2017-11-20  8:31       ` Christoph Hellwig
2017-11-20  8:37         ` Sagi Grimberg
2017-11-20  8:41           ` Christoph Hellwig
2017-11-20  9:04             ` Sagi Grimberg
2017-11-20  9:28             ` Sagi Grimberg
2017-11-20 10:49               ` Christoph Hellwig
2017-11-20 11:12                 ` Sagi Grimberg
2017-11-20 11:16                   ` Christoph Hellwig
2017-11-08 10:06 ` [PATCH v2 3/3] nvme-rdma: wait for local invalidation before completing a request Sagi Grimberg
2017-11-09  9:39   ` Christoph Hellwig
2017-11-13 22:10 ` [PATCH v2 0/3] Fix request completion holes Doug Ledford
2017-11-16 15:39   ` Sagi Grimberg
2017-11-16 15:58     ` Doug Ledford
2017-11-20  7:37     ` Christoph Hellwig
2017-11-20  8:33       ` Sagi Grimberg
