linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET v4 0/4] Add support for list issue
@ 2021-12-16 16:05 Jens Axboe
  2021-12-16 16:05 ` [PATCH 1/4] block: add mq_ops->queue_rqs hook Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Jens Axboe @ 2021-12-16 16:05 UTC (permalink / raw)
  To: io-uring, linux-block, linux-nvme

With the support in 5.16-rc1 for allocating and completing batches of
IO, the one missing piece is passing down a list of requests for issue.
Drivers can take advantage of this by defining an mq_ops->queue_rqs()
hook.

This implements it for NVMe, allowing copy of multiple commands in one
swoop.

This is good for around a 500K IOPS/core improvement in my testing,
which is around a 5-6% improvement in efficiency.

Note to Christoph - I kept the copy helper, since it's used in 3
spots and I _think_ you ended up being fine with that...

Changes since v3:
- Use nvme_sq_copy_cmd() in nvme_submit_cmds()
- Add reviewed-by's

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/4] block: add mq_ops->queue_rqs hook
  2021-12-16 16:05 [PATCHSET v4 0/4] Add support for list issue Jens Axboe
@ 2021-12-16 16:05 ` Jens Axboe
  2021-12-16 16:05 ` [PATCH 2/4] nvme: split command copy into a helper Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-12-16 16:05 UTC (permalink / raw)
  To: io-uring, linux-block, linux-nvme; +Cc: Jens Axboe, Christoph Hellwig

If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this, if not the usual one-by-one path is used.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq.c         | 26 +++++++++++++++++++++++---
 include/linux/blk-mq.h |  8 ++++++++
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75154cc788db..51991232824a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2553,6 +2553,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
 	struct blk_mq_hw_ctx *this_hctx;
 	struct blk_mq_ctx *this_ctx;
+	struct request *rq;
 	unsigned int depth;
 	LIST_HEAD(list);
 
@@ -2561,7 +2562,28 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	plug->rq_count = 0;
 
 	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
-		struct request_queue *q = rq_list_peek(&plug->mq_list)->q;
+		struct request_queue *q;
+
+		rq = rq_list_peek(&plug->mq_list);
+		q = rq->q;
+
+		/*
+		 * Peek first request and see if we have a ->queue_rqs() hook.
+		 * If we do, we can dispatch the whole plug list in one go. We
+		 * already know at this point that all requests belong to the
+		 * same queue, caller must ensure that's the case.
+		 *
+		 * Since we pass off the full list to the driver at this point,
+		 * we do not increment the active request count for the queue.
+		 * Bypass shared tags for now because of that.
+		 */
+		if (q->mq_ops->queue_rqs &&
+		    !(rq->mq_hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+			blk_mq_run_dispatch_ops(q,
+				q->mq_ops->queue_rqs(&plug->mq_list));
+			if (rq_list_empty(plug->mq_list))
+				return;
+		}
 
 		blk_mq_run_dispatch_ops(q,
 				blk_mq_plug_issue_direct(plug, false));
@@ -2573,8 +2595,6 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	this_ctx = NULL;
 	depth = 0;
 	do {
-		struct request *rq;
-
 		rq = rq_list_pop(&plug->mq_list);
 
 		if (!this_hctx) {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 772f8f921526..550996cf419c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -492,6 +492,14 @@ struct blk_mq_ops {
 	 */
 	void (*commit_rqs)(struct blk_mq_hw_ctx *);
 
+	/**
+	 * @queue_rqs: Queue a list of new requests. Driver is guaranteed
+	 * that each request belongs to the same queue. If the driver doesn't
+	 * empty the @rqlist completely, then the rest will be queued
+	 * individually by the block layer upon return.
+	 */
+	void (*queue_rqs)(struct request **rqlist);
+
 	/**
 	 * @get_budget: Reserve budget before queue request, once .queue_rq is
 	 * run, it is driver's responsibility to release the
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/4] nvme: split command copy into a helper
  2021-12-16 16:05 [PATCHSET v4 0/4] Add support for list issue Jens Axboe
  2021-12-16 16:05 ` [PATCH 1/4] block: add mq_ops->queue_rqs hook Jens Axboe
@ 2021-12-16 16:05 ` Jens Axboe
  2021-12-16 16:05 ` [PATCH 3/4] nvme: separate command prep and issue Jens Axboe
  2021-12-16 16:05 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
  3 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-12-16 16:05 UTC (permalink / raw)
  To: io-uring, linux-block, linux-nvme
  Cc: Jens Axboe, Chaitanya Kulkarni, Hannes Reinecke, Max Gurtovoy

We'll need it for batched submit as well.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8637538f3fd5..09ea21f75439 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -500,6 +500,15 @@ static inline void nvme_write_sq_db(struct nvme_queue *nvmeq, bool write_sq)
 	nvmeq->last_sq_tail = nvmeq->sq_tail;
 }
 
+static inline void nvme_sq_copy_cmd(struct nvme_queue *nvmeq,
+				    struct nvme_command *cmd)
+{
+	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes), cmd,
+		sizeof(*cmd));
+	if (++nvmeq->sq_tail == nvmeq->q_depth)
+		nvmeq->sq_tail = 0;
+}
+
 /**
  * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
  * @nvmeq: The queue to use
@@ -510,10 +519,7 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
 			    bool write_sq)
 {
 	spin_lock(&nvmeq->sq_lock);
-	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
-	       cmd, sizeof(*cmd));
-	if (++nvmeq->sq_tail == nvmeq->q_depth)
-		nvmeq->sq_tail = 0;
+	nvme_sq_copy_cmd(nvmeq, cmd);
 	nvme_write_sq_db(nvmeq, write_sq);
 	spin_unlock(&nvmeq->sq_lock);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/4] nvme: separate command prep and issue
  2021-12-16 16:05 [PATCHSET v4 0/4] Add support for list issue Jens Axboe
  2021-12-16 16:05 ` [PATCH 1/4] block: add mq_ops->queue_rqs hook Jens Axboe
  2021-12-16 16:05 ` [PATCH 2/4] nvme: split command copy into a helper Jens Axboe
@ 2021-12-16 16:05 ` Jens Axboe
  2021-12-16 16:05 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
  3 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-12-16 16:05 UTC (permalink / raw)
  To: io-uring, linux-block, linux-nvme
  Cc: Jens Axboe, Hannes Reinecke, Christoph Hellwig

Add a nvme_prep_rq() helper to setup a command, and nvme_queue_rq() is
adapted to use this helper.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 57 ++++++++++++++++++++++++-----------------
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 09ea21f75439..6be6b1ab4285 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -918,52 +918,32 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
 	return BLK_STS_OK;
 }
 
-/*
- * NOTE: ns is NULL when called on the admin queue.
- */
-static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
-			 const struct blk_mq_queue_data *bd)
+static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 {
-	struct nvme_ns *ns = hctx->queue->queuedata;
-	struct nvme_queue *nvmeq = hctx->driver_data;
-	struct nvme_dev *dev = nvmeq->dev;
-	struct request *req = bd->rq;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct nvme_command *cmnd = &iod->cmd;
 	blk_status_t ret;
 
 	iod->aborted = 0;
 	iod->npages = -1;
 	iod->nents = 0;
 
-	/*
-	 * We should not need to do this, but we're still using this to
-	 * ensure we can drain requests on a dying queue.
-	 */
-	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
-		return BLK_STS_IOERR;
-
-	if (!nvme_check_ready(&dev->ctrl, req, true))
-		return nvme_fail_nonready_command(&dev->ctrl, req);
-
-	ret = nvme_setup_cmd(ns, req);
+	ret = nvme_setup_cmd(req->q->queuedata, req);
 	if (ret)
 		return ret;
 
 	if (blk_rq_nr_phys_segments(req)) {
-		ret = nvme_map_data(dev, req, cmnd);
+		ret = nvme_map_data(dev, req, &iod->cmd);
 		if (ret)
 			goto out_free_cmd;
 	}
 
 	if (blk_integrity_rq(req)) {
-		ret = nvme_map_metadata(dev, req, cmnd);
+		ret = nvme_map_metadata(dev, req, &iod->cmd);
 		if (ret)
 			goto out_unmap_data;
 	}
 
 	blk_mq_start_request(req);
-	nvme_submit_cmd(nvmeq, cmnd, bd->last);
 	return BLK_STS_OK;
 out_unmap_data:
 	nvme_unmap_data(dev, req);
@@ -972,6 +952,35 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+/*
+ * NOTE: ns is NULL when called on the admin queue.
+ */
+static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
+			 const struct blk_mq_queue_data *bd)
+{
+	struct nvme_queue *nvmeq = hctx->driver_data;
+	struct nvme_dev *dev = nvmeq->dev;
+	struct request *req = bd->rq;
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	blk_status_t ret;
+
+	/*
+	 * We should not need to do this, but we're still using this to
+	 * ensure we can drain requests on a dying queue.
+	 */
+	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
+		return BLK_STS_IOERR;
+
+	if (unlikely(!nvme_check_ready(&dev->ctrl, req, true)))
+		return nvme_fail_nonready_command(&dev->ctrl, req);
+
+	ret = nvme_prep_rq(dev, req);
+	if (unlikely(ret))
+		return ret;
+	nvme_submit_cmd(nvmeq, &iod->cmd, bd->last);
+	return BLK_STS_OK;
+}
+
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-12-16 16:05 [PATCHSET v4 0/4] Add support for list issue Jens Axboe
                   ` (2 preceding siblings ...)
  2021-12-16 16:05 ` [PATCH 3/4] nvme: separate command prep and issue Jens Axboe
@ 2021-12-16 16:05 ` Jens Axboe
  2021-12-16 16:11   ` [PATCH v2 " Jens Axboe
  3 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-12-16 16:05 UTC (permalink / raw)
  To: io-uring, linux-block, linux-nvme
  Cc: Jens Axboe, Hannes Reinecke, Keith Busch

This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.

If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.

This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 58 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6be6b1ab4285..e34ad67c4c41 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -981,6 +981,63 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_OK;
 }
 
+static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
+{
+	spin_lock(&nvmeq->sq_lock);
+	while (!rq_list_empty(*rqlist)) {
+		struct request *req = rq_list_pop(rqlist);
+		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+		nvme_sq_copy_cmd(nvmeq, &iod->cmd);
+	}
+	nvme_write_sq_db(nvmeq, true);
+	spin_unlock(&nvmeq->sq_lock);
+}
+
+static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
+{
+	/*
+	 * We should not need to do this, but we're still using this to
+	 * ensure we can drain requests on a dying queue.
+	 */
+	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
+		return false;
+	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
+		return false;
+
+	req->mq_hctx->tags->rqs[req->tag] = req;
+	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
+}
+
+static void nvme_queue_rqs(struct request **rqlist)
+{
+	struct request *req = rq_list_peek(rqlist), *prev = NULL;
+	struct request *requeue_list = NULL;
+
+	do {
+		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+
+		if (!nvme_prep_rq_batch(nvmeq, req)) {
+			/* detach 'req' and add to remainder list */
+			if (prev)
+				prev->rq_next = req->rq_next;
+			rq_list_add(&requeue_list, req);
+		} else {
+			prev = req;
+		}
+
+		req = rq_list_next(req);
+		if (!req || (prev && req->mq_hctx != prev->mq_hctx)) {
+			/* detach rest of list, and submit */
+			prev->rq_next = NULL;
+			nvme_submit_cmds(nvmeq, rqlist);
+			*rqlist = req;
+		}
+	} while (req);
+
+	*rqlist = requeue_list;
+}
+
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1678,6 +1735,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.queue_rqs	= nvme_queue_rqs,
 	.complete	= nvme_pci_complete_rq,
 	.commit_rqs	= nvme_commit_rqs,
 	.init_hctx	= nvme_init_hctx,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-12-16 16:05 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
@ 2021-12-16 16:11   ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-12-16 16:11 UTC (permalink / raw)
  To: io-uring, linux-block, linux-nvme; +Cc: Hannes Reinecke, Keith Busch

This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.

If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.

This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

Added the prev check suggested by Max.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6be6b1ab4285..58d97660374a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -981,6 +981,64 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_OK;
 }
 
+static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
+{
+	spin_lock(&nvmeq->sq_lock);
+	while (!rq_list_empty(*rqlist)) {
+		struct request *req = rq_list_pop(rqlist);
+		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+		nvme_sq_copy_cmd(nvmeq, &iod->cmd);
+	}
+	nvme_write_sq_db(nvmeq, true);
+	spin_unlock(&nvmeq->sq_lock);
+}
+
+static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
+{
+	/*
+	 * We should not need to do this, but we're still using this to
+	 * ensure we can drain requests on a dying queue.
+	 */
+	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
+		return false;
+	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
+		return false;
+
+	req->mq_hctx->tags->rqs[req->tag] = req;
+	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
+}
+
+static void nvme_queue_rqs(struct request **rqlist)
+{
+	struct request *req = rq_list_peek(rqlist), *prev = NULL;
+	struct request *requeue_list = NULL;
+
+	do {
+		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+
+		if (!nvme_prep_rq_batch(nvmeq, req)) {
+			/* detach 'req' and add to remainder list */
+			if (prev)
+				prev->rq_next = req->rq_next;
+			rq_list_add(&requeue_list, req);
+		} else {
+			prev = req;
+		}
+
+		req = rq_list_next(req);
+		if (!req || (prev && req->mq_hctx != prev->mq_hctx)) {
+			/* detach rest of list, and submit */
+			if (prev)
+				prev->rq_next = NULL;
+			nvme_submit_cmds(nvmeq, rqlist);
+			*rqlist = req;
+		}
+	} while (req);
+
+	*rqlist = requeue_list;
+}
+
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1678,6 +1736,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.queue_rqs	= nvme_queue_rqs,
 	.complete	= nvme_pci_complete_rq,
 	.commit_rqs	= nvme_commit_rqs,
 	.init_hctx	= nvme_init_hctx,

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-12-16 16:39 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
@ 2021-12-16 17:53   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2021-12-16 17:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-block, linux-nvme, Hannes Reinecke, Keith Busch

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-12-16 16:38 [PATCHSET v5 0/4] Add support for list issue Jens Axboe
@ 2021-12-16 16:39 ` Jens Axboe
  2021-12-16 17:53   ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-12-16 16:39 UTC (permalink / raw)
  To: io-uring, linux-block, linux-nvme
  Cc: Jens Axboe, Hannes Reinecke, Keith Busch

This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.

If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.

This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 59 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7062128c8204..51a903d91d92 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -969,6 +969,64 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_OK;
 }
 
+static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
+{
+	spin_lock(&nvmeq->sq_lock);
+	while (!rq_list_empty(*rqlist)) {
+		struct request *req = rq_list_pop(rqlist);
+		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+		nvme_sq_copy_cmd(nvmeq, &iod->cmd);
+	}
+	nvme_write_sq_db(nvmeq, true);
+	spin_unlock(&nvmeq->sq_lock);
+}
+
+static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
+{
+	/*
+	 * We should not need to do this, but we're still using this to
+	 * ensure we can drain requests on a dying queue.
+	 */
+	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
+		return false;
+	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
+		return false;
+
+	req->mq_hctx->tags->rqs[req->tag] = req;
+	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
+}
+
+static void nvme_queue_rqs(struct request **rqlist)
+{
+	struct request *req = rq_list_peek(rqlist), *prev = NULL;
+	struct request *requeue_list = NULL;
+
+	do {
+		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+
+		if (!nvme_prep_rq_batch(nvmeq, req)) {
+			/* detach 'req' and add to remainder list */
+			if (prev)
+				prev->rq_next = req->rq_next;
+			rq_list_add(&requeue_list, req);
+		} else {
+			prev = req;
+		}
+
+		req = rq_list_next(req);
+		if (!req || (prev && req->mq_hctx != prev->mq_hctx)) {
+			/* detach rest of list, and submit */
+			if (prev)
+				prev->rq_next = NULL;
+			nvme_submit_cmds(nvmeq, rqlist);
+			*rqlist = req;
+		}
+	} while (req);
+
+	*rqlist = requeue_list;
+}
+
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1670,6 +1728,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.queue_rqs	= nvme_queue_rqs,
 	.complete	= nvme_pci_complete_rq,
 	.commit_rqs	= nvme_commit_rqs,
 	.init_hctx	= nvme_init_hctx,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-12-06  7:40   ` Christoph Hellwig
@ 2021-12-06 16:33     ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-12-06 16:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, linux-nvme

On 12/6/21 12:40 AM, Christoph Hellwig wrote:
> On Fri, Dec 03, 2021 at 02:45:44PM -0700, Jens Axboe wrote:
>> This enables the block layer to send us a full plug list of requests
>> that need submitting. The block layer guarantees that they all belong
>> to the same queue, but we do have to check the hardware queue mapping
>> for each request.
>>
>> If errors are encountered, leave them in the passed in list. Then the
>> block layer will handle them individually.
>>
>> This is good for about a 4% improvement in peak performance, taking us
>> from 9.6M to 10M IOPS/core.
> 
> This looks pretty similar to my proposed cleanups (which is nice), but
> back then you mentioned the cleaner version was much slower.  Do you
> know what brought the speed back in this version?

Yes, it has that folded in and tweaked on top. Current version seems
to be fine.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-12-03 21:45 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
  2021-12-04 10:47   ` Hannes Reinecke
@ 2021-12-06  7:40   ` Christoph Hellwig
  2021-12-06 16:33     ` Jens Axboe
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-12-06  7:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-nvme

On Fri, Dec 03, 2021 at 02:45:44PM -0700, Jens Axboe wrote:
> This enables the block layer to send us a full plug list of requests
> that need submitting. The block layer guarantees that they all belong
> to the same queue, but we do have to check the hardware queue mapping
> for each request.
> 
> If errors are encountered, leave them in the passed in list. Then the
> block layer will handle them individually.
> 
> This is good for about a 4% improvement in peak performance, taking us
> from 9.6M to 10M IOPS/core.

This looks pretty similar to my proposed cleanups (which is nice), but
back then you mentioned the cleaner version was much slower.  Do you
know what brought the speed back in this version?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-12-03 21:45 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
@ 2021-12-04 10:47   ` Hannes Reinecke
  2021-12-06  7:40   ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Hannes Reinecke @ 2021-12-04 10:47 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme

On 12/3/21 10:45 PM, Jens Axboe wrote:
> This enables the block layer to send us a full plug list of requests
> that need submitting. The block layer guarantees that they all belong
> to the same queue, but we do have to check the hardware queue mapping
> for each request.
> 
> If errors are encountered, leave them in the passed in list. Then the
> block layer will handle them individually.
> 
> This is good for about a 4% improvement in peak performance, taking us
> from 9.6M to 10M IOPS/core.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>   drivers/nvme/host/pci.c | 61 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 61 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-12-03 21:45 [PATCHSET v2 0/4] Add support for list issue Jens Axboe
@ 2021-12-03 21:45 ` Jens Axboe
  2021-12-04 10:47   ` Hannes Reinecke
  2021-12-06  7:40   ` Christoph Hellwig
  0 siblings, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2021-12-03 21:45 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: Jens Axboe

This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.

If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.

This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 61 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6be6b1ab4285..197aa45ef7ef 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -981,6 +981,66 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_OK;
 }
 
+static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
+{
+	spin_lock(&nvmeq->sq_lock);
+	while (!rq_list_empty(*rqlist)) {
+		struct request *req = rq_list_pop(rqlist);
+		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+		memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
+				absolute_pointer(&iod->cmd), sizeof(iod->cmd));
+		if (++nvmeq->sq_tail == nvmeq->q_depth)
+			nvmeq->sq_tail = 0;
+	}
+	nvme_write_sq_db(nvmeq, true);
+	spin_unlock(&nvmeq->sq_lock);
+}
+
+static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
+{
+	/*
+	 * We should not need to do this, but we're still using this to
+	 * ensure we can drain requests on a dying queue.
+	 */
+	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
+		return false;
+	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
+		return false;
+
+	req->mq_hctx->tags->rqs[req->tag] = req;
+	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
+}
+
+static void nvme_queue_rqs(struct request **rqlist)
+{
+	struct request *req = rq_list_peek(rqlist), *prev = NULL;
+	struct request *requeue_list = NULL;
+
+	do {
+		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+
+		if (!nvme_prep_rq_batch(nvmeq, req)) {
+			/* detach 'req' and add to remainder list */
+			if (prev)
+				prev->rq_next = req->rq_next;
+			rq_list_add(&requeue_list, req);
+		} else {
+			prev = req;
+		}
+
+		req = rq_list_next(req);
+		if (!req || (prev && req->mq_hctx != prev->mq_hctx)) {
+			/* detach rest of list, and submit */
+			prev->rq_next = NULL;
+			nvme_submit_cmds(nvmeq, rqlist);
+			*rqlist = req;
+		}
+	} while (req);
+
+	*rqlist = requeue_list;
+}
+
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1678,6 +1738,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.queue_rqs	= nvme_queue_rqs,
 	.complete	= nvme_pci_complete_rq,
 	.commit_rqs	= nvme_commit_rqs,
 	.init_hctx	= nvme_init_hctx,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-11-17  3:38 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
  2021-11-17  8:39   ` Christoph Hellwig
@ 2021-11-17 19:41   ` Keith Busch
  1 sibling, 0 replies; 17+ messages in thread
From: Keith Busch @ 2021-11-17 19:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, hch

[adding linux-nvme]

On Tue, Nov 16, 2021 at 08:38:07PM -0700, Jens Axboe wrote:
> This enables the block layer to send us a full plug list of requests
> that need submitting. The block layer guarantees that they all belong
> to the same queue, but we do have to check the hardware queue mapping
> for each request.

So this means that the nvme namespace will always be the same for all
the requests in the list, but the rqlist may contain requests allocated
from different CPUs?
 
> If errors are encountered, leave them in the passed in list. Then the
> block layer will handle them individually.
> 
> This is good for about a 4% improvement in peak performance, taking us
> from 9.6M to 10M IOPS/core.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  drivers/nvme/host/pci.c | 67 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 67 insertions(+)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index d2b654fc3603..2eedd04b1f90 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1004,6 +1004,72 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	return ret;
>  }
>  
> +static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
> +{
> +	spin_lock(&nvmeq->sq_lock);
> +	while (!rq_list_empty(*rqlist)) {
> +		struct request *req = rq_list_pop(rqlist);
> +		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> +		nvme_copy_cmd(nvmeq, absolute_pointer(&iod->cmd));
> +	}
> +	nvme_write_sq_db(nvmeq, true);
> +	spin_unlock(&nvmeq->sq_lock);
> +}
> +
> +static void nvme_queue_rqs(struct request **rqlist)
> +{
> +	struct request *requeue_list = NULL, *req, *prev = NULL;
> +	struct blk_mq_hw_ctx *hctx;
> +	struct nvme_queue *nvmeq;
> +	struct nvme_ns *ns;
> +
> +restart:
> +	req = rq_list_peek(rqlist);
> +	hctx = req->mq_hctx;
> +	nvmeq = hctx->driver_data;
> +	ns = hctx->queue->queuedata;
> +
> +	/*
> +	 * We should not need to do this, but we're still using this to
> +	 * ensure we can drain requests on a dying queue.
> +	 */
> +	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
> +		return;
> +
> +	rq_list_for_each(rqlist, req) {
> +		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +		blk_status_t ret;
> +
> +		if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
> +			goto requeue;
> +
> +		if (req->mq_hctx != hctx) {
> +			/* detach rest of list, and submit */
> +			prev->rq_next = NULL;
> +			nvme_submit_cmds(nvmeq, rqlist);
> +			/* req now start of new list for this hw queue */
> +			*rqlist = req;
> +			goto restart;
> +		}
> +
> +		hctx->tags->rqs[req->tag] = req;

After checking how this is handled previously, it appears this new
.queue_rqs() skips incrementing active requests, and bypasses the
hctx_lock(). Won't that break quiesce?

> +		ret = nvme_prep_rq(nvmeq->dev, ns, req, &iod->cmd);
> +		if (ret == BLK_STS_OK) {
> +			prev = req;
> +			continue;
> +		}
> +requeue:
> +		/* detach 'req' and add to remainder list */
> +		if (prev)
> +			prev->rq_next = req->rq_next;
> +		rq_list_add(&requeue_list, req);
> +	}
> +
> +	nvme_submit_cmds(nvmeq, rqlist);
> +	*rqlist = requeue_list;
> +}
> +
>  static __always_inline void nvme_pci_unmap_rq(struct request *req)
>  {
>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> @@ -1741,6 +1807,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
>  
>  static const struct blk_mq_ops nvme_mq_ops = {
>  	.queue_rq	= nvme_queue_rq,
> +	.queue_rqs	= nvme_queue_rqs,
>  	.complete	= nvme_pci_complete_rq,
>  	.commit_rqs	= nvme_commit_rqs,
>  	.init_hctx	= nvme_init_hctx,
> -- 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-11-17 15:55     ` Jens Axboe
@ 2021-11-17 15:58       ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-11-17 15:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block

On 11/17/21 8:55 AM, Jens Axboe wrote:
> On 11/17/21 1:39 AM, Christoph Hellwig wrote:
>> On Tue, Nov 16, 2021 at 08:38:07PM -0700, Jens Axboe wrote:
>>> This enables the block layer to send us a full plug list of requests
>>> that need submitting. The block layer guarantees that they all belong
>>> to the same queue, but we do have to check the hardware queue mapping
>>> for each request.
>>>
>>> If errors are encountered, leave them in the passed in list. Then the
>>> block layer will handle them individually.
>>>
>>> This is good for about a 4% improvement in peak performance, taking us
>>> from 9.6M to 10M IOPS/core.
>>
>> The concept looks sensible, but the loop in nvme_queue_rqs is a complete
>> mess to follow. What about something like this (untested) on top?
> 
> Let me take a closer look.

Something changed, efficiency is way down:

     2.26%     +4.34%  [nvme]            [k] nvme_queue_rqs


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-11-17  8:39   ` Christoph Hellwig
@ 2021-11-17 15:55     ` Jens Axboe
  2021-11-17 15:58       ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-11-17 15:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block

On 11/17/21 1:39 AM, Christoph Hellwig wrote:
> On Tue, Nov 16, 2021 at 08:38:07PM -0700, Jens Axboe wrote:
>> This enables the block layer to send us a full plug list of requests
>> that need submitting. The block layer guarantees that they all belong
>> to the same queue, but we do have to check the hardware queue mapping
>> for each request.
>>
>> If errors are encountered, leave them in the passed in list. Then the
>> block layer will handle them individually.
>>
>> This is good for about a 4% improvement in peak performance, taking us
>> from 9.6M to 10M IOPS/core.
> 
> The concept looks sensible, but the loop in nvme_queue_rqs is a complete
> mess to follow. What about something like this (untested) on top?

Let me take a closer look.

> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 13722cc400c2c..555a7609580c7 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -509,21 +509,6 @@ static inline void nvme_copy_cmd(struct nvme_queue *nvmeq,
>  		nvmeq->sq_tail = 0;
>  }
>  
> -/**
> - * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
> - * @nvmeq: The queue to use
> - * @cmd: The command to send
> - * @write_sq: whether to write to the SQ doorbell
> - */
> -static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
> -			    bool write_sq)
> -{
> -	spin_lock(&nvmeq->sq_lock);
> -	nvme_copy_cmd(nvmeq, cmd);
> -	nvme_write_sq_db(nvmeq, write_sq);
> -	spin_unlock(&nvmeq->sq_lock);
> -}

You really don't like helpers? Code generation wise it doesn't matter,
but without this and the copy helper we do end up having some trivial
duplicated code...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-11-17  3:38 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
@ 2021-11-17  8:39   ` Christoph Hellwig
  2021-11-17 15:55     ` Jens Axboe
  2021-11-17 19:41   ` Keith Busch
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2021-11-17  8:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, hch

On Tue, Nov 16, 2021 at 08:38:07PM -0700, Jens Axboe wrote:
> This enables the block layer to send us a full plug list of requests
> that need submitting. The block layer guarantees that they all belong
> to the same queue, but we do have to check the hardware queue mapping
> for each request.
> 
> If errors are encountered, leave them in the passed in list. Then the
> block layer will handle them individually.
> 
> This is good for about a 4% improvement in peak performance, taking us
> from 9.6M to 10M IOPS/core.

The concept looks sensible, but the loop in nvme_queue_rqs is a complete
mess to follow. What about something like this (untested) on top?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 13722cc400c2c..555a7609580c7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -509,21 +509,6 @@ static inline void nvme_copy_cmd(struct nvme_queue *nvmeq,
 		nvmeq->sq_tail = 0;
 }
 
-/**
- * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
- * @nvmeq: The queue to use
- * @cmd: The command to send
- * @write_sq: whether to write to the SQ doorbell
- */
-static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
-			    bool write_sq)
-{
-	spin_lock(&nvmeq->sq_lock);
-	nvme_copy_cmd(nvmeq, cmd);
-	nvme_write_sq_db(nvmeq, write_sq);
-	spin_unlock(&nvmeq->sq_lock);
-}
-
 static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 {
 	struct nvme_queue *nvmeq = hctx->driver_data;
@@ -918,8 +903,7 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
 	return BLK_STS_OK;
 }
 
-static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct nvme_ns *ns,
-				 struct request *req, struct nvme_command *cmnd)
+static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret;
@@ -928,18 +912,18 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct nvme_ns *ns,
 	iod->npages = -1;
 	iod->nents = 0;
 
-	ret = nvme_setup_cmd(ns, req);
+	ret = nvme_setup_cmd(req->q->queuedata, req);
 	if (ret)
 		return ret;
 
 	if (blk_rq_nr_phys_segments(req)) {
-		ret = nvme_map_data(dev, req, cmnd);
+		ret = nvme_map_data(dev, req, &iod->cmd);
 		if (ret)
 			goto out_free_cmd;
 	}
 
 	if (blk_integrity_rq(req)) {
-		ret = nvme_map_metadata(dev, req, cmnd);
+		ret = nvme_map_metadata(dev, req, &iod->cmd);
 		if (ret)
 			goto out_unmap_data;
 	}
@@ -959,7 +943,6 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct nvme_ns *ns,
 static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 			 const struct blk_mq_queue_data *bd)
 {
-	struct nvme_ns *ns = hctx->queue->queuedata;
 	struct nvme_queue *nvmeq = hctx->driver_data;
 	struct nvme_dev *dev = nvmeq->dev;
 	struct request *req = bd->rq;
@@ -976,12 +959,15 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (!nvme_check_ready(&dev->ctrl, req, true))
 		return nvme_fail_nonready_command(&dev->ctrl, req);
 
-	ret = nvme_prep_rq(dev, ns, req, &iod->cmd);
-	if (ret == BLK_STS_OK) {
-		nvme_submit_cmd(nvmeq, &iod->cmd, bd->last);
-		return BLK_STS_OK;
-	}
-	return ret;
+	ret = nvme_prep_rq(dev, req);
+	if (ret)
+		return ret;
+
+	spin_lock(&nvmeq->sq_lock);
+	nvme_copy_cmd(nvmeq, &iod->cmd);
+	nvme_write_sq_db(nvmeq, bd->last);
+	spin_unlock(&nvmeq->sq_lock);
+	return BLK_STS_OK;
 }
 
 static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
@@ -997,56 +983,47 @@ static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
 	spin_unlock(&nvmeq->sq_lock);
 }
 
-static void nvme_queue_rqs(struct request **rqlist)
+static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
 {
-	struct request *requeue_list = NULL, *req, *prev = NULL;
-	struct blk_mq_hw_ctx *hctx;
-	struct nvme_queue *nvmeq;
-	struct nvme_ns *ns;
-
-restart:
-	req = rq_list_peek(rqlist);
-	hctx = req->mq_hctx;
-	nvmeq = hctx->driver_data;
-	ns = hctx->queue->queuedata;
-
 	/*
 	 * We should not need to do this, but we're still using this to
 	 * ensure we can drain requests on a dying queue.
 	 */
 	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
-		return;
+		return false;
+	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
+		return false;
 
-	rq_list_for_each(rqlist, req) {
-		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-		blk_status_t ret;
+	req->mq_hctx->tags->rqs[req->tag] = req;
+	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
+}
 
-		if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
-			goto requeue;
+static void nvme_queue_rqs(struct request **rqlist)
+{
+	struct request *req = rq_list_peek(rqlist), *prev = NULL;
+	struct request *requeue_list = NULL;
+
+	do {
+		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+
+		if (!nvme_prep_rq_batch(nvmeq, req)) {
+			/* detach 'req' and add to remainder list */
+			if (prev)
+				prev->rq_next = req->rq_next;
+			rq_list_add(&requeue_list, req);
+		} else {
+			prev = req;
+		}
 
-		if (req->mq_hctx != hctx) {
+		req = rq_list_next(req);
+		if (!req || (prev && req->mq_hctx != prev->mq_hctx)) {
 			/* detach rest of list, and submit */
 			prev->rq_next = NULL;
 			nvme_submit_cmds(nvmeq, rqlist);
-			/* req now start of new list for this hw queue */
 			*rqlist = req;
-			goto restart;
-		}
-
-		hctx->tags->rqs[req->tag] = req;
-		ret = nvme_prep_rq(nvmeq->dev, ns, req, &iod->cmd);
-		if (ret == BLK_STS_OK) {
-			prev = req;
-			continue;
 		}
-requeue:
-		/* detach 'req' and add to remainder list */
-		if (prev)
-			prev->rq_next = req->rq_next;
-		rq_list_add(&requeue_list, req);
-	}
+	} while (req);
 
-	nvme_submit_cmds(nvmeq, rqlist);
 	*rqlist = requeue_list;
 }
 
@@ -1224,7 +1201,11 @@ static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl)
 
 	c.common.opcode = nvme_admin_async_event;
 	c.common.command_id = NVME_AQ_BLK_MQ_DEPTH;
-	nvme_submit_cmd(nvmeq, &c, true);
+
+	spin_lock(&nvmeq->sq_lock);
+	nvme_copy_cmd(nvmeq, &c);
+	nvme_write_sq_db(nvmeq, true);
+	spin_unlock(&nvmeq->sq_lock);
 }
 
 static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
  2021-11-17  3:38 [PATCHSET 0/4] Add support for list issue Jens Axboe
@ 2021-11-17  3:38 ` Jens Axboe
  2021-11-17  8:39   ` Christoph Hellwig
  2021-11-17 19:41   ` Keith Busch
  0 siblings, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2021-11-17  3:38 UTC (permalink / raw)
  To: linux-block; +Cc: hch, Jens Axboe

This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.

If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.

This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 67 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d2b654fc3603..2eedd04b1f90 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1004,6 +1004,72 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct request **rqlist)
+{
+	spin_lock(&nvmeq->sq_lock);
+	while (!rq_list_empty(*rqlist)) {
+		struct request *req = rq_list_pop(rqlist);
+		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+		nvme_copy_cmd(nvmeq, absolute_pointer(&iod->cmd));
+	}
+	nvme_write_sq_db(nvmeq, true);
+	spin_unlock(&nvmeq->sq_lock);
+}
+
+static void nvme_queue_rqs(struct request **rqlist)
+{
+	struct request *requeue_list = NULL, *req, *prev = NULL;
+	struct blk_mq_hw_ctx *hctx;
+	struct nvme_queue *nvmeq;
+	struct nvme_ns *ns;
+
+restart:
+	req = rq_list_peek(rqlist);
+	hctx = req->mq_hctx;
+	nvmeq = hctx->driver_data;
+	ns = hctx->queue->queuedata;
+
+	/*
+	 * We should not need to do this, but we're still using this to
+	 * ensure we can drain requests on a dying queue.
+	 */
+	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
+		return;
+
+	rq_list_for_each(rqlist, req) {
+		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+		blk_status_t ret;
+
+		if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
+			goto requeue;
+
+		if (req->mq_hctx != hctx) {
+			/* detach rest of list, and submit */
+			prev->rq_next = NULL;
+			nvme_submit_cmds(nvmeq, rqlist);
+			/* req now start of new list for this hw queue */
+			*rqlist = req;
+			goto restart;
+		}
+
+		hctx->tags->rqs[req->tag] = req;
+		ret = nvme_prep_rq(nvmeq->dev, ns, req, &iod->cmd);
+		if (ret == BLK_STS_OK) {
+			prev = req;
+			continue;
+		}
+requeue:
+		/* detach 'req' and add to remainder list */
+		if (prev)
+			prev->rq_next = req->rq_next;
+		rq_list_add(&requeue_list, req);
+	}
+
+	nvme_submit_cmds(nvmeq, rqlist);
+	*rqlist = requeue_list;
+}
+
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -1741,6 +1807,7 @@ static const struct blk_mq_ops nvme_mq_admin_ops = {
 
 static const struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
+	.queue_rqs	= nvme_queue_rqs,
 	.complete	= nvme_pci_complete_rq,
 	.commit_rqs	= nvme_commit_rqs,
 	.init_hctx	= nvme_init_hctx,
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-12-16 17:53 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-16 16:05 [PATCHSET v4 0/4] Add support for list issue Jens Axboe
2021-12-16 16:05 ` [PATCH 1/4] block: add mq_ops->queue_rqs hook Jens Axboe
2021-12-16 16:05 ` [PATCH 2/4] nvme: split command copy into a helper Jens Axboe
2021-12-16 16:05 ` [PATCH 3/4] nvme: separate command prep and issue Jens Axboe
2021-12-16 16:05 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
2021-12-16 16:11   ` [PATCH v2 " Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2021-12-16 16:38 [PATCHSET v5 0/4] Add support for list issue Jens Axboe
2021-12-16 16:39 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
2021-12-16 17:53   ` Christoph Hellwig
2021-12-03 21:45 [PATCHSET v2 0/4] Add support for list issue Jens Axboe
2021-12-03 21:45 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
2021-12-04 10:47   ` Hannes Reinecke
2021-12-06  7:40   ` Christoph Hellwig
2021-12-06 16:33     ` Jens Axboe
2021-11-17  3:38 [PATCHSET 0/4] Add support for list issue Jens Axboe
2021-11-17  3:38 ` [PATCH 4/4] nvme: add support for mq_ops->queue_rqs() Jens Axboe
2021-11-17  8:39   ` Christoph Hellwig
2021-11-17 15:55     ` Jens Axboe
2021-11-17 15:58       ` Jens Axboe
2021-11-17 19:41   ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).