* [PATCH v3 0/8] Improve zoned storage write performance
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche

Hi Jens,

Measurements have shown that limiting the queue depth to one per sequential
zone has a significant negative performance impact on zoned UFS devices.
Hence this patch series, which increases the queue depth for write commands
directed at sequential zones when the mq-deadline scheduler is used.

Please consider this patch series for kernel v5.20.

Thanks,

Bart.

Changes compared to v2:
- Included the patches again that enable write pipelining for NVMe ZNS SSDs.
- Disabled merging in the null_blk test script. As a result, the test script
  now shows a 14x improvement with pipelining enabled.
- Renamed blk_rq_is_zoned_seq_write() to blk_rq_is_seq_zone_write().

Changes compared to v1:
- Left out the patch for the UFS driver.
- Included patches for the null_blk driver.

Bart Van Assche (8):
  block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq()
  block: Introduce the blk_rq_is_seq_zone_write() function
  block: Introduce a request queue flag for pipelining zoned writes
  block/mq-deadline: Only use zone locking if necessary
  block/null_blk: Refactor null_queue_rq()
  block/null_blk: Add support for pipelining zoned writes
  nvme: Make the number of retries command specific
  nvme: Enable pipelining of zoned writes

 block/blk-zoned.c                 | 17 +++--------------
 block/mq-deadline.c               | 18 ++++++++++++------
 drivers/block/null_blk/main.c     | 30 ++++++++++++++++++++----------
 drivers/block/null_blk/null_blk.h |  3 +++
 drivers/block/null_blk/zoned.c    | 13 ++++++++++++-
 drivers/nvme/host/core.c          |  9 ++++++++-
 drivers/nvme/host/nvme.h          |  1 +
 drivers/nvme/host/zns.c           |  2 ++
 include/linux/blk-mq.h            | 30 ++++++++++++++++++++++++++++++
 include/linux/blkdev.h            | 16 ++++++++++++++++
 10 files changed, 107 insertions(+), 32 deletions(-)



* [PATCH v3 1/8] block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq()
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal

Since it is nontrivial to figure out how blk_queue_zone_is_seq() and
blk_rq_zone_is_seq() handle sequential write preferred zones, document
this.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/linux/blk-mq.h | 7 +++++++
 include/linux/blkdev.h | 9 +++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e2d9daf7e8dd..909d47e34b7c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1124,6 +1124,13 @@ static inline unsigned int blk_rq_zone_no(struct request *rq)
 	return blk_queue_zone_no(rq->q, blk_rq_pos(rq));
 }
 
+/**
+ * blk_rq_zone_is_seq() - Whether a request is for a sequential zone.
+ * @rq: Request pointer.
+ *
+ * Return: true if and only if blk_rq_pos(@rq) refers either to a sequential
+ * write required or a sequential write preferred zone.
+ */
 static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
 {
 	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b2d42201bd5d..6491250ba286 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -687,6 +687,15 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 	return sector >> ilog2(q->limits.chunk_sectors);
 }
 
+/**
+ * blk_queue_zone_is_seq() - Whether a logical block is in a sequential zone.
+ * @q: Request queue pointer.
+ * @sector: Offset from start of block device in 512 byte units.
+ *
+ * Return: true if and only if @q is associated with a zoned block device and
+ * @sector refers either to a sequential write required or a sequential write
+ * preferred zone.
+ */
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
 					 sector_t sector)
 {

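For illustration, a minimal sketch of a caller relying on the semantics
documented above (the branch bodies are placeholders, not kernel code):

    if (blk_rq_zone_is_seq(rq)) {
            /* blk_rq_pos(rq) lies in a sequential write required or a
             * sequential write preferred zone: writes must (required)
             * or should (preferred) land at the write pointer. */
    } else {
            /* Conventional zone, or rq->q is not zoned at all: random
             * writes are acceptable. */
    }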

* [PATCH v3 2/8] block: Introduce the blk_rq_is_seq_zone_write() function
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal

Introduce a function that makes it easy to verify whether a write
request is for a sequential write required or sequential write preferred
zone. Simplify blk_req_needs_zone_write_lock() by using the new function.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c      | 14 +-------------
 include/linux/blk-mq.h | 25 ++++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 38cd840d8838..9da8cf1bb378 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -57,19 +57,7 @@ EXPORT_SYMBOL_GPL(blk_zone_cond_str);
  */
 bool blk_req_needs_zone_write_lock(struct request *rq)
 {
-	if (!rq->q->seq_zones_wlock)
-		return false;
-
-	if (blk_rq_is_passthrough(rq))
-		return false;
-
-	switch (req_op(rq)) {
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-		return blk_rq_zone_is_seq(rq);
-	default:
-		return false;
-	}
+	return rq->q->seq_zones_wlock && blk_rq_is_seq_zone_write(rq);
 }
 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 909d47e34b7c..ccfcac9db879 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1129,13 +1129,31 @@ static inline unsigned int blk_rq_zone_no(struct request *rq)
  * @rq: Request pointer.
  *
  * Return: true if and only if blk_rq_pos(@rq) refers either to a sequential
- * write required or a sequential write preferred zone.
+ * write required or to a sequential write preferred zone.
  */
 static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
 {
 	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
 }
 
+/**
+ * blk_rq_is_seq_zone_write() - Whether @rq is a write request for a sequential zone.
+ * @rq: Request to examine.
+ *
+ * In this context, a sequential zone means either a sequential write
+ * required or a sequential write preferred zone.
+ */
+static inline bool blk_rq_is_seq_zone_write(struct request *rq)
+{
+	switch (req_op(rq)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+		return blk_rq_zone_is_seq(rq);
+	default:
+		return false;
+	}
+}
+
 bool blk_req_needs_zone_write_lock(struct request *rq);
 bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
@@ -1166,6 +1184,11 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 	return !blk_req_zone_is_write_locked(rq);
 }
 #else /* CONFIG_BLK_DEV_ZONED */
+static inline bool blk_rq_is_seq_zone_write(struct request *rq)
+{
+	return false;
+}
+
 static inline bool blk_req_needs_zone_write_lock(struct request *rq)
 {
 	return false;

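As a quick reference, the mapping the new helper implements (derived from
the switch statement above; REQ_OP_ZONE_APPEND is listed for contrast since
it falls through to the default case):

    req_op(rq)               blk_rq_is_seq_zone_write(rq)
    -----------------------------------------------------
    REQ_OP_WRITE             blk_rq_zone_is_seq(rq)
    REQ_OP_WRITE_ZEROES      blk_rq_zone_is_seq(rq)
    REQ_OP_ZONE_APPEND       false (default case)
    any other operation      false (default case)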

* [PATCH v3 3/8] block: Introduce a request queue flag for pipelining zoned writes
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal

Writes in sequential write required zones must happen at the write
pointer. Even if the submitter of the write commands (e.g. a filesystem)
submits writes for sequential write required zones in order, the block
layer or the storage controller may reorder these write commands.

The zone locking mechanism in the mq-deadline I/O scheduler serializes
write commands for sequential zones. Some but not all storage controllers
require this serialization. Introduce a new flag such that block drivers
can request pipelining of writes for sequential write required zones.

An example of a storage controller standard that requires write
serialization is AHCI (Advanced Host Controller Interface). Submitting
commands to AHCI controllers happens by writing a bit pattern into a
register. Each set bit corresponds to an active command. This mechanism
does not preserve command ordering information.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/linux/blkdev.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6491250ba286..7ee59e507fbf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -580,6 +580,8 @@ struct request_queue {
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
+/* Writes for sequential write required zones may be pipelined. */
+#define QUEUE_FLAG_PIPELINE_ZONED_WRITES 31
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP) |		\
@@ -623,6 +625,11 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
 #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
 
+static inline bool blk_queue_pipeline_zoned_writes(struct request_queue *q)
+{
+	return test_bit(QUEUE_FLAG_PIPELINE_ZONED_WRITES, &(q)->queue_flags);
+}
+
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
 

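For illustration, a block driver whose controller preserves the order of
writes within a hardware queue would opt in during queue setup roughly as
follows (a sketch with hypothetical names; the real users appear in patches
6/8 and 8/8):

    /* Hypothetical driver setup: opt in when the controller is known
     * not to reorder writes within a hardware queue. Whether that holds
     * is driver knowledge; the block layer cannot detect it.
     */
    static void exdrv_setup_queue(struct request_queue *q,
                                  bool preserves_write_order)
    {
            if (preserves_write_order)
                    blk_queue_flag_set(QUEUE_FLAG_PIPELINE_ZONED_WRITES, q);
    }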

* [PATCH v3 4/8] block/mq-deadline: Only use zone locking if necessary
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal

Measurements have shown that limiting the queue depth to one for zoned
writes has a significant negative performance impact on zoned UFS devices.
Hence this patch, which disables zone locking in the mq-deadline scheduler
for storage controllers that support pipelining zoned writes. This patch is
based on the following assumptions:
- Applications submit write requests to sequential write required zones
  in order.
- The I/O priority of all pipelined write requests is the same per zone.
- Pipelined zoned write requests are submitted to a single hardware
  queue per zone.
- If such write requests get reordered by the software or hardware queue
  mechanism, nr_requests - 1 retries are sufficient to reorder the write
  requests.
- It happens infrequently that zoned write requests are reordered by the
  block layer.
- Either no I/O scheduler is used or an I/O scheduler is used that
  submits write requests per zone in LBA order.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c   |  3 ++-
 block/mq-deadline.c | 18 ++++++++++++------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 9da8cf1bb378..63c730a18ac4 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -513,7 +513,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
-		if (!args->seq_zones_wlock) {
+		if (!blk_queue_pipeline_zoned_writes(q) &&
+		    !args->seq_zones_wlock) {
 			args->seq_zones_wlock =
 				blk_alloc_zone_bitmap(q->node, args->nr_zones);
 			if (!args->seq_zones_wlock)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 1a9e835e816c..aaef07a55984 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -292,7 +292,8 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		return NULL;
 
 	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
+	    blk_queue_pipeline_zoned_writes(rq->q))
 		return rq;
 
 	/*
@@ -326,7 +327,8 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	if (!rq)
 		return NULL;
 
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
+	    blk_queue_pipeline_zoned_writes(rq->q))
 		return rq;
 
 	/*
@@ -445,8 +447,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	}
 
 	/*
-	 * For a zoned block device, if we only have writes queued and none of
-	 * them can be dispatched, rq will be NULL.
+	 * For a zoned block device that requires write serialization, if we
+	 * only have writes queued and none of them can be dispatched, rq will
+	 * be NULL.
 	 */
 	if (!rq)
 		return NULL;
@@ -719,6 +722,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
 	struct dd_per_prio *per_prio;
 	enum dd_prio prio;
+	bool pipelined_seq_write = blk_queue_pipeline_zoned_writes(q) &&
+		blk_rq_is_seq_zone_write(rq);
 	LIST_HEAD(free);
 
 	lockdep_assert_held(&dd->lock);
@@ -743,7 +748,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	trace_block_rq_insert(rq);
 
-	if (at_head) {
+	if (at_head && !pipelined_seq_write) {
 		list_add(&rq->queuelist, &per_prio->dispatch);
 		rq->fifo_time = jiffies;
 	} else {
@@ -823,7 +828,8 @@ static void dd_finish_request(struct request *rq)
 
 	atomic_inc(&per_prio->stats.completed);
 
-	if (blk_queue_is_zoned(q)) {
+	if (blk_queue_is_zoned(rq->q) &&
+	    !blk_queue_pipeline_zoned_writes(q)) {
 		unsigned long flags;
 
 		spin_lock_irqsave(&dd->zone_lock, flags);

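In terms of the effective queue depth per sequential zone, the change
amounts to the following (a summary of the hunks above, not additional
code):

    /*
     * blk_queue_pipeline_zoned_writes(q) == false:
     *     mq-deadline takes the per-zone write lock, so at most one
     *     write per sequential zone is in flight at any time.
     *
     * blk_queue_pipeline_zoned_writes(q) == true:
     *     no zone write lock is taken and at-head insertion is skipped
     *     for sequential zone writes, so up to q->nr_requests writes
     *     per zone may be in flight, relying on the assumptions listed
     *     in the commit message.
     */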

* [PATCH v3 5/8] block/null_blk: Refactor null_queue_rq()
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal

Introduce a local variable for the expression bd->rq since that expression
occurs multiple times. This patch does not change any functionality.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/block/null_blk/main.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 6b67088f4ea7..fd68e6f4637f 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1609,10 +1609,11 @@ static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res)
 static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 			 const struct blk_mq_queue_data *bd)
 {
-	struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	struct request *rq = bd->rq;
+	struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq);
 	struct nullb_queue *nq = hctx->driver_data;
-	sector_t nr_sectors = blk_rq_sectors(bd->rq);
-	sector_t sector = blk_rq_pos(bd->rq);
+	sector_t nr_sectors = blk_rq_sectors(rq);
+	sector_t sector = blk_rq_pos(rq);
 	const bool is_poll = hctx->type == HCTX_TYPE_POLL;
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
@@ -1621,14 +1622,14 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 		hrtimer_init(&cmd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 		cmd->timer.function = null_cmd_timer_expired;
 	}
-	cmd->rq = bd->rq;
+	cmd->rq = rq;
 	cmd->error = BLK_STS_OK;
 	cmd->nq = nq;
-	cmd->fake_timeout = should_timeout_request(bd->rq);
+	cmd->fake_timeout = should_timeout_request(rq);
 
-	blk_mq_start_request(bd->rq);
+	blk_mq_start_request(rq);
 
-	if (should_requeue_request(bd->rq)) {
+	if (should_requeue_request(rq)) {
 		/*
 		 * Alternate between hitting the core BUSY path, and the
 		 * driver driven requeue path
@@ -1637,21 +1638,21 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 		if (nq->requeue_selection & 1)
 			return BLK_STS_RESOURCE;
 		else {
-			blk_mq_requeue_request(bd->rq, true);
+			blk_mq_requeue_request(rq, true);
 			return BLK_STS_OK;
 		}
 	}
 
 	if (is_poll) {
 		spin_lock(&nq->poll_lock);
-		list_add_tail(&bd->rq->queuelist, &nq->poll_list);
+		list_add_tail(&rq->queuelist, &nq->poll_list);
 		spin_unlock(&nq->poll_lock);
 		return BLK_STS_OK;
 	}
 	if (cmd->fake_timeout)
 		return BLK_STS_OK;
 
-	return null_handle_cmd(cmd, sector, nr_sectors, req_op(bd->rq));
+	return null_handle_cmd(cmd, sector, nr_sectors, req_op(rq));
 }
 
 static void cleanup_queue(struct nullb_queue *nq)


* [PATCH v3 6/8] block/null_blk: Add support for pipelining zoned writes
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal

Add a new configfs attribute for enabling pipelining of zoned writes. If
that attribute has been set, retry zoned writes that are not aligned with
the write pointer. The test script below reports 244 K IOPS with no I/O
scheduler, 5.79 K IOPS with mq-deadline and pipelining disabled and 79.6 K
IOPS with mq-deadline and pipelining enabled. This shows that pipelining
results in about 14 times more IOPS for this particular test case.

    #!/bin/bash

    for mode in "none 0" "mq-deadline 0" "mq-deadline 1"; do
        set +e
        for d in /sys/kernel/config/nullb/*; do
            [ -d "$d" ] && rmdir "$d"
        done
        modprobe -r null_blk
        set -e
        read -r iosched pipelining <<<"$mode"
        modprobe null_blk nr_devices=0
        (
            cd /sys/kernel/config/nullb
            mkdir nullb0
            cd nullb0
            params=(
                completion_nsec=100000
                hw_queue_depth=64
                irqmode=2                # NULL_IRQ_TIMER
                max_sectors=$((4096/512))
                memory_backed=1
                pipeline_zoned_writes="${pipelining}"
                size=1
                submit_queues=1
                zone_size=1
                zoned=1
                power=1
            )
            for p in "${params[@]}"; do
                echo "${p//*=}" > "${p//=*}"
            done
        )
        udevadm settle
        dev=/dev/nullb0
        [ -b "${dev}" ]
        params=(
            --direct=1
            --filename="${dev}"
            --iodepth=64
            --iodepth_batch=16
            --ioengine=io_uring
            --ioscheduler="${iosched}"
            --gtod_reduce=1
            --hipri=0
            --name=nullb0
            --runtime=30
            --rw=write
            --time_based=1
            --zonemode=zbd
        )
        fio "${params[@]}"
    done

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/block/null_blk/main.c     |  9 +++++++++
 drivers/block/null_blk/null_blk.h |  3 +++
 drivers/block/null_blk/zoned.c    | 13 ++++++++++++-
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index fd68e6f4637f..de9ed9a4ca10 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -408,6 +408,7 @@ NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
 NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
+NULLB_DEVICE_ATTR(pipeline_zoned_writes, bool, NULL);
 NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
 
 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
@@ -531,6 +532,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_zone_nr_conv,
 	&nullb_device_attr_zone_max_open,
 	&nullb_device_attr_zone_max_active,
+	&nullb_device_attr_pipeline_zoned_writes,
 	&nullb_device_attr_virt_boundary,
 	NULL,
 };
@@ -1626,6 +1628,11 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 	cmd->error = BLK_STS_OK;
 	cmd->nq = nq;
 	cmd->fake_timeout = should_timeout_request(rq);
+	if (!(rq->rq_flags & RQF_DONTPREP)) {
+		rq->rq_flags |= RQF_DONTPREP;
+		cmd->retries = 0;
+		cmd->max_attempts = rq->q->nr_requests;
+	}
 
 	blk_mq_start_request(rq);
 
@@ -2042,6 +2049,8 @@ static int null_add_dev(struct nullb_device *dev)
 	nullb->q->queuedata = nullb;
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, nullb->q);
 	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, nullb->q);
+	if (dev->pipeline_zoned_writes)
+		blk_queue_flag_set(QUEUE_FLAG_PIPELINE_ZONED_WRITES, nullb->q);
 
 	mutex_lock(&lock);
 	nullb->index = ida_simple_get(&nullb_indexes, 0, 0, GFP_KERNEL);
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 8359b43842f2..bbe2cb17bdbd 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -23,6 +23,8 @@ struct nullb_cmd {
 	unsigned int tag;
 	blk_status_t error;
 	bool fake_timeout;
+	u16 retries;
+	u16 max_attempts;
 	struct nullb_queue *nq;
 	struct hrtimer timer;
 };
@@ -112,6 +114,7 @@ struct nullb_device {
 	bool memory_backed; /* if data is stored in memory */
 	bool discard; /* if support discard */
 	bool zoned; /* if device is zoned */
+	bool pipeline_zoned_writes;
 	bool virt_boundary; /* virtual boundary on/off for the device */
 };
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 2fdd7b20c224..7138954dc01c 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -403,7 +403,18 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 		else
 			cmd->bio->bi_iter.bi_sector = sector;
 	} else if (sector != zone->wp) {
-		ret = BLK_STS_IOERR;
+		/*
+		 * In case of a misaligned write and if pipelining of zoned
+		 * writes has been enabled, request the block layer to retry
+		 * until the maximum number of attempts has been reached. If
+		 * the maximum number of attempts has been reached, fail the
+		 * misaligned write.
+		 */
+		if (dev->pipeline_zoned_writes &&
+		    ++cmd->retries < cmd->max_attempts)
+			ret = BLK_STS_DEV_RESOURCE;
+		else
+			ret = BLK_STS_IOERR;
 		goto unlock;
 	}
 

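A sketch of one pipelined write's lifecycle through the two hunks above
(a summary, not additional code):

    /*
     * null_queue_rq():   RQF_DONTPREP clear -> first attempt:
     *                    cmd->retries = 0, cmd->max_attempts = nr_requests
     * null_zone_write(): sector != zone->wp -> ++cmd->retries and
     *                    BLK_STS_DEV_RESOURCE -> the block layer requeues
     * null_queue_rq():   RQF_DONTPREP already set -> both counters
     *                    survive the requeue
     * null_zone_write(): sector == zone->wp -> the write succeeds; once
     *                    ++cmd->retries reaches cmd->max_attempts the
     *                    misaligned write fails with BLK_STS_IOERR
     */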

* [PATCH v3 7/8] nvme: Make the number of retries command specific
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni

Add support for specifying the number of retries per NVMe command.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/nvme/host/core.c | 3 ++-
 drivers/nvme/host/nvme.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b3d9c29aba1e..df9ac7fab9b8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -339,7 +339,7 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
 
 	if (blk_noretry_request(req) ||
 	    (nvme_req(req)->status & NVME_SC_DNR) ||
-	    nvme_req(req)->retries >= nvme_max_retries)
+	    nvme_req(req)->retries >= nvme_req(req)->max_retries)
 		return COMPLETE;
 
 	if (req->cmd_flags & REQ_NVME_MPATH) {
@@ -632,6 +632,7 @@ static inline void nvme_clear_nvme_request(struct request *req)
 {
 	nvme_req(req)->status = 0;
 	nvme_req(req)->retries = 0;
+	nvme_req(req)->max_retries = nvme_max_retries;
 	nvme_req(req)->flags = 0;
 	req->rq_flags |= RQF_DONTPREP;
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0da94b233fed..ca415cd9571e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -160,6 +160,7 @@ struct nvme_request {
 	union nvme_result	result;
 	u8			genctr;
 	u8			retries;
+	u8			max_retries;
 	u8			flags;
 	u16			status;
 	struct nvme_ctrl	*ctrl;

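Every request still starts out with the module-wide default because
nvme_clear_nvme_request() assigns nvme_max_retries; a minimal sketch of a
per-command override (hypothetical caller; the real user of the new field
follows in patch 8/8):

    /* Hypothetical: allow one specific command only a single attempt. */
    nvme_req(req)->max_retries = 0;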

* [PATCH v3 8/8] nvme: Enable pipelining of zoned writes
From: Bart Van Assche @ 2022-06-27 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni

Enable pipelining of zoned writes. Increase the number of retries
for zoned writes to the maximum number of outstanding commands per hardware
queue.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/nvme/host/core.c | 6 ++++++
 drivers/nvme/host/zns.c  | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index df9ac7fab9b8..3f71d3d82d5e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -854,6 +854,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
+	if (blk_queue_pipeline_zoned_writes(req->q) &&
+	    blk_rq_is_seq_zone_write(req))
+		nvme_req(req)->max_retries =
+			min(0UL + type_max(typeof(nvme_req(req)->max_retries)),
+			    nvme_req(req)->max_retries + req->q->nr_requests);
+
 	cmnd->rw.opcode = op;
 	cmnd->rw.flags = 0;
 	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 9f81beb4df4e..0b10db3b8d3a 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -54,6 +54,8 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	struct nvme_id_ns_zns *id;
 	int status;
 
+	blk_queue_flag_set(QUEUE_FLAG_PIPELINE_ZONED_WRITES, ns->queue);
+
 	/* Driver requires zone append support */
 	if ((le32_to_cpu(log->iocs[nvme_cmd_zone_append]) &
 			NVME_CMD_EFFECTS_CSUPP)) {

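The min()/type_max() expression saturates the enlarged retry budget at what
the u8 max_retries field can hold. A sketch of the equivalent arithmetic
('budget' is an illustrative name; U8_MAX is written out for clarity):

    u32 budget = nvme_req(req)->max_retries + req->q->nr_requests;

    nvme_req(req)->max_retries = min_t(u32, U8_MAX, budget); /* <= 255 */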

* Re: [PATCH v3 1/8] block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq()
From: Chaitanya Kulkarni @ 2022-06-28  0:19 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal

On 6/27/22 16:43, Bart Van Assche wrote:
> Since it is nontrivial to figure out how blk_queue_zone_is_seq() and
> blk_rq_zone_is_seq() handle sequential write preferred zones, document
> this.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>


Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck



* Re: [PATCH v3 2/8] block: Introduce the blk_rq_is_seq_zone_write() function
From: Chaitanya Kulkarni @ 2022-06-28  0:20 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal

On 6/27/22 16:43, Bart Van Assche wrote:
> Introduce a function that makes it easy to verify whether a write
> request is for a sequential write required or sequential write preferred
> zone. Simplify blk_req_needs_zone_write_lock() by using the new function.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---

Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck




* Re: [PATCH v3 3/8] block: Introduce a request queue flag for pipelining zoned writes
From: Chaitanya Kulkarni @ 2022-06-28  0:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal


>   #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
> +/* Writes for sequential write required zones may be pipelined. */
> +#define QUEUE_FLAG_PIPELINE_ZONED_WRITES 31
>   
>   #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
>   				 (1 << QUEUE_FLAG_SAME_COMP) |		\
> @@ -623,6 +625,11 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>   #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
>   #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
>   
> +static inline bool blk_queue_pipeline_zoned_writes(struct request_queue *q)
> +{
> +	return test_bit(QUEUE_FLAG_PIPELINE_ZONED_WRITES, &(q)->queue_flags);
> +}
> +

 From the comments that I've received, a new helper or a flag should be
part of the patch that actually uses it. Is there a specific reason that
this has been made a separate patch?

-ck



* Re: [PATCH v3 5/8] block/null_blk: Refactor null_queue_rq()
From: Chaitanya Kulkarni @ 2022-06-28  0:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal

On 6/27/22 16:43, Bart Van Assche wrote:
> Introduce a local variable for the expression bd->rq since that expression
> occurs multiple times. This patch does not change any functionality.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck



* Re: [PATCH v3 6/8] block/null_blk: Add support for pipelining zoned writes
From: Chaitanya Kulkarni @ 2022-06-28  0:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, Jens Axboe, Jaegeuk Kim, Damien Le Moal

On 6/27/22 16:43, Bart Van Assche wrote:
> Add a new configfs attribute for enabling pipelining of zoned writes. If
> that attribute has been set, retry zoned writes that are not aligned with
> the write pointer. The test script below reports 244 K IOPS with no I/O
> scheduler, 5.79 K IOPS with mq-deadline and pipelining disabled and 79.6 K
> IOPS with mq-deadline and pipelining enabled. This shows that pipelining
> results in about 14 times more IOPS for this particular test case.
> 
>      #!/bin/bash
> 
>      for mode in "none 0" "mq-deadline 0" "mq-deadline 1"; do
>          set +e
>          for d in /sys/kernel/config/nullb/*; do
>              [ -d "$d" ] && rmdir "$d"
>          done
>          modprobe -r null_blk
>          set -e
>          read -r iosched pipelining <<<"$mode"
>          modprobe null_blk nr_devices=0
>          (
>              cd /sys/kernel/config/nullb
>              mkdir nullb0
>              cd nullb0
>              params=(
>                  completion_nsec=100000
>                  hw_queue_depth=64
>                  irqmode=2                # NULL_IRQ_TIMER
>                  max_sectors=$((4096/512))
>                  memory_backed=1
>                  pipeline_zoned_writes="${pipelining}"
>                  size=1
>                  submit_queues=1
>                  zone_size=1
>                  zoned=1
>                  power=1
>              )
>              for p in "${params[@]}"; do
>                  echo "${p//*=}" > "${p//=*}"
>              done
>          )
>          udevadm settle
>          dev=/dev/nullb0
>          [ -b "${dev}" ]
>          params=(
>              --direct=1
>              --filename="${dev}"
>              --iodepth=64
>              --iodepth_batch=16
>              --ioengine=io_uring
>              --ioscheduler="${iosched}"
>              --gtod_reduce=1
>              --hipri=0
>              --name=nullb0
>              --runtime=30
>              --rw=write
>              --time_based=1
>              --zonemode=zbd
>          )
>          fio "${params[@]}"
>      done
> 

Is it possible to move this into the blktests ?

-ck



* Re: [PATCH v3 7/8] nvme: Make the number of retries command specific
From: Chaitanya Kulkarni @ 2022-06-28  0:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Jaegeuk Kim,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni


> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 0da94b233fed..ca415cd9571e 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -160,6 +160,7 @@ struct nvme_request {
>   	union nvme_result	result;
>   	u8			genctr;
>   	u8			retries;
> +	u8			max_retries;
>   	u8			flags;
>   	u16			status;
>   	struct nvme_ctrl	*ctrl;

If I understand correctly, the per-command max_retries count is only
needed for zoned devices.

Why not make the struct nvme_request->max_retries field and the code that
uses it conditional on CONFIG_BLK_DEV_ZONED?

That would avoid increasing the size of struct nvme_request in the
!CONFIG_BLK_DEV_ZONED case, where a per-command max_retries has no use.

-ck



* Re: [PATCH v3 7/8] nvme: Make the number of retries command specific
From: Bart Van Assche @ 2022-06-28  2:48 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Jaegeuk Kim,
	Keith Busch, Sagi Grimberg

On 6/27/22 17:48, Chaitanya Kulkarni wrote:
> 
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index 0da94b233fed..ca415cd9571e 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -160,6 +160,7 @@ struct nvme_request {
>>    	union nvme_result	result;
>>    	u8			genctr;
>>    	u8			retries;
>> +	u8			max_retries;
>>    	u8			flags;
>>    	u16			status;
>>    	struct nvme_ctrl	*ctrl;
> 
> If I understand correctly, the per-command max_retries count is only
> needed for zoned devices.
> 
> Why not make the struct nvme_request->max_retries field and the code that
> uses it conditional on CONFIG_BLK_DEV_ZONED?
> 
> That would avoid increasing the size of struct nvme_request in the
> !CONFIG_BLK_DEV_ZONED case, where a per-command max_retries has no use.

Hi Chaitanya,

Thanks for the review.

We may disagree about whether or not this patch increases the size
of struct nvme_request. I think the new member fills an existing
hole and hence does not increase the size of struct nvme_request :-)

Bart.


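A sketch of the member layout that makes this work (illustrative offsets
for a 64-bit build, derived from the fields visible in the diff above):

    /*
     * Before (offsets in bytes):       After:
     *   u8  genctr;       8              u8  genctr;       8
     *   u8  retries;      9              u8  retries;      9
     *   u8  flags;       10              u8  max_retries; 10
     *   -- 1-byte hole -- 11             u8  flags;       11
     *   u16 status;      12              u16 status;      12
     *
     * The new u8 fills the former padding byte, so
     * sizeof(struct nvme_request) stays the same.
     */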

* Re: [PATCH v3 3/8] block: Introduce a request queue flag for pipelining zoned writes
From: Bart Van Assche @ 2022-06-28  2:49 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, Jens Axboe, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal

On 6/27/22 17:36, Chaitanya Kulkarni wrote:
> From the comments that I've received, a new helper or a flag should be
> part of the patch that actually uses it. Is there a specific reason that
> this has been made a separate patch?

Hi Chaitanya,

This is a separate patch to make this patch series easier to review.
I'm concerned that combining this patch with the next one would make
the result harder to verify than two separate patches.

Thanks,

Bart.


* Re: [PATCH v3 8/8] nvme: Enable pipelining of zoned writes
From: Keith Busch @ 2022-06-28  3:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Sagi Grimberg, Chaitanya Kulkarni

On Mon, Jun 27, 2022 at 04:43:35PM -0700, Bart Van Assche wrote:
> @@ -854,6 +854,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>  	if (req->cmd_flags & REQ_RAHEAD)
>  		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>  
> +	if (blk_queue_pipeline_zoned_writes(req->q) &&
> +	    blk_rq_is_seq_zone_write(req))
> +		nvme_req(req)->max_retries =
> +			min(0UL + type_max(typeof(nvme_req(req)->max_retries)),
> +			    nvme_req(req)->max_retries + req->q->nr_requests);

I can't make much sense of what the above is trying to accomplish. This
reevaluates max_retries every time the request is retried, and the new
max_retries is based on the previous max_retries?


* Re: [PATCH v3 8/8] nvme: Enable pipelining of zoned writes
From: Christoph Hellwig @ 2022-06-28  4:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni

On Mon, Jun 27, 2022 at 04:43:35PM -0700, Bart Van Assche wrote:
> Enable pipelining of zoned writes. Increase the number of retries
> for zoned writes to the maximum number of outstanding commands per hardware
> queue.

How is this supposed to work?  NVMe controllers are completely free
to reorder.  It also doesn't make sense as all zoned writes in Linux
either use zone append or block layer based zone locking.


* Re: [PATCH v3 8/8] nvme: Enable pipelining of zoned writes
From: Chaitanya Kulkarni @ 2022-06-28  9:00 UTC (permalink / raw)
  To: Keith Busch, Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Sagi Grimberg, Chaitanya Kulkarni

On 6/27/2022 8:16 PM, Keith Busch wrote:
> On Mon, Jun 27, 2022 at 04:43:35PM -0700, Bart Van Assche wrote:
>> @@ -854,6 +854,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>>   	if (req->cmd_flags & REQ_RAHEAD)
>>   		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>>   
>> +	if (blk_queue_pipeline_zoned_writes(req->q) &&
>> +	    blk_rq_is_seq_zone_write(req))
>> +		nvme_req(req)->max_retries =
>> +			min(0UL + type_max(typeof(nvme_req(req)->max_retries)),
>> +			    nvme_req(req)->max_retries + req->q->nr_requests);
> 
> I can't make much sense of what the above is trying to accomplish. This
> reevaluates max_retries every time the request is retried, and the new
> max_retries is based on the previous max_retries?

I also had a hard time quickly reading the code, but that may just be me.

Perhaps a well-documented comment, or a helper with a comment explaining
the logic, would be helpful here?

-ck




* Re: [PATCH v3 6/8] block/null_blk: Add support for pipelining zoned writes
From: Bart Van Assche @ 2022-06-28 16:17 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, Christoph Hellwig, Jens Axboe, Jaegeuk Kim, Damien Le Moal

On 6/27/22 17:39, Chaitanya Kulkarni wrote:
> Is it possible to move this into the blktests ?

That sounds like a good idea to me. I will look into that after this 
patch series has been merged.

Thanks,

Bart.



* Re: [PATCH v3 8/8] nvme: Enable pipelining of zoned writes
From: Bart Van Assche @ 2022-06-28 16:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Keith Busch, Sagi Grimberg,
	Chaitanya Kulkarni

On 6/27/22 21:49, Christoph Hellwig wrote:
> On Mon, Jun 27, 2022 at 04:43:35PM -0700, Bart Van Assche wrote:
>> Enable pipelining of zoned writes. Increase the number of retries
>> for zoned writes to the maximum number of outstanding commands per hardware
>> queue.
> 
> How is this supposed to work?  NVMe controllers are completely free
> to reorder.  It also doesn't make sense as all zoned writes in Linux
> either use zone append or block layer based zone locking.

Agreed that the NVMe specification allows reordering of outstanding
commands. Are there any NVMe controllers that do this if multiple zoned
write commands are outstanding for a single zone? I do not expect that 
an NVMe controller would reorder write commands in such a way that an 
I/O error is introduced.

Regarding zoned writes in Linux, Android devices use F2FS. F2FS submits 
regular writes (REQ_OP_WRITE) to zoned devices. Patch 4/8 from this 
patch series disables zone locking in the mq-deadline scheduler if the 
block driver has set the QUEUE_FLAG_PIPELINE_ZONED_WRITES flag.

Thanks,

Bart.


* Re: [PATCH v3 8/8] nvme: Enable pipelining of zoned writes
From: Bart Van Assche @ 2022-06-28 17:44 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Sagi Grimberg, Chaitanya Kulkarni

On 6/27/22 20:16, Keith Busch wrote:
> On Mon, Jun 27, 2022 at 04:43:35PM -0700, Bart Van Assche wrote:
>> @@ -854,6 +854,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>>   	if (req->cmd_flags & REQ_RAHEAD)
>>   		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>>   
>> +	if (blk_queue_pipeline_zoned_writes(req->q) &&
>> +	    blk_rq_is_seq_zone_write(req))
>> +		nvme_req(req)->max_retries =
>> +			min(0UL + type_max(typeof(nvme_req(req)->max_retries)),
>> +			    nvme_req(req)->max_retries + req->q->nr_requests);
> 
> I can't make much sense of what the above is trying to accomplish. This
> reevaluates max_retries every time the request is retried, and the new
> max_retries is based on the previous max_retries?

Hi Keith,

Thanks for your question. It made me realize that the above code should be
moved. How about replacing patch 8/8 with the (untested) patch below?

Thanks,

Bart.


Subject: [PATCH] nvme: Enable pipelining of zoned writes

Enable pipelining of zoned writes. Increase the number of retries
for zoned writes to the maximum number of outstanding commands per hardware
queue.


diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index aeeaca1c3197..6a0b824a343f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -630,10 +630,15 @@ EXPORT_SYMBOL_NS_GPL(nvme_put_ns, NVME_TARGET_PASSTHRU);

  static inline void nvme_clear_nvme_request(struct request *req)
  {
+	u32 max_retries = nvme_max_retries;
+
  	nvme_req(req)->status = 0;
  	nvme_req(req)->retries = 0;
  	nvme_req(req)->flags = 0;
-	nvme_req(req)->max_retries = nvme_max_retries;
+	if (blk_queue_pipeline_zoned_writes(req->q) &&
+	    blk_rq_is_seq_zone_write(req))
+		max_retries += req->q->nr_requests;
+	nvme_req(req)->max_retries = min(0xffU, max_retries);
  	req->rq_flags |= RQF_DONTPREP;
  }

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 9f81beb4df4e..0b10db3b8d3a 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -54,6 +54,8 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
  	struct nvme_id_ns_zns *id;
  	int status;

+	blk_queue_flag_set(QUEUE_FLAG_PIPELINE_ZONED_WRITES, ns->queue);
+
  	/* Driver requires zone append support */
  	if ((le32_to_cpu(log->iocs[nvme_cmd_zone_append]) &
  			NVME_CMD_EFFECTS_CSUPP)) {

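Note that nvme_clear_nvme_request() runs at most once per request: its
caller skips it when RQF_DONTPREP is already set, so this version computes
the retry budget once at prepare time instead of on every retry.
Paraphrasing the existing guard in nvme_setup_cmd() (a sketch of the
upstream logic as of this series' base):

    if (!(req->rq_flags & RQF_DONTPREP))
            nvme_clear_nvme_request(req);   /* sets RQF_DONTPREP */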

* Re: [PATCH v3 8/8] nvme: Enable pipelining of zoned writes
From: Christoph Hellwig @ 2022-06-29  5:51 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni

On Tue, Jun 28, 2022 at 09:30:09AM -0700, Bart Van Assche wrote:
> Regarding zoned writes in Linux, Android devices use F2FS. F2FS submits 
> regular writes (REQ_OP_WRITE) to zoned devices. Patch 4/8 from this patch 
> series disables zone locking in the mq-deadline scheduler if the block 
> driver has set the QUEUE_FLAG_PIPELINE_ZONED_WRITES flag.

Please fix f2fs to use zone appends rather than working around the stack.



* Re: [PATCH v3 8/8] nvme: Enable pipelining of zoned writes
From: Christoph Hellwig @ 2022-06-29  6:10 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim,
	Keith Busch, Sagi Grimberg, Chaitanya Kulkarni

[looks like my previous reply accidentally lost the first half,
 so here it is]

On Tue, Jun 28, 2022 at 09:30:09AM -0700, Bart Van Assche wrote:
> Agreed that the NVMe specification allows reordering of outstanding commands. 
> Are there any NVMe controllers that do this if multiple zoned write 
> commands are outstanding for a single zone? I do not expect that an NVMe 
> controller would reorder write commands in such a way that an I/O error is 
> introduced.

NVMe not only allows reordering, but actually requires it due to the
round robin command arbitration. Moreover, once you are on IP based
transports there is plenty of reordering that can and will go on before
even reaching the controller.

