* [PATCH v2 0/6] Improve zoned storage write performance
@ 2022-06-23 23:25 Bart Van Assche
  2022-06-23 23:25 ` [PATCH v2 1/6] block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq() Bart Van Assche
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Bart Van Assche @ 2022-06-23 23:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hi Jens,

Measurements have shown that limiting the queue depth to one per sequential
zone has a significant negative performance impact on zoned UFS devices. Hence
this patch series, which increases the queue depth for write commands directed
at sequential zones when the mq-deadline scheduler is used.

SCSI and UFS patches will be submitted at a later time (after multiqueue
support is available for the UFS driver).

Please consider this patch series for kernel v5.20.

Thanks,

Bart.

Changes compared to v1:
- Left out the SCSI and NVMe patches.
- Added a null_blk patch.
- Included measurement results.

Bart Van Assche (6):
  block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq()
  block: Introduce the blk_rq_is_zoned_seq_write() function
  block: Introduce a request queue flag for pipelining zoned writes
  block/mq-deadline: Only use zone locking if necessary
  block/null_blk: Refactor null_queue_rq()
  block/null_blk: Add support for pipelining zoned writes

 block/blk-zoned.c                 | 17 +++--------------
 block/mq-deadline.c               | 15 +++++++++------
 drivers/block/null_blk/main.c     | 30 ++++++++++++++++++++----------
 drivers/block/null_blk/null_blk.h |  3 +++
 drivers/block/null_blk/zoned.c    |  4 +++-
 include/linux/blk-mq.h            | 30 ++++++++++++++++++++++++++++++
 include/linux/blkdev.h            | 16 ++++++++++++++++
 7 files changed, 84 insertions(+), 31 deletions(-)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/6] block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq()
  2022-06-23 23:25 [PATCH v2 0/6] Improve zoned storage write performance Bart Van Assche
@ 2022-06-23 23:25 ` Bart Van Assche
  2022-06-24  0:10   ` Damien Le Moal
  2022-06-23 23:25 ` [PATCH v2 2/6] block: Introduce the blk_rq_is_zoned_seq_write() function Bart Van Assche
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2022-06-23 23:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal

Since it is nontrivial to figure out how blk_queue_zone_is_seq() and
blk_rq_zone_is_seq() handle sequential write preferred zones, document
this.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/linux/blk-mq.h | 7 +++++++
 include/linux/blkdev.h | 9 +++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e2d9daf7e8dd..909d47e34b7c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1124,6 +1124,13 @@ static inline unsigned int blk_rq_zone_no(struct request *rq)
 	return blk_queue_zone_no(rq->q, blk_rq_pos(rq));
 }
 
+/**
+ * blk_rq_zone_is_seq() - Whether a request is for a sequential zone.
+ * @rq: Request pointer.
+ *
+ * Return: true if and only if blk_rq_pos(@rq) refers either to a sequential
+ * write required or a sequential write preferred zone.
+ */
 static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
 {
 	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 92b3bffad328..2904100d2485 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -688,6 +688,15 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 	return sector >> ilog2(q->limits.chunk_sectors);
 }
 
+/**
+ * blk_queue_zone_is_seq() - Whether a logical block is in a sequential zone.
+ * @q: Request queue pointer.
+ * @sector: Offset from start of block device in 512 byte units.
+ *
+ * Return: true if and only if @q is associated with a zoned block device and
+ * @sector refers either to a sequential write required or a sequential write
+ * preferred zone.
+ */
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
 					 sector_t sector)
 {


* [PATCH v2 2/6] block: Introduce the blk_rq_is_zoned_seq_write() function
  2022-06-23 23:25 [PATCH v2 0/6] Improve zoned storage write performance Bart Van Assche
  2022-06-23 23:25 ` [PATCH v2 1/6] block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq() Bart Van Assche
@ 2022-06-23 23:25 ` Bart Van Assche
  2022-06-24  0:15   ` Damien Le Moal
  2022-06-24  6:07   ` Christoph Hellwig
  2022-06-23 23:26 ` [PATCH v2 3/6] block: Introduce a request queue flag for pipelining zoned writes Bart Van Assche
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Bart Van Assche @ 2022-06-23 23:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal

Introduce a function that makes it easy to verify whether a write
request is for a sequential write required or sequential write preferred
zone. Simplify blk_req_needs_zone_write_lock() by using the new function.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c      | 14 +-------------
 include/linux/blk-mq.h | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 38cd840d8838..cafcbc508dfb 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -57,19 +57,7 @@ EXPORT_SYMBOL_GPL(blk_zone_cond_str);
  */
 bool blk_req_needs_zone_write_lock(struct request *rq)
 {
-	if (!rq->q->seq_zones_wlock)
-		return false;
-
-	if (blk_rq_is_passthrough(rq))
-		return false;
-
-	switch (req_op(rq)) {
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-		return blk_rq_zone_is_seq(rq);
-	default:
-		return false;
-	}
+	return rq->q->seq_zones_wlock && blk_rq_is_zoned_seq_write(rq);
 }
 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 909d47e34b7c..d5930797b84d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1136,6 +1136,24 @@ static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
 	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
 }
 
+/**
+ * blk_rq_is_zoned_seq_write() - Whether @rq is a write request for a sequential zone.
+ * @rq: Request to examine.
+ *
+ * In this context sequential zone means either a sequential write required or
+ * a sequential write preferred zone.
+ */
+static inline bool blk_rq_is_zoned_seq_write(struct request *rq)
+{
+	switch (req_op(rq)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+		return blk_rq_zone_is_seq(rq);
+	default:
+		return false;
+	}
+}
+
 bool blk_req_needs_zone_write_lock(struct request *rq);
 bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
@@ -1166,6 +1184,11 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 	return !blk_req_zone_is_write_locked(rq);
 }
 #else /* CONFIG_BLK_DEV_ZONED */
+static inline bool blk_rq_is_zoned_seq_write(struct request *rq)
+{
+	return false;
+}
+
 static inline bool blk_req_needs_zone_write_lock(struct request *rq)
 {
 	return false;


* [PATCH v2 3/6] block: Introduce a request queue flag for pipelining zoned writes
  2022-06-23 23:25 [PATCH v2 0/6] Improve zoned storage write performance Bart Van Assche
  2022-06-23 23:25 ` [PATCH v2 1/6] block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq() Bart Van Assche
  2022-06-23 23:25 ` [PATCH v2 2/6] block: Introduce the blk_rq_is_zoned_seq_write() function Bart Van Assche
@ 2022-06-23 23:26 ` Bart Van Assche
  2022-06-24  0:19   ` Damien Le Moal
  2022-06-23 23:26 ` [PATCH v2 4/6] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2022-06-23 23:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal

Writes in sequential write required zones must happen at the write
pointer. Even if the submitter of the write commands (e.g. a filesystem)
submits writes for sequential write required zones in order, the block
layer or the storage controller may reorder these write commands.

The zone locking mechanism in the mq-deadline I/O scheduler serializes
write commands for sequential zones. Some but not all storage controllers
require this serialization. Introduce a new flag such that block drivers
can request pipelining of writes for sequential write required zones.

An example of a storage controller standard that requires write
serialization is AHCI (Advanced Host Controller Interface). Submitting
commands to AHCI controllers happens by writing a bit pattern into a
register. Each set bit corresponds to an active command. This mechanism
does not preserve command ordering information.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/linux/blkdev.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2904100d2485..fcaa06b9c65a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -581,6 +581,8 @@ struct request_queue {
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
 #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
+/* Writes for sequential write required zones may be pipelined. */
+#define QUEUE_FLAG_PIPELINE_ZONED_WRITES 31
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP) |		\
@@ -624,6 +626,11 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
 #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
 
+static inline bool blk_queue_pipeline_zoned_writes(struct request_queue *q)
+{
+	return test_bit(QUEUE_FLAG_PIPELINE_ZONED_WRITES, &(q)->queue_flags);
+}
+
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
 


* [PATCH v2 4/6] block/mq-deadline: Only use zone locking if necessary
  2022-06-23 23:25 [PATCH v2 0/6] Improve zoned storage write performance Bart Van Assche
                   ` (2 preceding siblings ...)
  2022-06-23 23:26 ` [PATCH v2 3/6] block: Introduce a request queue flag for pipelining zoned writes Bart Van Assche
@ 2022-06-23 23:26 ` Bart Van Assche
  2022-06-24  0:24   ` Damien Le Moal
  2022-06-23 23:26 ` [PATCH v2 5/6] block/null_blk: Refactor null_queue_rq() Bart Van Assche
  2022-06-23 23:26 ` [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes Bart Van Assche
  5 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2022-06-23 23:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal

Measurements have shown that limiting the queue depth to one for zoned
writes has a significant negative performance impact on zoned UFS devices.
Hence this patch, which disables zone locking in the mq-deadline scheduler
for storage controllers that support pipelining zoned writes. This patch is
based on the following assumptions:
- Applications submit write requests to sequential write required zones
  in order.
- The I/O priority of all pipelined write requests is the same per zone.
- If such write requests get reordered by the software or hardware queue
  mechanism, nr_hw_queues * nr_requests - 1 retries are sufficient to
  reorder the write requests.
- It happens infrequently that zoned write requests are reordered by the
  block layer.
- Either no I/O scheduler is used or an I/O scheduler is used that
  submits write requests per zone in LBA order.

See also commit 5700f69178e9 ("mq-deadline: Introduce zone locking
support").

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c   |  3 ++-
 block/mq-deadline.c | 15 +++++++++------
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index cafcbc508dfb..88a0610ba0c3 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -513,7 +513,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
-		if (!args->seq_zones_wlock) {
+		if (!blk_queue_pipeline_zoned_writes(q) &&
+		    !args->seq_zones_wlock) {
 			args->seq_zones_wlock =
 				blk_alloc_zone_bitmap(q->node, args->nr_zones);
 			if (!args->seq_zones_wlock)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 1a9e835e816c..8ab9694c8f3a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -292,7 +292,7 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		return NULL;
 
 	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || blk_queue_pipeline_zoned_writes(rq->q))
 		return rq;
 
 	/*
@@ -326,7 +326,7 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	if (!rq)
 		return NULL;
 
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || blk_queue_pipeline_zoned_writes(rq->q))
 		return rq;
 
 	/*
@@ -445,8 +445,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	}
 
 	/*
-	 * For a zoned block device, if we only have writes queued and none of
-	 * them can be dispatched, rq will be NULL.
+	 * For a zoned block device that requires write serialization, if we
+	 * only have writes queued and none of them can be dispatched, rq will
+	 * be NULL.
 	 */
 	if (!rq)
 		return NULL;
@@ -719,6 +720,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
 	struct dd_per_prio *per_prio;
 	enum dd_prio prio;
+	bool pipelined_seq_write = blk_queue_pipeline_zoned_writes(q) &&
+		blk_rq_is_zoned_seq_write(rq);
 	LIST_HEAD(free);
 
 	lockdep_assert_held(&dd->lock);
@@ -743,7 +746,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	trace_block_rq_insert(rq);
 
-	if (at_head) {
+	if (at_head && !pipelined_seq_write) {
 		list_add(&rq->queuelist, &per_prio->dispatch);
 		rq->fifo_time = jiffies;
 	} else {
@@ -823,7 +826,7 @@ static void dd_finish_request(struct request *rq)
 
 	atomic_inc(&per_prio->stats.completed);
 
-	if (blk_queue_is_zoned(q)) {
+	if (!blk_queue_pipeline_zoned_writes(q)) {
 		unsigned long flags;
 
 		spin_lock_irqsave(&dd->zone_lock, flags);


* [PATCH v2 5/6] block/null_blk: Refactor null_queue_rq()
  2022-06-23 23:25 [PATCH v2 0/6] Improve zoned storage write performance Bart Van Assche
                   ` (3 preceding siblings ...)
  2022-06-23 23:26 ` [PATCH v2 4/6] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
@ 2022-06-23 23:26 ` Bart Van Assche
  2022-06-24  0:26   ` Damien Le Moal
  2022-06-23 23:26 ` [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes Bart Van Assche
  5 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2022-06-23 23:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal

Introduce a local variable for the expression bd->rq since that expression
occurs multiple times. This patch does not change any functionality.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/block/null_blk/main.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 6b67088f4ea7..fd68e6f4637f 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1609,10 +1609,11 @@ static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res)
 static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 			 const struct blk_mq_queue_data *bd)
 {
-	struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	struct request *rq = bd->rq;
+	struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq);
 	struct nullb_queue *nq = hctx->driver_data;
-	sector_t nr_sectors = blk_rq_sectors(bd->rq);
-	sector_t sector = blk_rq_pos(bd->rq);
+	sector_t nr_sectors = blk_rq_sectors(rq);
+	sector_t sector = blk_rq_pos(rq);
 	const bool is_poll = hctx->type == HCTX_TYPE_POLL;
 
 	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
@@ -1621,14 +1622,14 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 		hrtimer_init(&cmd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 		cmd->timer.function = null_cmd_timer_expired;
 	}
-	cmd->rq = bd->rq;
+	cmd->rq = rq;
 	cmd->error = BLK_STS_OK;
 	cmd->nq = nq;
-	cmd->fake_timeout = should_timeout_request(bd->rq);
+	cmd->fake_timeout = should_timeout_request(rq);
 
-	blk_mq_start_request(bd->rq);
+	blk_mq_start_request(rq);
 
-	if (should_requeue_request(bd->rq)) {
+	if (should_requeue_request(rq)) {
 		/*
 		 * Alternate between hitting the core BUSY path, and the
 		 * driver driven requeue path
@@ -1637,21 +1638,21 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 		if (nq->requeue_selection & 1)
 			return BLK_STS_RESOURCE;
 		else {
-			blk_mq_requeue_request(bd->rq, true);
+			blk_mq_requeue_request(rq, true);
 			return BLK_STS_OK;
 		}
 	}
 
 	if (is_poll) {
 		spin_lock(&nq->poll_lock);
-		list_add_tail(&bd->rq->queuelist, &nq->poll_list);
+		list_add_tail(&rq->queuelist, &nq->poll_list);
 		spin_unlock(&nq->poll_lock);
 		return BLK_STS_OK;
 	}
 	if (cmd->fake_timeout)
 		return BLK_STS_OK;
 
-	return null_handle_cmd(cmd, sector, nr_sectors, req_op(bd->rq));
+	return null_handle_cmd(cmd, sector, nr_sectors, req_op(rq));
 }
 
 static void cleanup_queue(struct nullb_queue *nq)


* [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes
  2022-06-23 23:25 [PATCH v2 0/6] Improve zoned storage write performance Bart Van Assche
                   ` (4 preceding siblings ...)
  2022-06-23 23:26 ` [PATCH v2 5/6] block/null_blk: Refactor null_queue_rq() Bart Van Assche
@ 2022-06-23 23:26 ` Bart Van Assche
  2022-06-24  0:29   ` Damien Le Moal
  2022-06-24  6:06   ` Christoph Hellwig
  5 siblings, 2 replies; 20+ messages in thread
From: Bart Van Assche @ 2022-06-23 23:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal

Add a new configfs attribute for enabling pipelining of zoned writes. If
that attribute has been set, retry zoned writes that are not aligned with
the write pointer. The test script below reports 236 K IOPS with no I/O
scheduler, 81 K IOPS with mq-deadline and pipelining disabled, and 121 K
IOPS with mq-deadline and pipelining enabled (+49%).

    #!/bin/bash

    for d in /sys/kernel/config/nullb/*; do
        [ -d "$d" ] && rmdir "$d"
    done
    modprobe -r null_blk
    set -e
    modprobe null_blk nr_devices=0
    udevadm settle
    (
        cd /sys/kernel/config/nullb
        mkdir nullb0
        cd nullb0
        params=(
            completion_nsec=100000
            hw_queue_depth=64
            irqmode=2
            memory_backed=1
            pipeline_zoned_writes=1
            size=1
            submit_queues=1
            zone_size=1
            zoned=1
            power=1
        )
        for p in "${params[@]}"; do
            echo "${p//*=}" > "${p//=*}"
        done
    )
    params=(
        --direct=1
        --filename=/dev/nullb0
        --iodepth=64
        --iodepth_batch=16
        --ioengine=io_uring
        --ioscheduler=mq-deadline
        --hipri=0
        --name=nullb0
        --runtime=30
        --rw=write
        --time_based=1
        --zonemode=zbd
    )
    fio "${params[@]}"

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/block/null_blk/main.c     | 9 +++++++++
 drivers/block/null_blk/null_blk.h | 3 +++
 drivers/block/null_blk/zoned.c    | 4 +++-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index fd68e6f4637f..d5fc651ffc3d 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -408,6 +408,7 @@ NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
 NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
+NULLB_DEVICE_ATTR(pipeline_zoned_writes, bool, NULL);
 NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
 
 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
@@ -531,6 +532,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_zone_nr_conv,
 	&nullb_device_attr_zone_max_open,
 	&nullb_device_attr_zone_max_active,
+	&nullb_device_attr_pipeline_zoned_writes,
 	&nullb_device_attr_virt_boundary,
 	NULL,
 };
@@ -1626,6 +1628,11 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 	cmd->error = BLK_STS_OK;
 	cmd->nq = nq;
 	cmd->fake_timeout = should_timeout_request(rq);
+	if (!(rq->rq_flags & RQF_DONTPREP)) {
+		rq->rq_flags |= RQF_DONTPREP;
+		cmd->retries = 0;
+		cmd->max_attempts = rq->q->nr_hw_queues * rq->q->nr_requests;
+	}
 
 	blk_mq_start_request(rq);
 
@@ -2042,6 +2049,8 @@ static int null_add_dev(struct nullb_device *dev)
 	nullb->q->queuedata = nullb;
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, nullb->q);
 	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, nullb->q);
+	if (dev->pipeline_zoned_writes)
+		blk_queue_flag_set(QUEUE_FLAG_PIPELINE_ZONED_WRITES, nullb->q);
 
 	mutex_lock(&lock);
 	nullb->index = ida_simple_get(&nullb_indexes, 0, 0, GFP_KERNEL);
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 8359b43842f2..bbe2cb17bdbd 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -23,6 +23,8 @@ struct nullb_cmd {
 	unsigned int tag;
 	blk_status_t error;
 	bool fake_timeout;
+	u16 retries;
+	u16 max_attempts;
 	struct nullb_queue *nq;
 	struct hrtimer timer;
 };
@@ -112,6 +114,7 @@ struct nullb_device {
 	bool memory_backed; /* if data is stored in memory */
 	bool discard; /* if support discard */
 	bool zoned; /* if device is zoned */
+	bool pipeline_zoned_writes;
 	bool virt_boundary; /* virtual boundary on/off for the device */
 };
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 2fdd7b20c224..8d0a5e16f4b1 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -403,7 +403,9 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 		else
 			cmd->bio->bi_iter.bi_sector = sector;
 	} else if (sector != zone->wp) {
-		ret = BLK_STS_IOERR;
+		ret = dev->pipeline_zoned_writes &&
+			++cmd->retries < cmd->max_attempts ?
+			BLK_STS_DEV_RESOURCE : BLK_STS_IOERR;
 		goto unlock;
 	}
 


* Re: [PATCH v2 1/6] block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq()
  2022-06-23 23:25 ` [PATCH v2 1/6] block: Document blk_queue_zone_is_seq() and blk_rq_zone_is_seq() Bart Van Assche
@ 2022-06-24  0:10   ` Damien Le Moal
  0 siblings, 0 replies; 20+ messages in thread
From: Damien Le Moal @ 2022-06-24  0:10 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 6/24/22 08:25, Bart Van Assche wrote:
> Since it is nontrivial to figure out how blk_queue_zone_is_seq() and
> blk_rq_zone_is_seq() handle sequential write preferred zones, document
> this.
> 
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Looks good.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  include/linux/blk-mq.h | 7 +++++++
>  include/linux/blkdev.h | 9 +++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index e2d9daf7e8dd..909d47e34b7c 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1124,6 +1124,13 @@ static inline unsigned int blk_rq_zone_no(struct request *rq)
>  	return blk_queue_zone_no(rq->q, blk_rq_pos(rq));
>  }
>  
> +/**
> + * blk_rq_zone_is_seq() - Whether a request is for a sequential zone.
> + * @rq: Request pointer.
> + *
> + * Return: true if and only if blk_rq_pos(@rq) refers either to a sequential
> + * write required or a sequential write preferred zone.
> + */
>  static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
>  {
>  	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 92b3bffad328..2904100d2485 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -688,6 +688,15 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>  	return sector >> ilog2(q->limits.chunk_sectors);
>  }
>  
> +/**
> + * blk_queue_zone_is_seq() - Whether a logical block is in a sequential zone.
> + * @q: Request queue pointer.
> + * @sector: Offset from start of block device in 512 byte units.
> + *
> + * Return: true if and only if @q is associated with a zoned block device and
> + * @sector refers either to a sequential write required or a sequential write
> + * preferred zone.
> + */
>  static inline bool blk_queue_zone_is_seq(struct request_queue *q,
>  					 sector_t sector)
>  {


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 2/6] block: Introduce the blk_rq_is_zoned_seq_write() function
  2022-06-23 23:25 ` [PATCH v2 2/6] block: Introduce the blk_rq_is_zoned_seq_write() function Bart Van Assche
@ 2022-06-24  0:15   ` Damien Le Moal
  2022-06-24  6:07   ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Damien Le Moal @ 2022-06-24  0:15 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 6/24/22 08:25, Bart Van Assche wrote:
> Introduce a function that makes it easy to verify whether a write
> request is for a sequential write required or sequential write preferred
> zone. Simplify blk_req_needs_zone_write_lock() by using the new function.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Looks good.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  block/blk-zoned.c      | 14 +-------------
>  include/linux/blk-mq.h | 23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 38cd840d8838..cafcbc508dfb 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -57,19 +57,7 @@ EXPORT_SYMBOL_GPL(blk_zone_cond_str);
>   */
>  bool blk_req_needs_zone_write_lock(struct request *rq)
>  {
> -	if (!rq->q->seq_zones_wlock)
> -		return false;
> -
> -	if (blk_rq_is_passthrough(rq))
> -		return false;
> -
> -	switch (req_op(rq)) {
> -	case REQ_OP_WRITE_ZEROES:
> -	case REQ_OP_WRITE:
> -		return blk_rq_zone_is_seq(rq);
> -	default:
> -		return false;
> -	}
> +	return rq->q->seq_zones_wlock && blk_rq_is_zoned_seq_write(rq);
>  }
>  EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 909d47e34b7c..d5930797b84d 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1136,6 +1136,24 @@ static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
>  	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
>  }
>  
> +/**
> + * blk_rq_is_zoned_seq_write() - Whether @rq is a write request for a sequential zone.
> + * @rq: Request to examine.
> + *
> + * In this context sequential zone means either a sequential write required or
> + * a sequential write preferred zone.
> + */
> +static inline bool blk_rq_is_zoned_seq_write(struct request *rq)
> +{
> +	switch (req_op(rq)) {
> +	case REQ_OP_WRITE:
> +	case REQ_OP_WRITE_ZEROES:
> +		return blk_rq_zone_is_seq(rq);
> +	default:
> +		return false;
> +	}
> +}
> +
>  bool blk_req_needs_zone_write_lock(struct request *rq);
>  bool blk_req_zone_write_trylock(struct request *rq);
>  void __blk_req_zone_write_lock(struct request *rq);
> @@ -1166,6 +1184,11 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
>  	return !blk_req_zone_is_write_locked(rq);
>  }
>  #else /* CONFIG_BLK_DEV_ZONED */
> +static inline bool blk_rq_is_zoned_seq_write(struct request *rq)
> +{
> +	return false;
> +}
> +
>  static inline bool blk_req_needs_zone_write_lock(struct request *rq)
>  {
>  	return false;


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 3/6] block: Introduce a request queue flag for pipelining zoned writes
  2022-06-23 23:26 ` [PATCH v2 3/6] block: Introduce a request queue flag for pipelining zoned writes Bart Van Assche
@ 2022-06-24  0:19   ` Damien Le Moal
  2022-06-24 16:29     ` Bart Van Assche
  0 siblings, 1 reply; 20+ messages in thread
From: Damien Le Moal @ 2022-06-24  0:19 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 6/24/22 08:26, Bart Van Assche wrote:
> Writes in sequential write required zones must happen at the write
> pointer. Even if the submitter of the write commands (e.g. a filesystem)
> submits writes for sequential write required zones in order, the block
> layer or the storage controller may reorder these write commands.
> 
> The zone locking mechanism in the mq-deadline I/O scheduler serializes
> write commands for sequential zones. Some but not all storage controllers
> require this serialization. Introduce a new flag such that block drivers
> can request pipelining of writes for sequential write required zones.
> 
> An example of a storage controller standard that requires write
> serialization is AHCI (Advanced Host Controller Interface). Submitting
> commands to AHCI controllers happens by writing a bit pattern into a
> register. Each set bit corresponds to an active command. This mechanism
> does not preserve command ordering information.
> 
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  include/linux/blkdev.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2904100d2485..fcaa06b9c65a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -581,6 +581,8 @@ struct request_queue {
>  #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
>  #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
>  #define QUEUE_FLAG_SQ_SCHED     30	/* single queue style io dispatch */
> +/* Writes for sequential write required zones may be pipelined. */
> +#define QUEUE_FLAG_PIPELINE_ZONED_WRITES 31
>  
>  #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
>  				 (1 << QUEUE_FLAG_SAME_COMP) |		\
> @@ -624,6 +626,11 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>  #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
>  #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
>  
> +static inline bool blk_queue_pipeline_zoned_writes(struct request_queue *q)
> +{
> +	return test_bit(QUEUE_FLAG_PIPELINE_ZONED_WRITES, &(q)->queue_flags);
> +}
> +

Since this flag will be set by an LLD to indicate that the LLD can handle
zoned write commands in order, I would suggest a different name. Something
like: QUEUE_FLAG_ORDERED_ZONED_WRITES ? And well, if the LLD says it can
do that for zoned writes, it likely means that it would be the same for
any command, so the flag could be generalized and named
QUEUE_FLAG_ORDERED_CMD or something like that.

>  extern void blk_set_pm_only(struct request_queue *q);
>  extern void blk_clear_pm_only(struct request_queue *q);
>  


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 4/6] block/mq-deadline: Only use zone locking if necessary
  2022-06-23 23:26 ` [PATCH v2 4/6] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
@ 2022-06-24  0:24   ` Damien Le Moal
  0 siblings, 0 replies; 20+ messages in thread
From: Damien Le Moal @ 2022-06-24  0:24 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 6/24/22 08:26, Bart Van Assche wrote:
> Measurements have shown that limiting the queue depth to one for zoned
> writes has a significant negative performance impact on zoned UFS devices.
> Hence this patch that disables zone locking from the mq-deadline scheduler
> for storage controllers that support pipelining zoned writes. This patch is
> based on the following assumptions:
> - Applications submit write requests to sequential write required zones
>   in order.
> - The I/O priority of all pipelined write requests is the same per zone.
> - If such write requests get reordered by the software or hardware queue
>   mechanism, nr_hw_queues * nr_requests - 1 retries are sufficient to
>   reorder the write requests.
> - It happens infrequently that zoned write requests are reordered by the
>   block layer.
> - Either no I/O scheduler is used or an I/O scheduler is used that
>   submits write requests per zone in LBA order.
> 
> See also commit 5700f69178e9 ("mq-deadline: Introduce zone locking
> support").

I think this patch should be squashed together with the previous patch. It
would then be easier to see what effect the pipeline queue flag has.

> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-zoned.c   |  3 ++-
>  block/mq-deadline.c | 15 +++++++++------
>  2 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index cafcbc508dfb..88a0610ba0c3 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -513,7 +513,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  		break;
>  	case BLK_ZONE_TYPE_SEQWRITE_REQ:
>  	case BLK_ZONE_TYPE_SEQWRITE_PREF:
> -		if (!args->seq_zones_wlock) {
> +		if (!blk_queue_pipeline_zoned_writes(q) &&
> +		    !args->seq_zones_wlock) {
>  			args->seq_zones_wlock =
>  				blk_alloc_zone_bitmap(q->node, args->nr_zones);
>  			if (!args->seq_zones_wlock)
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 1a9e835e816c..8ab9694c8f3a 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -292,7 +292,7 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  		return NULL;
>  
>  	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || blk_queue_pipeline_zoned_writes(rq->q))

This change seems wrong. Before, both reads and writes can proceed for
regular disks. After, only reads can proceed, assuming that the regular
device does not have pipelined zoned writes enabled.

>  		return rq;
>  
>  	/*
> @@ -326,7 +326,7 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  	if (!rq)
>  		return NULL;
>  
> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || blk_queue_pipeline_zoned_writes(rq->q))

same here.

>  		return rq;
>  
>  	/*
> @@ -445,8 +445,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	}
>  
>  	/*
> -	 * For a zoned block device, if we only have writes queued and none of
> -	 * them can be dispatched, rq will be NULL.
> +	 * For a zoned block device that requires write serialization, if we
> +	 * only have writes queued and none of them can be dispatched, rq will
> +	 * be NULL.
>  	 */
>  	if (!rq)
>  		return NULL;
> @@ -719,6 +720,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
>  	struct dd_per_prio *per_prio;
>  	enum dd_prio prio;
> +	bool pipelined_seq_write = blk_queue_pipeline_zoned_writes(q) &&
> +		blk_rq_is_zoned_seq_write(rq);
>  	LIST_HEAD(free);
>  
>  	lockdep_assert_held(&dd->lock);
> @@ -743,7 +746,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  
>  	trace_block_rq_insert(rq);
>  
> -	if (at_head) {
> +	if (at_head && !pipelined_seq_write) {
>  		list_add(&rq->queuelist, &per_prio->dispatch);
>  		rq->fifo_time = jiffies;
>  	} else {
> @@ -823,7 +826,7 @@ static void dd_finish_request(struct request *rq)
>  
>  	atomic_inc(&per_prio->stats.completed);
>  
> -	if (blk_queue_is_zoned(q)) {
> +	if (!blk_queue_pipeline_zoned_writes(q)) {
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&dd->zone_lock, flags);


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 5/6] block/null_blk: Refactor null_queue_rq()
  2022-06-23 23:26 ` [PATCH v2 5/6] block/null_blk: Refactor null_queue_rq() Bart Van Assche
@ 2022-06-24  0:26   ` Damien Le Moal
  0 siblings, 0 replies; 20+ messages in thread
From: Damien Le Moal @ 2022-06-24  0:26 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 6/24/22 08:26, Bart Van Assche wrote:
> Introduce a local variable for the expression bd->rq since that expression
> occurs multiple times. This patch does not change any functionality.
> 
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Looks good. This is completely independent of the series goal though, so
it can be applied separately I think.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  drivers/block/null_blk/main.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index 6b67088f4ea7..fd68e6f4637f 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -1609,10 +1609,11 @@ static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res)
>  static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
>  			 const struct blk_mq_queue_data *bd)
>  {
> -	struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> +	struct request *rq = bd->rq;
> +	struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq);
>  	struct nullb_queue *nq = hctx->driver_data;
> -	sector_t nr_sectors = blk_rq_sectors(bd->rq);
> -	sector_t sector = blk_rq_pos(bd->rq);
> +	sector_t nr_sectors = blk_rq_sectors(rq);
> +	sector_t sector = blk_rq_pos(rq);
>  	const bool is_poll = hctx->type == HCTX_TYPE_POLL;
>  
>  	might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
> @@ -1621,14 +1622,14 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		hrtimer_init(&cmd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>  		cmd->timer.function = null_cmd_timer_expired;
>  	}
> -	cmd->rq = bd->rq;
> +	cmd->rq = rq;
>  	cmd->error = BLK_STS_OK;
>  	cmd->nq = nq;
> -	cmd->fake_timeout = should_timeout_request(bd->rq);
> +	cmd->fake_timeout = should_timeout_request(rq);
>  
> -	blk_mq_start_request(bd->rq);
> +	blk_mq_start_request(rq);
>  
> -	if (should_requeue_request(bd->rq)) {
> +	if (should_requeue_request(rq)) {
>  		/*
>  		 * Alternate between hitting the core BUSY path, and the
>  		 * driver driven requeue path
> @@ -1637,21 +1638,21 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		if (nq->requeue_selection & 1)
>  			return BLK_STS_RESOURCE;
>  		else {
> -			blk_mq_requeue_request(bd->rq, true);
> +			blk_mq_requeue_request(rq, true);
>  			return BLK_STS_OK;
>  		}
>  	}
>  
>  	if (is_poll) {
>  		spin_lock(&nq->poll_lock);
> -		list_add_tail(&bd->rq->queuelist, &nq->poll_list);
> +		list_add_tail(&rq->queuelist, &nq->poll_list);
>  		spin_unlock(&nq->poll_lock);
>  		return BLK_STS_OK;
>  	}
>  	if (cmd->fake_timeout)
>  		return BLK_STS_OK;
>  
> -	return null_handle_cmd(cmd, sector, nr_sectors, req_op(bd->rq));
> +	return null_handle_cmd(cmd, sector, nr_sectors, req_op(rq));
>  }
>  
>  static void cleanup_queue(struct nullb_queue *nq)


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes
  2022-06-23 23:26 ` [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes Bart Van Assche
@ 2022-06-24  0:29   ` Damien Le Moal
  2022-06-24  6:06   ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Damien Le Moal @ 2022-06-24  0:29 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 6/24/22 08:26, Bart Van Assche wrote:
> Add a new configfs attribute for enabling pipelining of zoned writes. If
> that attribute has been set, retry zoned writes that are not aligned with
> the write pointer. The test script below reports 236 K IOPS with no I/O
> scheduler, 81 K IOPS with mq-deadline and pipelining disabled and 121 K
> IOPS with mq-deadline and pipelining enabled (+49%).
> 
>     #!/bin/bash
> 
>     for d in /sys/kernel/config/nullb/*; do
>         [ -d "$d" ] && rmdir "$d"
>     done
>     modprobe -r null_blk
>     set -e
>     modprobe null_blk nr_devices=0
>     udevadm settle
>     (
>         cd /sys/kernel/config/nullb
>         mkdir nullb0
>         cd nullb0
>         params=(
> 	    completion_nsec=100000
> 	    hw_queue_depth=64
> 	    irqmode=2
> 	    memory_backed=1
> 	    pipeline_zoned_writes=1
> 	    size=1
> 	    submit_queues=1
> 	    zone_size=1
> 	    zoned=1
> 	    power=1
>         )
>         for p in "${params[@]}"; do
> 	    echo "${p//*=}" > "${p//=*}"
>         done
>     )
>     params=(
>         --direct=1
>         --filename=/dev/nullb0
>         --iodepth=64
>         --iodepth_batch=16
>         --ioengine=io_uring
>         --ioscheduler=mq-deadline
> 	--hipri=0
>         --name=nullb0
>         --runtime=30
>         --rw=write
>         --time_based=1
>         --zonemode=zbd
>     )
>     fio "${params[@]}"
> 
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/block/null_blk/main.c     | 9 +++++++++
>  drivers/block/null_blk/null_blk.h | 3 +++
>  drivers/block/null_blk/zoned.c    | 4 +++-
>  3 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index fd68e6f4637f..d5fc651ffc3d 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -408,6 +408,7 @@ NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
>  NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
>  NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
>  NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
> +NULLB_DEVICE_ATTR(pipeline_zoned_writes, bool, NULL);
>  NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
>  
>  static ssize_t nullb_device_power_show(struct config_item *item, char *page)
> @@ -531,6 +532,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
>  	&nullb_device_attr_zone_nr_conv,
>  	&nullb_device_attr_zone_max_open,
>  	&nullb_device_attr_zone_max_active,
> +	&nullb_device_attr_pipeline_zoned_writes,
>  	&nullb_device_attr_virt_boundary,
>  	NULL,
>  };
> @@ -1626,6 +1628,11 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	cmd->error = BLK_STS_OK;
>  	cmd->nq = nq;
>  	cmd->fake_timeout = should_timeout_request(rq);
> +	if (!(rq->rq_flags & RQF_DONTPREP)) {
> +		rq->rq_flags |= RQF_DONTPREP;
> +		cmd->retries = 0;
> +		cmd->max_attempts = rq->q->nr_hw_queues * rq->q->nr_requests;
> +	}
>  
>  	blk_mq_start_request(rq);
>  
> @@ -2042,6 +2049,8 @@ static int null_add_dev(struct nullb_device *dev)
>  	nullb->q->queuedata = nullb;
>  	blk_queue_flag_set(QUEUE_FLAG_NONROT, nullb->q);
>  	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, nullb->q);
> +	if (dev->pipeline_zoned_writes)
> +		blk_queue_flag_set(QUEUE_FLAG_PIPELINE_ZONED_WRITES, nullb->q);
>  
>  	mutex_lock(&lock);
>  	nullb->index = ida_simple_get(&nullb_indexes, 0, 0, GFP_KERNEL);
> diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
> index 8359b43842f2..bbe2cb17bdbd 100644
> --- a/drivers/block/null_blk/null_blk.h
> +++ b/drivers/block/null_blk/null_blk.h
> @@ -23,6 +23,8 @@ struct nullb_cmd {
>  	unsigned int tag;
>  	blk_status_t error;
>  	bool fake_timeout;
> +	u16 retries;
> +	u16 max_attempts;
>  	struct nullb_queue *nq;
>  	struct hrtimer timer;
>  };
> @@ -112,6 +114,7 @@ struct nullb_device {
>  	bool memory_backed; /* if data is stored in memory */
>  	bool discard; /* if support discard */
>  	bool zoned; /* if device is zoned */
> +	bool pipeline_zoned_writes;
>  	bool virt_boundary; /* virtual boundary on/off for the device */
>  };
>  
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index 2fdd7b20c224..8d0a5e16f4b1 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -403,7 +403,9 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
>  		else
>  			cmd->bio->bi_iter.bi_sector = sector;
>  	} else if (sector != zone->wp) {
> -		ret = BLK_STS_IOERR;
> +		ret = dev->pipeline_zoned_writes &&
> +			++cmd->retries < cmd->max_attempts ?
> +			BLK_STS_DEV_RESOURCE : BLK_STS_IOERR;

Hard to read. Could you change this to a plain "if()" ?


		if (dev->pipeline_zoned_writes &&
		    ++cmd->retries < cmd->max_attempts)
			ret = BLK_STS_DEV_RESOURCE;
		else
			ret = BLK_STS_IOERR;

Adding a comment on top of this hunk explaining the difference between the
2 cases would be nice too.

>  		goto unlock;
>  	}
>  


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes
  2022-06-23 23:26 ` [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes Bart Van Assche
  2022-06-24  0:29   ` Damien Le Moal
@ 2022-06-24  6:06   ` Christoph Hellwig
  2022-06-24 16:45     ` Bart Van Assche
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2022-06-24  6:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal

I don't think adding all the previous code just for null_blk is a good
idea.  We need a real user of it first.


* Re: [PATCH v2 2/6] block: Introduce the blk_rq_is_zoned_seq_write() function
  2022-06-23 23:25 ` [PATCH v2 2/6] block: Introduce the blk_rq_is_zoned_seq_write() function Bart Van Assche
  2022-06-24  0:15   ` Damien Le Moal
@ 2022-06-24  6:07   ` Christoph Hellwig
  2022-06-24 16:35     ` Bart Van Assche
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2022-06-24  6:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal

On Thu, Jun 23, 2022 at 04:25:59PM -0700, Bart Van Assche wrote:
> Introduce a function that makes it easy to verify whether a write
> request is for a sequential write required or sequential write preferred
> zone. Simplify blk_req_needs_zone_write_lock() by using the new function.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-zoned.c      | 14 +-------------
>  include/linux/blk-mq.h | 23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 38cd840d8838..cafcbc508dfb 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -57,19 +57,7 @@ EXPORT_SYMBOL_GPL(blk_zone_cond_str);
>   */
>  bool blk_req_needs_zone_write_lock(struct request *rq)
>  {
> -	if (!rq->q->seq_zones_wlock)
> -		return false;
> -
> -	if (blk_rq_is_passthrough(rq))
> -		return false;
> -
> -	switch (req_op(rq)) {
> -	case REQ_OP_WRITE_ZEROES:
> -	case REQ_OP_WRITE:
> -		return blk_rq_zone_is_seq(rq);
> -	default:
> -		return false;
> -	}
> +	return rq->q->seq_zones_wlock && blk_rq_is_zoned_seq_write(rq);
>  }
>  EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 909d47e34b7c..d5930797b84d 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1136,6 +1136,24 @@ static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
>  	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
>  }
>  
> +/**
> + * blk_rq_is_zoned_seq_write() - Whether @rq is a write request for a sequential zone.

This doesn't parse and is overly long at the same time.


* Re: [PATCH v2 3/6] block: Introduce a request queue flag for pipelining zoned writes
  2022-06-24  0:19   ` Damien Le Moal
@ 2022-06-24 16:29     ` Bart Van Assche
  0 siblings, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2022-06-24 16:29 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 6/23/22 17:19, Damien Le Moal wrote:
> On 6/24/22 08:26, Bart Van Assche wrote:
>> +static inline bool blk_queue_pipeline_zoned_writes(struct request_queue *q)
>> +{
>> +	return test_bit(QUEUE_FLAG_PIPELINE_ZONED_WRITES, &(q)->queue_flags);
>> +}
>> +
> 
> Since this flag will be set by an LLD to indicate that the LLD can handle
> zoned write commands in order, I would suggest a different name. Something
> like: QUEUE_FLAG_ORDERED_ZONED_WRITES ? And well, if the LLD says it can
> do that for zoned writes, it likely means that it would be the same for
> any command, so the flag could be generalized and named
> QUEUE_FLAG_ORDERED_CMD or something like that.

Zoned writes should always be submitted in order by the software that
generates them, so I think the name QUEUE_FLAG_ORDERED_ZONED_WRITES
would cause confusion. I'm concerned it would make other kernel
developers wonder whether it is OK for e.g. filesystems to submit zoned
writes out of order, which is not the case. I think the word
"pipelining" has a well-established meaning in computer science and
that in this context it reflects the intent correctly. See also
https://en.wikipedia.org/wiki/Pipeline_(computing).

Thanks,

Bart.


* Re: [PATCH v2 2/6] block: Introduce the blk_rq_is_zoned_seq_write() function
  2022-06-24  6:07   ` Christoph Hellwig
@ 2022-06-24 16:35     ` Bart Van Assche
  0 siblings, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2022-06-24 16:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Damien Le Moal

On 6/23/22 23:07, Christoph Hellwig wrote:
> On Thu, Jun 23, 2022 at 04:25:59PM -0700, Bart Van Assche wrote:
>> +/**
>> + * blk_rq_is_zoned_seq_write() - Whether @rq is a write request for a sequential zone.
> 
> This doesn't parse and is overly long at the same time.

Hi Christoph,

It is not clear to me why you wrote that "this doesn't parse". Are you
referring to the function name or to the function description? I think
the text to the right of the hyphen is grammatically correct.

Although I'm in favor of respecting the 80 column limit, I haven't found 
any explanation in Documentation/doc-guide/kernel-doc.rst about whether 
or not splitting the brief description across multiple lines is supported.

Thanks,

Bart.


* Re: [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes
  2022-06-24  6:06   ` Christoph Hellwig
@ 2022-06-24 16:45     ` Bart Van Assche
  2022-06-25  9:25       ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Bart Van Assche @ 2022-06-24 16:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Damien Le Moal

On 6/23/22 23:06, Christoph Hellwig wrote:
> I don't think adding all the previous code just for null_blk is a good
> idea.  We need a real user of it first.

Hi Christoph,

These patches work fine in our tests with zoned UFS prototypes, but
those tests were run with a UFSHCI 3.0 controller. While the UFSHCI 4.0
standard supports multiple queues and UFSHCI 4.0 controllers process
each queue in order, with UFSHCI 3.0 controllers submitting requests
happens by writing into a 32-bit doorbell register, just like for AHCI
controllers. So although our UFS tests pass, it cannot be guaranteed for
UFSHCI 3.0 controllers that requests are propagated to the UFS device in
the same order in which they were submitted to the controller.

I have not yet seen any UFSHCI 4.0 implementation. The upstream UFSHCI 
driver does not yet support it either.

Hence the choice to add support to the null_blk driver for setting the 
QUEUE_FLAG_PIPELINE_ZONED_WRITES flag.

Thanks,

Bart.


* Re: [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes
  2022-06-24 16:45     ` Bart Van Assche
@ 2022-06-25  9:25       ` Christoph Hellwig
  2022-06-26  0:23         ` Bart Van Assche
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2022-06-25  9:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal

On Fri, Jun 24, 2022 at 09:45:54AM -0700, Bart Van Assche wrote:
> I have not yet seen any UFSHCI 4.0 implementation. The upstream UFSHCI 
> driver does not yet support it either.
>
> Hence the choice to add support to the null_blk driver for setting the 
> QUEUE_FLAG_PIPELINE_ZONED_WRITES flag.

Also I don't think UFS even supports zoned devices as is?  Either way
without an in-tree driver for actually useful and existing hardware
we should not add extra code to the block layer.


* Re: [PATCH v2 6/6] block/null_blk: Add support for pipelining zoned writes
  2022-06-25  9:25       ` Christoph Hellwig
@ 2022-06-26  0:23         ` Bart Van Assche
  0 siblings, 0 replies; 20+ messages in thread
From: Bart Van Assche @ 2022-06-26  0:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Damien Le Moal

On 6/25/22 02:25, Christoph Hellwig wrote:
> On Fri, Jun 24, 2022 at 09:45:54AM -0700, Bart Van Assche wrote:
>> I have not yet seen any UFSHCI 4.0 implementation. The upstream UFSHCI
>> driver does not yet support it either.
>>
>> Hence the choice to add support to the null_blk driver for setting the
>> QUEUE_FLAG_PIPELINE_ZONED_WRITES flag.
> 
> Also I don't think UFS even supports zoned devices as is?

The following ballot was approved about two weeks ago and is on its way
to becoming a standard:
https://www.jedec.org/members/documents/84467/download (file name: 
Zoned_Storage_for_UFS_Ballot_v3; only JEDEC members have permission to 
download this document).

> Either way without an in-tree driver for actually useful and existing
> hardware we should not add extra code to the block layer.

There is a zoned NVMe SSD in my workstation. How about adding support 
for zoned NVMe SSDs in this patch series?

Thanks,

Bart.

