* [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices
@ 2023-04-18 22:39 Bart Van Assche
  2023-04-18 22:39 ` [PATCH v2 01/11] block: Simplify blk_req_needs_zone_write_lock() Bart Van Assche
                   ` (10 more replies)
  0 siblings, 11 replies; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche

Hi Jens,

This patch series improves support for zoned block devices in the mq-deadline
scheduler as follows:
* The order of requeued writes (REQ_OP_WRITE*) is preserved.
* The active zone limit is enforced.

Please consider this patch series for the next merge window.

Thanks,

Bart.

Changes compared to v1:
- Left out the patches related to request insertion and requeuing since
  Christoph is busy with reworking these patches.
- Added a patch for enforcing the active zone limit.

Bart Van Assche (11):
  block: Simplify blk_req_needs_zone_write_lock()
  block: Micro-optimize blk_req_needs_zone_write_lock()
  block: Introduce blk_rq_is_seq_zoned_write()
  block: mq-deadline: Simplify deadline_skip_seq_writes()
  block: mq-deadline: Improve deadline_skip_seq_writes()
  block: mq-deadline: Disable head insertion for zoned writes
  block: mq-deadline: Preserve write streams for all device types
  block: mq-deadline: Fix a race condition related to zoned writes
  block: mq-deadline: Handle requeued requests correctly
  block: Add support for the zone capacity concept
  block: mq-deadline: Respect the active zone limit

 Documentation/ABI/stable/sysfs-block |   8 ++
 block/Kconfig.iosched                |   4 +
 block/Makefile                       |   1 +
 block/blk-mq.h                       |   2 +-
 block/blk-settings.c                 |   1 +
 block/blk-sysfs.c                    |   7 ++
 block/blk-zoned.c                    |  43 +++++--
 block/mq-deadline-zoned.c            | 141 +++++++++++++++++++++
 block/mq-deadline-zoned.h            |  31 +++++
 block/mq-deadline.c                  | 177 ++++++++++++++-------------
 block/mq-deadline.h                  |  79 ++++++++++++
 include/linux/blk-mq.h               |   6 +
 include/linux/blkdev.h               |  18 ++-
 13 files changed, 423 insertions(+), 95 deletions(-)
 create mode 100644 block/mq-deadline-zoned.c
 create mode 100644 block/mq-deadline-zoned.h
 create mode 100644 block/mq-deadline.h

* [PATCH v2 01/11] block: Simplify blk_req_needs_zone_write_lock()
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
@ 2023-04-18 22:39 ` Bart Van Assche
  2023-04-19  4:09   ` Christoph Hellwig
  2023-04-18 22:39 ` [PATCH v2 02/11] block: Micro-optimize blk_req_needs_zone_write_lock() Bart Van Assche
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Remove the blk_rq_is_passthrough() check because it is redundant:
bdev_op_is_zoned_write() evaluates to false for pass-through requests.
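
To illustrate why (a sketch, not the actual kernel code): pass-through
requests carry the REQ_OP_DRV_IN or REQ_OP_DRV_OUT operation, so the
operation check inside bdev_op_is_zoned_write() can never match them:

	/* Hypothetical restatement of the operation check, for illustration. */
	static bool op_is_zoned_write(enum req_op op)
	{
		/* REQ_OP_DRV_IN / REQ_OP_DRV_OUT fail both comparisons. */
		return op == REQ_OP_WRITE || op == REQ_OP_WRITE_ZEROES;
	}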

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index fce9082384d6..835d9e937d4d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -57,9 +57,6 @@ EXPORT_SYMBOL_GPL(blk_zone_cond_str);
  */
 bool blk_req_needs_zone_write_lock(struct request *rq)
 {
-	if (blk_rq_is_passthrough(rq))
-		return false;
-
 	if (!rq->q->disk->seq_zones_wlock)
 		return false;
 

* [PATCH v2 02/11] block: Micro-optimize blk_req_needs_zone_write_lock()
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
  2023-04-18 22:39 ` [PATCH v2 01/11] block: Simplify blk_req_needs_zone_write_lock() Bart Van Assche
@ 2023-04-18 22:39 ` Bart Van Assche
  2023-04-19  4:11   ` Christoph Hellwig
  2023-04-18 22:39 ` [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write() Bart Van Assche
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Translate a request pointer into a request queue pointer with the
expression rq->q instead of the longer pointer chain
rq->q->disk->part0->bd_queue.
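
Both expressions below resolve to the same request queue; the longer
chain is shown only for comparison (a sketch, not part of the patch):

	struct request_queue *q1 = rq->q;
	struct request_queue *q2 = rq->q->disk->part0->bd_queue; /* same queue */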

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.h         | 2 +-
 block/blk-zoned.c      | 2 +-
 include/linux/blkdev.h | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index e876584d3516..6b5bc0b8d7b8 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -368,7 +368,7 @@ static inline struct blk_plug *blk_mq_plug( struct bio *bio)
 {
 	/* Zoned block device write operation case: do not plug the BIO */
 	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-	    bdev_op_is_zoned_write(bio->bi_bdev, bio_op(bio)))
+	    queue_op_is_zoned_write(bdev_get_queue(bio->bi_bdev), bio_op(bio)))
 		return NULL;
 
 	/*
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 835d9e937d4d..c93a26ce4670 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -60,7 +60,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 	if (!rq->q->disk->seq_zones_wlock)
 		return false;
 
-	if (bdev_op_is_zoned_write(rq->q->disk->part0, req_op(rq)))
+	if (queue_op_is_zoned_write(rq->q, req_op(rq)))
 		return blk_rq_zone_is_seq(rq);
 
 	return false;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e3242e67a8e3..261538319bbf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1284,10 +1284,10 @@ static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
 	return disk_zone_no(bdev->bd_disk, sec);
 }
 
-static inline bool bdev_op_is_zoned_write(struct block_device *bdev,
-					  blk_opf_t op)
+static inline bool queue_op_is_zoned_write(struct request_queue *q,
+					   enum req_op op)
 {
-	if (!bdev_is_zoned(bdev))
+	if (!blk_queue_is_zoned(q))
 		return false;
 
 	return op == REQ_OP_WRITE || op == REQ_OP_WRITE_ZEROES;

* [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write()
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
  2023-04-18 22:39 ` [PATCH v2 01/11] block: Simplify blk_req_needs_zone_write_lock() Bart Van Assche
  2023-04-18 22:39 ` [PATCH v2 02/11] block: Micro-optimize blk_req_needs_zone_write_lock() Bart Van Assche
@ 2023-04-18 22:39 ` Bart Van Assche
  2023-04-19  4:50   ` Christoph Hellwig
  2023-04-18 22:39 ` [PATCH v2 04/11] block: mq-deadline: Simplify deadline_skip_seq_writes() Bart Van Assche
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Introduce the function blk_rq_is_seq_zoned_write(). Later patches will
use this function to identify zoned writes whose submission order must
be preserved.
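
For reference, the behavior of the new helper for a few request types
(a comment sketch, not part of the patch):

	/*
	 * REQ_OP_WRITE to a sequential write required zone -> true
	 * REQ_OP_WRITE to a conventional zone              -> false
	 * REQ_OP_ZONE_APPEND                               -> false
	 *   (the device picks the write location, so ordering is moot)
	 * REQ_OP_READ                                      -> false
	 */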

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c      | 25 +++++++++++++++++++++----
 include/linux/blk-mq.h |  6 ++++++
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index c93a26ce4670..9b9cd6adfd1b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -52,6 +52,26 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
 }
 EXPORT_SYMBOL_GPL(blk_zone_cond_str);
 
+/**
+ * blk_rq_is_seq_zoned_write() - Whether @rq is a zoned write for which the order matters.
+ * @rq: Request to examine.
+ *
+ * In this context, a sequential zone means either a sequential write
+ * required zone or a sequential write preferred zone.
+ */
+bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+	switch (req_op(rq)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+		return blk_rq_zone_is_seq(rq);
+	case REQ_OP_ZONE_APPEND:
+	default:
+		return false;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_rq_is_seq_zoned_write);
+
 /*
  * Return true if a request is a write request that needs zone write locking.
  */
@@ -60,10 +80,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 	if (!rq->q->disk->seq_zones_wlock)
 		return false;
 
-	if (queue_op_is_zoned_write(rq->q, req_op(rq)))
-		return blk_rq_zone_is_seq(rq);
-
-	return false;
+	return blk_rq_is_seq_zoned_write(rq);
 }
 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 06caacd77ed6..e498b85bc470 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1164,6 +1164,7 @@ static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
 	return disk_zone_is_seq(rq->q->disk, blk_rq_pos(rq));
 }
 
+bool blk_rq_is_seq_zoned_write(struct request *rq);
 bool blk_req_needs_zone_write_lock(struct request *rq);
 bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
@@ -1194,6 +1195,11 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 	return !blk_req_zone_is_write_locked(rq);
 }
 #else /* CONFIG_BLK_DEV_ZONED */
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+	return false;
+}
+
 static inline bool blk_req_needs_zone_write_lock(struct request *rq)
 {
 	return false;

* [PATCH v2 04/11] block: mq-deadline: Simplify deadline_skip_seq_writes()
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
                   ` (2 preceding siblings ...)
  2023-04-18 22:39 ` [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write() Bart Van Assche
@ 2023-04-18 22:39 ` Bart Van Assche
  2023-04-19  4:52   ` Christoph Hellwig
  2023-04-18 22:39 ` [PATCH v2 05/11] block: mq-deadline: Improve deadline_skip_seq_writes() Bart Van Assche
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Make the deadline_skip_seq_writes() code shorter without changing its
functionality.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 5839a027e0f0..3dde720e2f59 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -310,12 +310,9 @@ static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
 						struct request *rq)
 {
 	sector_t pos = blk_rq_pos(rq);
-	sector_t skipped_sectors = 0;
 
-	while (rq) {
-		if (blk_rq_pos(rq) != pos + skipped_sectors)
-			break;
-		skipped_sectors += blk_rq_sectors(rq);
+	while (rq && blk_rq_pos(rq) == pos) {
+		pos += blk_rq_sectors(rq);
 		rq = deadline_latter_request(rq);
 	}
 

* [PATCH v2 05/11] block: mq-deadline: Improve deadline_skip_seq_writes()
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
                   ` (3 preceding siblings ...)
  2023-04-18 22:39 ` [PATCH v2 04/11] block: mq-deadline: Simplify deadline_skip_seq_writes() Bart Van Assche
@ 2023-04-18 22:39 ` Bart Van Assche
  2023-04-18 22:39 ` [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes Bart Van Assche
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Make deadline_skip_seq_writes() do what its name suggests, namely to
skip sequential writes.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 3dde720e2f59..4a0a269db382 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -311,7 +311,7 @@ static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
 {
 	sector_t pos = blk_rq_pos(rq);
 
-	while (rq && blk_rq_pos(rq) == pos) {
+	while (rq && blk_rq_pos(rq) == pos && blk_rq_is_seq_zoned_write(rq)) {
 		pos += blk_rq_sectors(rq);
 		rq = deadline_latter_request(rq);
 	}

* [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
                   ` (4 preceding siblings ...)
  2023-04-18 22:39 ` [PATCH v2 05/11] block: mq-deadline: Improve deadline_skip_seq_writes() Bart Van Assche
@ 2023-04-18 22:39 ` Bart Van Assche
  2023-04-19  4:30   ` Christoph Hellwig
  2023-04-18 22:39 ` [PATCH v2 07/11] block: mq-deadline: Preserve write streams for all device types Bart Van Assche
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Make sure that zoned writes (REQ_OP_WRITE and REQ_OP_WRITE_ZEROES) are
submitted in LBA order. This patch does not affect REQ_OP_ZONE_APPEND
requests.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 4a0a269db382..a73e16d3f8ac 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -796,7 +796,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	trace_block_rq_insert(rq);
 
-	if (flags & BLK_MQ_INSERT_AT_HEAD) {
+	if ((flags & BLK_MQ_INSERT_AT_HEAD) && !blk_rq_is_seq_zoned_write(rq)) {
 		list_add(&rq->queuelist, &per_prio->dispatch);
 		rq->fifo_time = jiffies;
 	} else {

* [PATCH v2 07/11] block: mq-deadline: Preserve write streams for all device types
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
                   ` (5 preceding siblings ...)
  2023-04-18 22:39 ` [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes Bart Van Assche
@ 2023-04-18 22:39 ` Bart Van Assche
  2023-04-18 22:39 ` [PATCH v2 08/11] block: mq-deadline: Fix a race condition related to zoned writes Bart Van Assche
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

For SSDs, preserving write streams reduces how many zones are active at
the same time. This is important for devices that restrict the number
of active zones.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index a73e16d3f8ac..3122c471f473 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -381,16 +381,14 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	 * an unlocked target zone. For some HDDs, breaking a sequential
 	 * write stream can lead to lower throughput, so make sure to preserve
 	 * sequential write streams, even if that stream crosses into the next
-	 * zones and these zones are unlocked.
+	 * zones and these zones are unlocked. For SSDs, do not break write
+	 * streams to minimize the number of active zones.
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
 	while (rq) {
 		if (blk_req_can_dispatch_to_zone(rq))
 			break;
-		if (blk_queue_nonrot(rq->q))
-			rq = deadline_latter_request(rq);
-		else
-			rq = deadline_skip_seq_writes(dd, rq);
+		rq = deadline_skip_seq_writes(dd, rq);
 	}
 	spin_unlock_irqrestore(&dd->zone_lock, flags);
 

* [PATCH v2 08/11] block: mq-deadline: Fix a race condition related to zoned writes
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
                   ` (6 preceding siblings ...)
  2023-04-18 22:39 ` [PATCH v2 07/11] block: mq-deadline: Preserve write streams for all device types Bart Van Assche
@ 2023-04-18 22:39 ` Bart Van Assche
  2023-04-19  5:07   ` Christoph Hellwig
  2023-04-18 22:40 ` [PATCH v2 09/11] block: mq-deadline: Handle requeued requests correctly Bart Van Assche
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Let deadline_next_request() only consider the first zoned write per
zone. This patch fixes a race condition between deadline_next_request()
and completion of zoned writes.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 3122c471f473..32a2cc013ed3 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -302,6 +302,7 @@ static bool deadline_is_seq_write(struct deadline_data *dd, struct request *rq)
 	return blk_rq_pos(prev) + blk_rq_sectors(prev) == blk_rq_pos(rq);
 }
 
+#ifdef CONFIG_BLK_DEV_ZONED
 /*
  * Skip all write requests that are sequential from @rq, even if we cross
  * a zone boundary.
@@ -318,6 +319,7 @@ static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
 
 	return rq;
 }
+#endif
 
 /*
  * For the specified data direction, return the next request to
@@ -386,9 +388,25 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
 	while (rq) {
+		unsigned int zno __maybe_unused;
+
 		if (blk_req_can_dispatch_to_zone(rq))
 			break;
+
+#ifdef CONFIG_BLK_DEV_ZONED
+		zno = blk_rq_zone_no(rq);
+
 		rq = deadline_skip_seq_writes(dd, rq);
+
+		/*
+		 * Skip all other write requests for the zone with zone number
+		 * 'zno'. This prevents this function from selecting a zoned write
+		 * that is not the first write for a given zone.
+		 */
+		while (rq && blk_rq_zone_no(rq) == zno &&
+		       blk_rq_is_seq_zoned_write(rq))
+			rq = deadline_latter_request(rq);
+#endif
 	}
 	spin_unlock_irqrestore(&dd->zone_lock, flags);
 

* [PATCH v2 09/11] block: mq-deadline: Handle requeued requests correctly
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
                   ` (7 preceding siblings ...)
  2023-04-18 22:39 ` [PATCH v2 08/11] block: mq-deadline: Fix a race condition related to zoned writes Bart Van Assche
@ 2023-04-18 22:40 ` Bart Van Assche
  2023-04-19  5:07   ` Christoph Hellwig
  2023-04-18 22:40 ` [PATCH v2 10/11] block: Add support for the zone capacity concept Bart Van Assche
  2023-04-18 22:40 ` [PATCH v2 11/11] block: mq-deadline: Respect the active zone limit Bart Van Assche
  10 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

If a zoned write is requeued with an LBA that is lower than that of
already inserted zoned writes for the same zone, make sure that it is
submitted first. For example, if the write at the start of a zone is
requeued while later writes for that zone are already on the FIFO list,
it must be inserted in front of those writes.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 32a2cc013ed3..4d2bfb3898b0 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -160,8 +160,22 @@ static void
 deadline_add_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
 {
 	struct rb_root *root = deadline_rb_root(per_prio, rq);
+	struct request **next_rq __maybe_unused;
 
 	elv_rb_add(root, rq);
+#ifdef CONFIG_BLK_DEV_ZONED
+	next_rq = &per_prio->next_rq[rq_data_dir(rq)];
+	if (*next_rq == NULL || !blk_queue_is_zoned(rq->q))
+		return;
+	/*
+	 * If a request got requeued or requests have been submitted out of
+	 * order, make sure that, per zone, the request with the lowest LBA is
+	 * submitted first.
+	 */
+	if (blk_rq_pos(rq) < blk_rq_pos(*next_rq) &&
+	    blk_rq_zone_no(rq) == blk_rq_zone_no(*next_rq))
+		*next_rq = rq;
+#endif
 }
 
 static inline void
@@ -816,6 +830,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		list_add(&rq->queuelist, &per_prio->dispatch);
 		rq->fifo_time = jiffies;
 	} else {
+		struct list_head *insert_before;
+
 		deadline_add_rq_rb(per_prio, rq);
 
 		if (rq_mergeable(rq)) {
@@ -828,7 +844,16 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		 * set expire time and add to fifo list
 		 */
 		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
-		list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+		insert_before = &per_prio->fifo_list[data_dir];
+#ifdef CONFIG_BLK_DEV_ZONED
+		if (blk_rq_is_seq_zoned_write(rq)) {
+			struct request *rq2 = deadline_latter_request(rq);
+
+			if (rq2 && blk_rq_zone_no(rq2) == blk_rq_zone_no(rq))
+				insert_before = &rq2->queuelist;
+		}
+#endif
+		list_add_tail(&rq->queuelist, insert_before);
 	}
 }
 

* [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
                   ` (8 preceding siblings ...)
  2023-04-18 22:40 ` [PATCH v2 09/11] block: mq-deadline: Handle requeued requests correctly Bart Van Assche
@ 2023-04-18 22:40 ` Bart Van Assche
  2023-04-20  9:23   ` Niklas Cassel
  2023-04-18 22:40 ` [PATCH v2 11/11] block: mq-deadline: Respect the active zone limit Bart Van Assche
  10 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Make the zone capacity available in struct queue_limits for those
drivers that need it.
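
As an illustration, a consumer can combine the new limit with
chunk_sectors (which holds the zone size for zoned devices) to compute
the writable end of a zone; zone_writable_end() below is a hypothetical
helper, not part of this patch:

	static sector_t zone_writable_end(const struct queue_limits *lim,
					  unsigned int zno)
	{
		/* First sector past the writable part of zone 'zno'. */
		return (sector_t)zno * lim->chunk_sectors + lim->zone_capacity;
	}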

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 Documentation/ABI/stable/sysfs-block |  8 ++++++++
 block/blk-settings.c                 |  1 +
 block/blk-sysfs.c                    |  7 +++++++
 block/blk-zoned.c                    | 15 +++++++++++++++
 include/linux/blkdev.h               |  1 +
 5 files changed, 32 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..4527d0514fdb 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -671,6 +671,14 @@ Description:
 		regular block devices.
 
 
+What:		/sys/block/<disk>/queue/zone_capacity
+Date:		March 2023
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RO] The number of 512-byte sectors in a zone that can be read
+		or written. This number is less than or equal to the zone size.
+
+
 What:		/sys/block/<disk>/queue/zone_write_granularity
 Date:		January 2021
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 896b4654ab00..96f5dc63a815 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -685,6 +685,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 						   b->max_secure_erase_sectors);
 	t->zone_write_granularity = max(t->zone_write_granularity,
 					b->zone_write_granularity);
+	t->zone_capacity = max(t->zone_capacity, b->zone_capacity);
 	t->zoned = max(t->zoned, b->zoned);
 	return ret;
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a64208583853..0443b8f536f4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -337,6 +337,11 @@ static ssize_t queue_nr_zones_show(struct request_queue *q, char *page)
 	return queue_var_show(disk_nr_zones(q->disk), page);
 }
 
+static ssize_t queue_zone_capacity_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->limits.zone_capacity, page);
+}
+
 static ssize_t queue_max_open_zones_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(bdev_max_open_zones(q->disk->part0), page);
@@ -587,6 +592,7 @@ QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
 QUEUE_RO_ENTRY(queue_zoned, "zoned");
 QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
+QUEUE_RO_ENTRY(queue_zone_capacity, "zone_capacity");
 QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
 QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
 
@@ -644,6 +650,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
+	&queue_zone_capacity_entry.attr,
 	&queue_max_open_zones_entry.attr,
 	&queue_max_active_zones_entry.attr,
 	&queue_nomerges_entry.attr,
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 9b9cd6adfd1b..a319acbe377b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -463,6 +463,7 @@ struct blk_revalidate_zone_args {
 	unsigned long	*seq_zones_wlock;
 	unsigned int	nr_zones;
 	sector_t	zone_sectors;
+	sector_t	zone_capacity;
 	sector_t	sector;
 };
 
@@ -489,6 +490,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		}
 
 		args->zone_sectors = zone->len;
+		args->zone_capacity = zone->capacity;
 		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
 	} else if (zone->start + args->zone_sectors < capacity) {
 		if (zone->len != args->zone_sectors) {
@@ -496,12 +498,23 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 				disk->disk_name);
 			return -ENODEV;
 		}
+		if (zone->capacity != args->zone_capacity) {
+			pr_warn("%s: Invalid zoned device with non constant zone capacity\n",
+				disk->disk_name);
+			return -ENODEV;
+		}
 	} else {
 		if (zone->len > args->zone_sectors) {
 			pr_warn("%s: Invalid zoned device with larger last zone size\n",
 				disk->disk_name);
 			return -ENODEV;
 		}
+		if (zone->capacity >
+		    min(args->zone_sectors, args->zone_capacity)) {
+			pr_warn("%s: Invalid zoned device with too large last zone capacity\n",
+				disk->disk_name);
+			return -ENODEV;
+		}
 	}
 
 	/* Check for holes in the zone report */
@@ -604,6 +617,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	blk_mq_freeze_queue(q);
 	if (ret > 0) {
 		blk_queue_chunk_sectors(q, args.zone_sectors);
+		q->limits.zone_capacity = args.zone_capacity;
 		disk->nr_zones = args.nr_zones;
 		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
@@ -635,6 +649,7 @@ void disk_clear_zone_settings(struct gendisk *disk)
 	disk->max_open_zones = 0;
 	disk->max_active_zones = 0;
 	q->limits.chunk_sectors = 0;
+	q->limits.zone_capacity = 0;
 	q->limits.zone_write_granularity = 0;
 	q->limits.max_zone_append_sectors = 0;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 261538319bbf..4fb0e6d7fdc8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -297,6 +297,7 @@ struct queue_limits {
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
+	unsigned int		zone_capacity;
 
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;

* [PATCH v2 11/11] block: mq-deadline: Respect the active zone limit
  2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
                   ` (9 preceding siblings ...)
  2023-04-18 22:40 ` [PATCH v2 10/11] block: Add support for the zone capacity concept Bart Van Assche
@ 2023-04-18 22:40 ` Bart Van Assche
  10 siblings, 0 replies; 48+ messages in thread
From: Bart Van Assche @ 2023-04-18 22:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jaegeuk Kim, Christoph Hellwig, Bart Van Assche,
	Damien Le Moal, Ming Lei

Some zoned block devices restrict the number of active zones. Attempts
to exceed that limit fail with BLK_STS_ZONE_ACTIVE_RESOURCE. Prevent
data loss on filesystems that use a zoned block device by making the
mq-deadline I/O scheduler respect the active zone limit.

This patch prevents the following pattern from triggering write errors
(see the dispatch-gate sketch below):
* Data is written sequentially. The number of data streams equals the
  maximum number of active zones.
* Per stream, writes for a new zone are queued before the previous zone
  is full.
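
The sketch below summarizes the dispatch gate that dd_check_zone()
implements; zone_is_active() is a hypothetical stand-in for the hash
table lookup added by this patch:

	/*
	 * Sketch only: a write to a zone that is not yet active may only
	 * be dispatched while an active-zone slot is still available.
	 */
	if (!zone_is_active(dd, rq) &&
	    dd_active_zones(dd) >= disk_max_active_zones(rq->q->disk))
		return NULL;	/* retry after a zone becomes inactive */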

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/Kconfig.iosched     |   4 ++
 block/Makefile            |   1 +
 block/mq-deadline-zoned.c | 146 ++++++++++++++++++++++++++++++++++++++
 block/mq-deadline-zoned.h |  33 +++++++++
 block/mq-deadline.c       | 117 ++++++++++++------------------
 block/mq-deadline.h       |  81 +++++++++++++++++++++
 include/linux/blkdev.h    |  11 +++
 7 files changed, 320 insertions(+), 73 deletions(-)
 create mode 100644 block/mq-deadline-zoned.c
 create mode 100644 block/mq-deadline-zoned.h
 create mode 100644 block/mq-deadline.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 27f11320b8d1..604edd5da8d8 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -4,9 +4,13 @@ menu "IO Schedulers"
 config MQ_IOSCHED_DEADLINE
 	tristate "MQ deadline I/O scheduler"
 	default y
+	select MQ_IOSCHED_DEADLINE_ZONED if BLK_DEV_ZONED
 	help
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_DEADLINE_ZONED
+	bool
+
 config MQ_IOSCHED_KYBER
 	tristate "Kyber I/O scheduler"
 	default y
diff --git a/block/Makefile b/block/Makefile
index b31b05390749..a685035ac325 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_DEADLINE_ZONED)	+= mq-deadline-zoned.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
 obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
diff --git a/block/mq-deadline-zoned.c b/block/mq-deadline-zoned.c
new file mode 100644
index 000000000000..fe42e27dad41
--- /dev/null
+++ b/block/mq-deadline-zoned.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2023 Google LLC <bvanassche@google.com>
+ */
+
+#include <linux/blk-mq.h>
+#include <linux/hashtable.h>
+#include <linux/lockdep.h>
+#include "mq-deadline.h"
+#include "mq-deadline-zoned.h"
+
+struct dd_zone {
+	struct hlist_node entry; /* Bucket list entry */
+	unsigned int zno; /* Zone number */
+	sector_t start; /* First sector of zone */
+	sector_t end; /* First sector past zone capacity */
+	sector_t wp; /* Write pointer */
+};
+
+static struct dd_zone *dd_find_zone(struct deadline_data *dd,
+				    struct request *rq)
+{
+	unsigned int zno = blk_rq_zone_no(rq);
+	struct dd_zone *zone;
+
+	lockdep_assert_held(&dd->lock);
+
+	hash_for_each_possible(dd->active_zones, zone, entry, zno)
+		if (zone->zno == zno)
+			return zone;
+
+	return NULL;
+}
+
+static struct dd_zone *dd_find_add_zone(struct deadline_data *dd,
+					struct request *rq)
+{
+	const struct queue_limits *lim = &rq->q->limits;
+	struct dd_zone *zone;
+	unsigned int zno;
+
+	lockdep_assert_held(&dd->lock);
+
+	zone = dd_find_zone(dd, rq);
+	if (zone)
+		return zone;
+	zone = kzalloc(sizeof(*zone), GFP_ATOMIC);
+	if (WARN_ON_ONCE(!zone))
+		return zone;
+	zno = blk_rq_zone_no(rq);
+	zone->zno = zno;
+	zone->start = (sector_t)zno * lim->chunk_sectors;
+	zone->end = zone->start + lim->zone_capacity;
+	hash_add(dd->active_zones, &zone->entry, zno);
+	return zone;
+}
+
+unsigned int dd_active_zones(struct deadline_data *dd)
+{
+	struct dd_zone *zone;
+	unsigned int bkt, n = 0;
+
+	lockdep_assert_held(&dd->lock);
+
+	hash_for_each(dd->active_zones, bkt, zone, entry)
+		n++;
+
+	return n;
+}
+
+/*
+ * Returns true if and only if dispatching request @rq won't cause the active
+ * zones limit to be exceeded.
+ */
+bool dd_check_zone(struct deadline_data *dd, struct request *rq)
+{
+	unsigned int zno = blk_rq_zone_no(rq);
+	struct dd_zone *zone;
+
+	lockdep_assert_held(&dd->lock);
+
+	if (rq->q->disk->max_active_zones == 0)
+		return true;
+
+	switch (req_op(rq)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ZONE_APPEND:
+		break;
+	default:
+		return true;
+	}
+
+	hash_for_each_possible(dd->active_zones, zone, entry, zno)
+		if (zone->zno == zno)
+			return true; /* zone already open */
+
+	/* zone is empty or closed */
+	return dd_active_zones(dd) < rq->q->disk->max_active_zones;
+}
+
+void dd_update_wp(struct deadline_data *dd, struct request *rq,
+		  sector_t completed)
+{
+	struct dd_zone *zone;
+
+	lockdep_assert_held(&dd->lock);
+
+	if (WARN_ON_ONCE(rq->q->disk->max_active_zones == 0))
+		return;
+
+	switch (req_op(rq)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ZONE_APPEND:
+		zone = dd_find_add_zone(dd, rq);
+		if (!zone)
+			return;
+		zone->wp = blk_rq_pos(rq) + completed;
+		if (zone->wp < zone->end)
+			return;
+		hash_del(&zone->entry);
+		kfree(zone);
+		break;
+	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_RESET:
+		zone = dd_find_zone(dd, rq);
+		if (!zone)
+			return;
+		hash_del(&zone->entry);
+		kfree(zone);
+		break;
+	case REQ_OP_ZONE_RESET_ALL: {
+		struct hlist_node *tmp;
+		unsigned int bkt;
+
+		hash_for_each_safe(dd->active_zones, bkt, tmp, zone, entry) {
+			hash_del(&zone->entry);
+			kfree(zone);
+		}
+		break;
+	}
+	default:
+		break;
+	}
+}
diff --git a/block/mq-deadline-zoned.h b/block/mq-deadline-zoned.h
new file mode 100644
index 000000000000..614c83e3f95a
--- /dev/null
+++ b/block/mq-deadline-zoned.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _MQ_DEADLINE_ZONED_H_
+#define _MQ_DEADLINE_ZONED_H_
+
+#include <linux/types.h>
+
+struct deadline_data;
+struct request;
+
+#ifdef CONFIG_BLK_DEV_ZONED
+bool dd_check_zone(struct deadline_data *dd, struct request *rq);
+void dd_update_wp(struct deadline_data *dd, struct request *rq,
+		  sector_t completed);
+unsigned int dd_active_zones(struct deadline_data *dd);
+#else
+static inline bool dd_check_zone(struct deadline_data *dd, struct request *rq)
+{
+	return true;
+}
+
+static inline void dd_update_wp(struct deadline_data *dd, struct request *rq,
+				sector_t completed)
+{
+}
+
+static inline unsigned int dd_active_zones(struct deadline_data *dd)
+{
+	return 0;
+}
+#endif
+
+#endif /* _MQ_DEADLINE_ZONED_H_ */
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 4d2bfb3898b0..e8609e1a58dd 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -9,6 +9,7 @@
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/bio.h>
+#include <linux/hashtable.h>
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/init.h>
@@ -23,6 +24,8 @@
 #include "blk-mq.h"
 #include "blk-mq-debugfs.h"
 #include "blk-mq-sched.h"
+#include "mq-deadline.h"
+#include "mq-deadline-zoned.h"
 
 /*
  * See Documentation/block/deadline-iosched.rst
@@ -38,73 +41,6 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-enum dd_data_dir {
-	DD_READ		= READ,
-	DD_WRITE	= WRITE,
-};
-
-enum { DD_DIR_COUNT = 2 };
-
-enum dd_prio {
-	DD_RT_PRIO	= 0,
-	DD_BE_PRIO	= 1,
-	DD_IDLE_PRIO	= 2,
-	DD_PRIO_MAX	= 2,
-};
-
-enum { DD_PRIO_COUNT = 3 };
-
-/*
- * I/O statistics per I/O priority. It is fine if these counters overflow.
- * What matters is that these counters are at least as wide as
- * log2(max_outstanding_requests).
- */
-struct io_stats_per_prio {
-	uint32_t inserted;
-	uint32_t merged;
-	uint32_t dispatched;
-	atomic_t completed;
-};
-
-/*
- * Deadline scheduler data per I/O priority (enum dd_prio). Requests are
- * present on both sort_list[] and fifo_list[].
- */
-struct dd_per_prio {
-	struct list_head dispatch;
-	struct rb_root sort_list[DD_DIR_COUNT];
-	struct list_head fifo_list[DD_DIR_COUNT];
-	/* Next request in FIFO order. Read, write or both are NULL. */
-	struct request *next_rq[DD_DIR_COUNT];
-	struct io_stats_per_prio stats;
-};
-
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
-	struct dd_per_prio per_prio[DD_PRIO_COUNT];
-
-	/* Data direction of latest dispatched request. */
-	enum dd_data_dir last_dir;
-	unsigned int batching;		/* number of sequential requests made */
-	unsigned int starved;		/* times reads have starved writes */
-
-	/*
-	 * settings that change how the i/o scheduler behaves
-	 */
-	int fifo_expire[DD_DIR_COUNT];
-	int fifo_batch;
-	int writes_starved;
-	int front_merges;
-	u32 async_depth;
-	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
-};
-
 /* Maps an I/O priority class to a deadline scheduler priority. */
 static const enum dd_prio ioprio_class_to_prio[] = {
 	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
@@ -459,7 +395,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	if (!list_empty(&per_prio->dispatch)) {
 		rq = list_first_entry(&per_prio->dispatch, struct request,
 				      queuelist);
-		if (started_after(dd, rq, latest_start))
+		if (started_after(dd, rq, latest_start) ||
+		    !dd_check_zone(dd, rq))
 			return NULL;
 		list_del_init(&rq->queuelist);
 		goto done;
@@ -469,6 +406,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	 * batches are currently reads XOR writes
 	 */
 	rq = deadline_next_request(dd, per_prio, dd->last_dir);
+	if (rq && !dd_check_zone(dd, rq))
+		return NULL;
 	if (rq && dd->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
@@ -512,13 +451,16 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	 * we are not running a batch, find best request for selected data_dir
 	 */
 	next_rq = deadline_next_request(dd, per_prio, data_dir);
-	if (deadline_check_fifo(per_prio, data_dir) || !next_rq) {
+	if (deadline_check_fifo(per_prio, data_dir) || !next_rq ||
+	    !dd_check_zone(dd, next_rq)) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
 		rq = deadline_fifo_request(dd, per_prio, data_dir);
+		if (rq && !dd_check_zone(dd, rq))
+			rq = NULL;
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
@@ -538,6 +480,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	dd->batching = 0;
 
 dispatch_request:
+	WARN_ON_ONCE(!dd_check_zone(dd, rq));
 	if (started_after(dd, rq, latest_start))
 		return NULL;
 
@@ -604,7 +547,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
-		goto unlock;
+		goto update_wp;
 
 	/*
 	 * Next, dispatch requests in priority order. Ignore lower priority
@@ -616,7 +559,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 			break;
 	}
 
-unlock:
+update_wp:
+	if (rq && disk_max_active_zones(rq->q->disk))
+		dd_update_wp(dd, rq, 0);
+
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -722,6 +668,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
+	hash_init(dd->active_zones);
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
 
@@ -919,6 +866,7 @@ static void dd_finish_request(struct request *rq)
 	const u8 ioprio_class = dd_rq_ioclass(rq);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];
+	unsigned long flags;
 
 	/*
 	 * The block layer core may call dd_finish_request() without having
@@ -930,9 +878,16 @@ static void dd_finish_request(struct request *rq)
 
 	atomic_inc(&per_prio->stats.completed);
 
-	if (blk_queue_is_zoned(q)) {
-		unsigned long flags;
+	if (disk_max_active_zones(rq->q->disk)) {
+		spin_lock_irqsave(&dd->lock, flags);
+		dd_update_wp(dd, rq, blk_rq_sectors(rq));
+		spin_unlock_irqrestore(&dd->lock, flags);
 
+		if (dd_has_write_work(rq->mq_hctx))
+			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
+	}
+
+	if (blk_queue_is_zoned(q)) {
 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
@@ -1132,6 +1087,21 @@ static int dd_queued_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+static int dd_active_zones_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	unsigned int n;
+
+	spin_lock(&dd->lock);
+	n = dd_active_zones(dd);
+	spin_unlock(&dd->lock);
+
+	seq_printf(m, "%u\n", n);
+
+	return 0;
+}
+
 /* Number of requests owned by the block driver for a given priority. */
 static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
 {
@@ -1230,6 +1200,7 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
 	{"owned_by_driver", 0400, dd_owned_by_driver_show},
 	{"queued", 0400, dd_queued_show},
+	{"active_zones", 0400, dd_active_zones_show},
 	{},
 };
 #undef DEADLINE_QUEUE_DDIR_ATTRS
diff --git a/block/mq-deadline.h b/block/mq-deadline.h
new file mode 100644
index 000000000000..98a3e9c43931
--- /dev/null
+++ b/block/mq-deadline.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _MQ_DEADLINE_H_
+#define _MQ_DEADLINE_H_
+
+#include <linux/hashtable.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock_types.h>
+
+enum dd_data_dir {
+	DD_READ		= READ,
+	DD_WRITE	= WRITE,
+};
+
+enum { DD_DIR_COUNT = 2 };
+
+enum dd_prio {
+	DD_RT_PRIO	= 0,
+	DD_BE_PRIO	= 1,
+	DD_IDLE_PRIO	= 2,
+	DD_PRIO_MAX	= 2,
+};
+
+enum { DD_PRIO_COUNT = 3 };
+
+/*
+ * I/O statistics per I/O priority. It is fine if these counters overflow.
+ * What matters is that these counters are at least as wide as
+ * log2(max_outstanding_requests).
+ */
+struct io_stats_per_prio {
+	uint32_t inserted;
+	uint32_t merged;
+	uint32_t dispatched;
+	atomic_t completed;
+};
+
+/*
+ * Deadline scheduler data per I/O priority (enum dd_prio). Requests are
+ * present on both sort_list[] and fifo_list[].
+ */
+struct dd_per_prio {
+	struct list_head dispatch;
+	struct rb_root sort_list[DD_DIR_COUNT];
+	struct list_head fifo_list[DD_DIR_COUNT];
+	/* Next request in FIFO order. Read, write or both are NULL. */
+	struct request *next_rq[DD_DIR_COUNT];
+	struct io_stats_per_prio stats;
+};
+
+struct deadline_data {
+	/*
+	 * run time data
+	 */
+
+	struct dd_per_prio per_prio[DD_PRIO_COUNT];
+
+	/* Data direction of latest dispatched request. */
+	enum dd_data_dir last_dir;
+	unsigned int batching;		/* number of sequential requests made */
+	unsigned int starved;		/* times reads have starved writes */
+
+	/*
+	 * settings that change how the i/o scheduler behaves
+	 */
+	int fifo_expire[DD_DIR_COUNT];
+	int fifo_batch;
+	int writes_starved;
+	int front_merges;
+	u32 async_depth;
+	int prio_aging_expire;
+
+	DECLARE_HASHTABLE(active_zones, 8);
+
+	spinlock_t lock;
+	spinlock_t zone_lock;
+};
+
+#endif /* _MQ_DEADLINE_H_ */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4fb0e6d7fdc8..6d7ec5d9841c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -686,6 +686,11 @@ static inline void disk_set_max_active_zones(struct gendisk *disk,
 	disk->max_active_zones = max_active_zones;
 }
 
+static inline unsigned int disk_max_active_zones(const struct gendisk *disk)
+{
+	return disk->max_active_zones;
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	return bdev->bd_disk->max_open_zones;
@@ -709,6 +714,12 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
 	return 0;
 }
+
+static inline unsigned int disk_max_active_zones(const struct gendisk *disk)
+{
+	return 0;
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	return 0;

* Re: [PATCH v2 01/11] block: Simplify blk_req_needs_zone_write_lock()
  2023-04-18 22:39 ` [PATCH v2 01/11] block: Simplify blk_req_needs_zone_write_lock() Bart Van Assche
@ 2023-04-19  4:09   ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-19  4:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH v2 02/11] block: Micro-optimize blk_req_needs_zone_write_lock()
  2023-04-18 22:39 ` [PATCH v2 02/11] block: Micro-optimize blk_req_needs_zone_write_lock() Bart Van Assche
@ 2023-04-19  4:11   ` Christoph Hellwig
  2023-04-19 18:30     ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-19  4:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei

On Tue, Apr 18, 2023 at 03:39:53PM -0700, Bart Van Assche wrote:
> Translate a request pointer into a request queue pointer with the
> expression rq->q instead of the longer pointer chain
> rq->q->disk->part0->bd_queue.

This looks correct, but I also kinda hate adding queue back in more
places.

* Re: [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes
  2023-04-18 22:39 ` [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes Bart Van Assche
@ 2023-04-19  4:30   ` Christoph Hellwig
  2023-04-19 22:43     ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-19  4:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei

On Tue, Apr 18, 2023 at 03:39:57PM -0700, Bart Van Assche wrote:
> Make sure that zoned writes (REQ_OP_WRITE and REQ_OP_WRITE_ZEROES) are
> submitted in LBA order. This patch does not affect REQ_OP_ZONE_APPEND
> requests.

As said before this is not correct.  What we need to do instead is to
support proper batch insertion when the at_head flag is set so
that the requests get inserted before the existing requests, but
in the order they are passed to the I/O scheduler.

This also needs to be done for the other two I/O schedulers.

* Re: [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write()
  2023-04-18 22:39 ` [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write() Bart Van Assche
@ 2023-04-19  4:50   ` Christoph Hellwig
  2023-04-19 21:12     ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-19  4:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei

On Tue, Apr 18, 2023 at 03:39:54PM -0700, Bart Van Assche wrote:
> +/**
> + * blk_rq_is_seq_zoned_write() - Whether @rq is a zoned write for which the order matters.

Maybe:

 * blk_rq_is_seq_zoned_write() - check if @rq needs zoned write serialization

> +bool blk_rq_is_seq_zoned_write(struct request *rq)
> +{
> +	switch (req_op(rq)) {
> +	case REQ_OP_WRITE:
> +	case REQ_OP_WRITE_ZEROES:
> +		return blk_rq_zone_is_seq(rq);
> +	case REQ_OP_ZONE_APPEND:
> +	default:

The REQ_OP_ZONE_APPEND case here is superflous.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH v2 04/11] block: mq-deadline: Simplify deadline_skip_seq_writes()
  2023-04-18 22:39 ` [PATCH v2 04/11] block: mq-deadline: Simplify deadline_skip_seq_writes() Bart Van Assche
@ 2023-04-19  4:52   ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-19  4:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei

On Tue, Apr 18, 2023 at 03:39:55PM -0700, Bart Van Assche wrote:
> Make the deadline_skip_seq_writes() code shorter without changing its
> functionality.
> 
> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 5839a027e0f0..3dde720e2f59 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -310,12 +310,9 @@ static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
>  						struct request *rq)
>  {
>  	sector_t pos = blk_rq_pos(rq);
> -	sector_t skipped_sectors = 0;
>  
> -	while (rq) {
> -		if (blk_rq_pos(rq) != pos + skipped_sectors)
> -			break;
> -		skipped_sectors += blk_rq_sectors(rq);
> +	while (rq && blk_rq_pos(rq) == pos) {
> +		pos += blk_rq_sectors(rq);
>  		rq = deadline_latter_request(rq);
>  	}

This still reads a bit odd.

Maybe turn this into:

	do {
		pos += blk_rq_sectors(rq);
		rq = deadline_latter_request(rq);
		if (!rq)
			break;
	} while (blk_rq_pos(rq) == pos);


* Re: [PATCH v2 08/11] block: mq-deadline: Fix a race condition related to zoned writes
  2023-04-18 22:39 ` [PATCH v2 08/11] block: mq-deadline: Fix a race condition related to zoned writes Bart Van Assche
@ 2023-04-19  5:07   ` Christoph Hellwig
  2023-04-19 18:46     ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-19  5:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei

On Tue, Apr 18, 2023 at 03:39:59PM -0700, Bart Van Assche wrote:
> Let deadline_next_request() only consider the first zoned write per
> zone. This patch fixes a race condition between deadline_next_request()
> and completion of zoned writes.

Can you explain the condition in a bit more detail?

>   * For the specified data direction, return the next request to
> @@ -386,9 +388,25 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  	 */
>  	spin_lock_irqsave(&dd->zone_lock, flags);
>  	while (rq) {
> +		unsigned int zno __maybe_unused;
> +
>  		if (blk_req_can_dispatch_to_zone(rq))
>  			break;
> +
> +#ifdef CONFIG_BLK_DEV_ZONED
> +		zno = blk_rq_zone_no(rq);
> +
>  		rq = deadline_skip_seq_writes(dd, rq);
> +
> +		/*
> +		 * Skip all other write requests for the zone with zone number
> +		 * 'zno'. This prevents this function from selecting a zoned write
> +		 * that is not the first write for a given zone.
> +		 */
> +		while (rq && blk_rq_zone_no(rq) == zno &&
> +		       blk_rq_is_seq_zoned_write(rq))
> +			rq = deadline_latter_request(rq);
> +#endif

The ifdeffery and extra checks are a bit ugly here.

I'd suggest to:

 - move all the zone-related logic in deadline_next_request into
   a separate helper that is stubbed out for !CONFIG_BLK_DEV_ZONED
 - merge the loop in deadline_skip_seq_writes and the loop
   added here into a single loop.  Something like:

static struct request *
deadline_zoned_next_request(struct deadline_data *dd, struct request *rq)
{
        unsigned long flags;

	spin_lock_irqsave(&dd->zone_lock, flags);
	while (!blk_req_can_dispatch_to_zone(rq)) {
		unsigned int zone_no = blk_rq_zone_no(rq);
		sector_t pos = blk_rq_pos(rq);

		while (blk_rq_is_seq_zoned_write(rq)) {
			// XXX: good comment explaining the check here
			if (blk_rq_pos(rq) != pos &&
			    blk_rq_zone_no(rq) != zone_no)
				break;
			pos += blk_rq_sectors(rq);
			rq = deadline_latter_request(rq);
			if (!rq)
				goto out_unlock;
		}
	}
out_unlock:
 	spin_unlock_irqrestore(&dd->zone_lock, flags);
	return rq;
} 

* Re: [PATCH v2 09/11] block: mq-deadline: Handle requeued requests correctly
  2023-04-18 22:40 ` [PATCH v2 09/11] block: mq-deadline: Handle requeued requests correctly Bart Van Assche
@ 2023-04-19  5:07   ` Christoph Hellwig
  2023-04-19 23:01     ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-19  5:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei

>  deadline_add_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
>  {
>  	struct rb_root *root = deadline_rb_root(per_prio, rq);
> +	struct request **next_rq __maybe_unused;
>  
>  	elv_rb_add(root, rq);
> +#ifdef CONFIG_BLK_DEV_ZONED
> +	next_rq = &per_prio->next_rq[rq_data_dir(rq)];
> +	if (*next_rq == NULL || !blk_queue_is_zoned(rq->q))
> +		return;
> +	/*
> +	 * If a request got requeued or requests have been submitted out of
> +	 * order, make sure that, per zone, the request with the lowest LBA is
> +	 * submitted first.
> +	 */
> +	if (blk_rq_pos(rq) < blk_rq_pos(*next_rq) &&
> +	    blk_rq_zone_no(rq) == blk_rq_zone_no(*next_rq))
> +		*next_rq = rq;
> +#endif

Please move this into a helper that is only called when
blk_queue_is_zoned() is true.

* Re: [PATCH v2 02/11] block: Micro-optimize blk_req_needs_zone_write_lock()
  2023-04-19  4:11   ` Christoph Hellwig
@ 2023-04-19 18:30     ` Bart Van Assche
  2023-04-20  5:00       ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-19 18:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Damien Le Moal, Ming Lei

On 4/18/23 21:11, Christoph Hellwig wrote:
> On Tue, Apr 18, 2023 at 03:39:53PM -0700, Bart Van Assche wrote:
>> Translate a request pointer into a request queue pointer with the
>> expression rq->q instead of the longer pointer chain
>> rq->q->disk->part0->bd_queue.
> 
> This looks correct, but I also kinda hate adding queue back in more
> places.

Hi Christoph,

How about changing queue_op_is_zoned_write() into 
disk_op_is_zoned_write()? That won't be as efficient as patch 2/11 in 
this series but still removes the 'part0' dereference.
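
Something like this (a sketch, assuming the rename):

static inline bool disk_op_is_zoned_write(struct gendisk *disk,
					  enum req_op op)
{
	return blk_queue_is_zoned(disk->queue) &&
		(op == REQ_OP_WRITE || op == REQ_OP_WRITE_ZEROES);
}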

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 08/11] block: mq-deadline: Fix a race condition related to zoned writes
  2023-04-19  5:07   ` Christoph Hellwig
@ 2023-04-19 18:46     ` Bart Van Assche
  2023-04-20  1:00       ` Damien Le Moal
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-19 18:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Damien Le Moal, Ming Lei

On 4/18/23 22:07, Christoph Hellwig wrote:
> On Tue, Apr 18, 2023 at 03:39:59PM -0700, Bart Van Assche wrote:
>> Let deadline_next_request() only consider the first zoned write per
>> zone. This patch fixes a race condition between deadline_next_request()
>> and completion of zoned writes.
> 
> Can you explain the condition in a bit more detail?

Hi Christoph,

I'm concerned about the following scenario:
* deadline_next_request() starts iterating over the writes for a
   particular zone.
* blk_req_can_dispatch_to_zone() returns false for the first zoned write
   examined by deadline_next_request().
* A write for the same zone completes.
* deadline_next_request() continues examining zoned writes and
   blk_req_can_dispatch_to_zone() returns true for a write other than the
   first write for that zone. This would result in an UNALIGNED WRITE
   COMMAND error for zoned SCSI devices.

Bart.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write()
  2023-04-19  4:50   ` Christoph Hellwig
@ 2023-04-19 21:12     ` Bart Van Assche
  2023-04-20  1:03       ` Damien Le Moal
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-19 21:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Damien Le Moal, Ming Lei

On 4/18/23 21:50, Christoph Hellwig wrote:
> On Tue, Apr 18, 2023 at 03:39:54PM -0700, Bart Van Assche wrote:
>> +/**
>> + * blk_rq_is_seq_zoned_write() - Whether @rq is a zoned write for which the order matters.
> 
> Maybe:
> 
>   * blk_rq_is_seq_zoned_write() - check if @rq needs zoned write serialization

That looks better to me :-)

>> +bool blk_rq_is_seq_zoned_write(struct request *rq)
>> +{
>> +	switch (req_op(rq)) {
>> +	case REQ_OP_WRITE:
>> +	case REQ_OP_WRITE_ZEROES:
>> +		return blk_rq_zone_is_seq(rq);
>> +	case REQ_OP_ZONE_APPEND:
>> +	default:
> 
> The REQ_OP_ZONE_APPEND case here is superflous.

Agreed, but I'd like to keep it since last time I posted this patch I 
was asked whether I had perhaps overlooked the REQ_OP_ZONE_APPEND case. 
I added "case REQ_OP_ZONE_APPEND:" to prevent such questions. Are you OK 
with keeping "case REQ_OP_ZONE_APPEND:" or do you perhaps prefer that I 
remove it?

Bart.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes
  2023-04-19  4:30   ` Christoph Hellwig
@ 2023-04-19 22:43     ` Bart Van Assche
  2023-04-20  5:06       ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-19 22:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Damien Le Moal, Ming Lei

On 4/18/23 21:30, Christoph Hellwig wrote:
> On Tue, Apr 18, 2023 at 03:39:57PM -0700, Bart Van Assche wrote:
>> Make sure that zoned writes (REQ_OP_WRITE and REQ_OP_WRITE_ZEROES) are
>> submitted in LBA order. This patch does not affect REQ_OP_WRITE_APPEND
>> requests.
> 
> As said before this is not correct.  What we need to do instead is to
> support proper insertion when the at_head flag is set so
> that the requests get inserted before the existing requests, but
> in the order they are passed to the I/O scheduler.

It is not clear to me how this approach would work if the AT_HEAD flag
is set for some zoned writes but not for others targeting the same zone.
The mq-deadline scheduler uses separate lists for at-head and for 
other requests. Having to check both lists before dispatching a request 
would significantly complicate the mq-deadline scheduler.

> This also needs to be done for the other two I/O schedulers.

As far as I know neither Kyber nor BFQ support zoned storage so we don't 
need to worry about how these schedulers handle zoned writes?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 09/11] block: mq-deadline: Handle requeued requests correctly
  2023-04-19  5:07   ` Christoph Hellwig
@ 2023-04-19 23:01     ` Bart Van Assche
  2023-04-20  1:07       ` Damien Le Moal
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-19 23:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Damien Le Moal, Ming Lei

On 4/18/23 22:07, Christoph Hellwig wrote:
>>   deadline_add_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
>>   {
>>   	struct rb_root *root = deadline_rb_root(per_prio, rq);
>> +	struct request **next_rq __maybe_unused;
>>   
>>   	elv_rb_add(root, rq);
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> +	next_rq = &per_prio->next_rq[rq_data_dir(rq)];
>> +	if (*next_rq == NULL || !blk_queue_is_zoned(rq->q))
>> +		return;
>> +	/*
>> +	 * If a request got requeued or requests have been submitted out of
>> +	 * order, make sure that per zone the request with the lowest LBA is
>> +	 * submitted first.
>> +	 */
>> +	if (blk_rq_pos(rq) < blk_rq_pos(*next_rq) &&
>> +	    blk_rq_zone_no(rq) == blk_rq_zone_no(*next_rq))
>> +		*next_rq = rq;
>> +#endif
> 
>> Please move this into a helper that is only called when blk_queue_is_zoned
>> is true.
Hi Christoph,

I'm looking into an alternative, namely to remove the next_rq member 
from struct dd_per_prio and instead to do the following:
* Track the offset (blk_rq_pos()) of the most recently dispatched 
request ("latest_pos").
* Where the next_rq member is read, look up the request that comes after 
latest_pos in the RB-tree. This should require an effort that is similar 
to updating next_rq after having dispatched a request.

With this approach the code quoted above, which is surrounded by
#ifdef/#endif, will disappear.
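
A rough sketch of the lookup (helper name provisional):

static struct request *
deadline_from_pos(struct dd_per_prio *per_prio, enum dd_data_dir data_dir,
		  sector_t latest_pos)
{
	struct rb_node *node = per_prio->sort_list[data_dir].rb_node;
	struct request *rq, *res = NULL;

	while (node) {
		rq = rb_entry_rq(node);
		if (blk_rq_pos(rq) >= latest_pos) {
			/* Candidate; a smaller match may exist on the left. */
			res = rq;
			node = node->rb_left;
		} else {
			node = node->rb_right;
		}
	}
	return res;
}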

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 08/11] block: mq-deadline: Fix a race condition related to zoned writes
  2023-04-19 18:46     ` Bart Van Assche
@ 2023-04-20  1:00       ` Damien Le Moal
  0 siblings, 0 replies; 48+ messages in thread
From: Damien Le Moal @ 2023-04-20  1:00 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Damien Le Moal, Ming Lei

On 4/20/23 03:46, Bart Van Assche wrote:
> On 4/18/23 22:07, Christoph Hellwig wrote:
>> On Tue, Apr 18, 2023 at 03:39:59PM -0700, Bart Van Assche wrote:
>>> Let deadline_next_request() only consider the first zoned write per
>>> zone. This patch fixes a race condition between deadline_next_request()
>>> and completion of zoned writes.
>>
>> Can you explain the condition in a bit more detail?
> 
> Hi Christoph,
> 
> I'm concerned about the following scenario:
> * deadline_next_request() starts iterating over the writes for a
>    particular zone.
> * blk_req_can_dispatch_to_zone() returns false for the first zoned write
>    examined by deadline_next_request().
> * A write for the same zone completes.
> * deadline_next_request() continues examining zoned writes and
>    blk_req_can_dispatch_to_zone() returns true for a write other than the
>    first write for that zone. This would result in an UNALIGNED WRITE
>    COMMAND error for zoned SCSI devices.

Examining zoned writes has to be done with the deadline zone lock held, exactly
to avoid this issue.

> 
> Bart.
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write()
  2023-04-19 21:12     ` Bart Van Assche
@ 2023-04-20  1:03       ` Damien Le Moal
  2023-04-20  5:01         ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Damien Le Moal @ 2023-04-20  1:03 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Damien Le Moal, Ming Lei

On 4/20/23 06:12, Bart Van Assche wrote:
> On 4/18/23 21:50, Christoph Hellwig wrote:
>> On Tue, Apr 18, 2023 at 03:39:54PM -0700, Bart Van Assche wrote:
>>> +/**
>>> + * blk_rq_is_seq_zoned_write() - Whether @rq is a zoned write for which the order matters.
>>
>> Maybe:
>>
>>   * blk_rq_is_seq_zoned_write() - check if @rq needs zoned write serialization
> 
> That looks better to me :-)
> 
>>> +bool blk_rq_is_seq_zoned_write(struct request *rq)
>>> +{
>>> +	switch (req_op(rq)) {
>>> +	case REQ_OP_WRITE:
>>> +	case REQ_OP_WRITE_ZEROES:
>>> +		return blk_rq_zone_is_seq(rq);
>>> +	case REQ_OP_ZONE_APPEND:
>>> +	default:
>>
>> The REQ_OP_ZONE_APPEND case here is superflous.
> 
> Agreed, but I'd like to keep it since last time I posted this patch I 
> was asked whether I had perhaps overlooked the REQ_OP_ZONE_APPEND case. 
> I added "case REQ_OP_ZONE_APPEND:" to prevent such questions. Are you OK 
> with keeping "case REQ_OP_ZONE_APPEND:" or do you perhaps prefer that I 
> remove it?
> 

Could also have a comment on top of the switch explicitly saying that only WRITE
and WRITE ZEROES need to be checked, and that all other commands, including zone
append writes, do not have strong reordering requirements. This way, there is no
need for superfluous cases.

> Bart.
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 09/11] block: mq-deadline: Handle requeued requests correctly
  2023-04-19 23:01     ` Bart Van Assche
@ 2023-04-20  1:07       ` Damien Le Moal
  0 siblings, 0 replies; 48+ messages in thread
From: Damien Le Moal @ 2023-04-20  1:07 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Ming Lei

On 4/20/23 08:01, Bart Van Assche wrote:
> On 4/18/23 22:07, Christoph Hellwig wrote:
>>>   deadline_add_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
>>>   {
>>>   	struct rb_root *root = deadline_rb_root(per_prio, rq);
>>> +	struct request **next_rq __maybe_unused;
>>>   
>>>   	elv_rb_add(root, rq);
>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>> +	next_rq = &per_prio->next_rq[rq_data_dir(rq)];
>>> +	if (*next_rq == NULL || !blk_queue_is_zoned(rq->q))
>>> +		return;
>>> +	/*
>>> +	 * If a request got requeued or requests have been submitted out of
>>> +	 * order, make sure that per zone the request with the lowest LBA is
>>> +	 * submitted first.
>>> +	 */
>>> +	if (blk_rq_pos(rq) < blk_rq_pos(*next_rq) &&
>>> +	    blk_rq_zone_no(rq) == blk_rq_zone_no(*next_rq))
>>> +		*next_rq = rq;
>>> +#endif
>>
>> Please key move this into a helper only called when blk_queue_is_zoned
>> is true.
> Hi Christoph,
> 
> I'm looking into an alternative, namely to remove the next_rq member 
> from struct dd_per_prio and instead to do the following:
> * Track the offset (blk_rq_pos()) of the most recently dispatched 
> request ("latest_pos").
> * Where the next_rq member is read, look up the request that comes after 
> latest_pos in the RB-tree. This should require an effort that is similar 
> to updating next_rq after having dispatched a request.
> 
> With this approach the code quoted above, which is surrounded by
> #ifdef/#endif, will disappear.

This sounds much better, given that there are in practice lots of cases where
next_rq is set to NULL and we end up getting the next request from the fifo
list head. At least that is what I saw last time I looked at this (it was when
I added the patch to skip seq writes over two zones).

> 
> Bart.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 02/11] block: Micro-optimize blk_req_needs_zone_write_lock()
  2023-04-19 18:30     ` Bart Van Assche
@ 2023-04-20  5:00       ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-20  5:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim,
	Damien Le Moal, Ming Lei

On Wed, Apr 19, 2023 at 11:30:36AM -0700, Bart Van Assche wrote:
>> This looks correct, but I also kinda hate adding queue back in more
>> places.
>
> Hi Christoph,
>
> How about changing queue_op_is_zoned_write() into disk_op_is_zoned_write()? 
> That won't be as efficient as patch 2/11 in this series but still removes 
> the 'part0' dereference.

Or just open code it since we only have two callers anyway?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write()
  2023-04-20  1:03       ` Damien Le Moal
@ 2023-04-20  5:01         ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-20  5:01 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block,
	Jaegeuk Kim, Damien Le Moal, Ming Lei

On Thu, Apr 20, 2023 at 10:03:16AM +0900, Damien Le Moal wrote:
> Could also have a comment on top of the switch explicitly saying that only WRITE
> and WRITE ZEROES need to be checked, and that all other commands, including zone
> append writes, do not have strong reordering requirements. This way, there is no
> need for superfluous cases.

Yes, I think a good comment would be more useful here.
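
E.g. something like this (a sketch, based on the code quoted above):

/*
 * blk_rq_is_seq_zoned_write() - check if @rq needs zoned write
 *	serialization. Only writes to sequential write required zones
 *	need to be serialized; all other operations, including zone
 *	append writes, can be reordered freely.
 */
bool blk_rq_is_seq_zoned_write(struct request *rq)
{
	switch (req_op(rq)) {
	case REQ_OP_WRITE:
	case REQ_OP_WRITE_ZEROES:
		return blk_rq_zone_is_seq(rq);
	default:
		return false;
	}
}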

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes
  2023-04-19 22:43     ` Bart Van Assche
@ 2023-04-20  5:06       ` Christoph Hellwig
  2023-04-20 17:00         ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-20  5:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim,
	Damien Le Moal, Ming Lei

On Wed, Apr 19, 2023 at 03:43:41PM -0700, Bart Van Assche wrote:
>> As said before this is not correct.  What we need to do instead is to
>> support proper insertion when the at_head flag is set so
>> that the requests get inserted before the existing requests, but
>> in the order they are passed to the I/O scheduler.
>
> It is not clear to me how this approach would work if the AT_HEAD flag is
> set for some zoned writes but not for others targeting the same zone.
> The mq-deadline scheduler uses separate lists for at-head and for other
> requests. Having to check both lists before dispatching a request would
> significantly complicate the mq-deadline scheduler.
>
>> This also needs to be done for the other two I/O schedulers.
>
> As far as I know neither Kyber nor BFQ support zoned storage so we don't 
> need to worry about how these schedulers handle zoned writes?

The problem is that we now run into different handling depending
on which kind of write is coming in.  So we need to find a policy
that works for everyone.  Remember that as of the current for-6.4/block
branch the only at_head insertions remaining are:

 - blk_execute* for passthrough requests that never enter the I/O
   scheduler
 - REQ_OP_FLUSH requests that never enter the I/O scheduler
 - requeues using blk_mq_requeue_request
 - processing of the actual writes in the flush state machine

with the last one going away in my RFC series.

So if we come to the conclusion that requeues from the driver do not
actually need at_head insertions to avoid starvation or similar,
we should just remove at_head insertions from the I/O scheduler.
If we can't do that, we need to handle them for zoned writes as well.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-18 22:40 ` [PATCH v2 10/11] block: Add support for the zone capacity concept Bart Van Assche
@ 2023-04-20  9:23   ` Niklas Cassel
  2023-04-20 17:12     ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Niklas Cassel @ 2023-04-20  9:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei, Matias Bjorling

On Tue, Apr 18, 2023 at 03:40:01PM -0700, Bart Van Assche wrote:
> Make the zone capacity available in struct queue_limits for those
> drivers that need it.
> 
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  Documentation/ABI/stable/sysfs-block |  8 ++++++++
>  block/blk-settings.c                 |  1 +
>  block/blk-sysfs.c                    |  7 +++++++
>  block/blk-zoned.c                    | 15 +++++++++++++++
>  include/linux/blkdev.h               |  1 +
>  5 files changed, 32 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index c57e5b7cb532..4527d0514fdb 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -671,6 +671,14 @@ Description:
>  		regular block devices.
>  
>  
> +What:		/sys/block/<disk>/queue/zone_capacity
> +Date:		March 2023
> +Contact:	linux-block@vger.kernel.org
> +Description:
> +		[RO] The number of 512-byte sectors in a zone that can be read
> +		or written. This number is less than or equal to the zone size.
> +
> +
>  What:		/sys/block/<disk>/queue/zone_write_granularity
>  Date:		January 2021
>  Contact:	linux-block@vger.kernel.org
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 896b4654ab00..96f5dc63a815 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -685,6 +685,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  						   b->max_secure_erase_sectors);
>  	t->zone_write_granularity = max(t->zone_write_granularity,
>  					b->zone_write_granularity);
> +	t->zone_capacity = max(t->zone_capacity, b->zone_capacity);
>  	t->zoned = max(t->zoned, b->zoned);
>  	return ret;
>  }

(snip)

> @@ -496,12 +498,23 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  				disk->disk_name);
>  			return -ENODEV;
>  		}
> +		if (zone->capacity != args->zone_capacity) {
> +			pr_warn("%s: Invalid zoned device with non constant zone capacity\n",
> +				disk->disk_name);
> +			return -ENODEV;

Hello Bart,

The NVMe Zoned Namespace Command Set Specification:
https://nvmexpress.org/wp-content/uploads/NVM-Express-Zoned-Namespace-Command-Set-Specification-1.1c-2022.10.03-Ratified.pdf

specifies the Zone Capacity (ZCAP) field in each Zone Descriptor.
The Descriptors are part of the Report Zones Data Structure.

This means that while the zone size is the same for all zones,
the zone capacity can be different for each zone.

While the single NVMe ZNS SSD that I've encountered so far did
coincidentally have the same zone capacity for all zones, this
is not required by the specification.

The NVMe driver does reject a ZNS device that has support for
Variable Zone Capacity (which is defined in the ZOC field):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/zns.c?h=v6.3-rc7#n95

Variable Zone Capacity simply means that the zone capacities
cannot change without an NVM format.

However, even when Variable Zone Capacity is not supported,
an NVMe ZNS device can still have different zone capacities,
and AFAICT, such devices are currently supported.

With your change above, we would start rejecting such devices.


Is this reduction of supported NVMe ZNS SSD devices really desired?

If it is, then I would at least expect a motivation for why we are
reducing the scope of the currently supported NVMe ZNS devices to be
found somewhere in the commit message.


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes
  2023-04-20  5:06       ` Christoph Hellwig
@ 2023-04-20 17:00         ` Bart Van Assche
  2023-04-24  7:00           ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-20 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Damien Le Moal, Ming Lei

On 4/19/23 22:06, Christoph Hellwig wrote:
> The problem is that we now run into different handling depending
> on which kind of write is coming in.  So we need to find a policy
> that works for everyone.  Remember that as of the current for-6.4/block
> branch the only at_head insertions remaining are:
> 
>   - blk_execute* for passthrough requests that never enter the I/O
>     scheduler
>   - REQ_OP_FLUSH requests that never enter the I/O scheduler
>   - requeues using blk_mq_requeue_request
>   - processing of the actual writes in the flush state machine
> 
> with the last one going away in my RFC series.
> 
> So if we come to the conclusion that requeues from the driver do not
> actually need at_head insertions to avoid starvation or similar,
> we should just remove at_head insertions from the I/O scheduler.
> If we can't do that, we need to handle them for zoned writes as well.

Hi Christoph,

I'm fine with not inserting requeued requests at the head of the queue.
Inserting requeued requests at the head of the queue only preserves the
original submission order if a single request is requeued. If multiple
requests are requeued, inserting at the head of the queue will cause
inversion of the order of the requeued requests (e.g. requeuing A and
then B at the head results in dispatch order B, A).

This implies that the I/O scheduler or disk controller (if no I/O 
scheduler is configured) will become responsible for optimizing the 
request order if requeuing happens.

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-20  9:23   ` Niklas Cassel
@ 2023-04-20 17:12     ` Bart Van Assche
  2023-04-20 22:00       ` Damien Le Moal
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-20 17:12 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei, Matias Bjorling

On 4/20/23 02:23, Niklas Cassel wrote:
> With your change above, we would start rejecting such devices.
> 
> Is this reduction of supported NVMe ZNS SSD devices really desired?

Hi Niklas,

This is not my intention. A possible solution is to modify the NVMe 
driver and SCSI core such that the "zone is full" information is stored 
in struct request when a command completes. That will remove the need 
for the mq-deadline scheduler to know the zone capacity.

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-20 17:12     ` Bart Van Assche
@ 2023-04-20 22:00       ` Damien Le Moal
  2023-04-20 22:51         ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Damien Le Moal @ 2023-04-20 22:00 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei, Matias Bjorling

On 4/21/23 02:12, Bart Van Assche wrote:
> On 4/20/23 02:23, Niklas Cassel wrote:
>> With your change above, we would start rejecting such devices.
>>
>> Is this reduction of supported NVMe ZNS SSD devices really desired?
> 
> Hi Niklas,
> 
> This is not my intention. A possible solution is to modify the NVMe 
> driver and SCSI core such that the "zone is full" information is stored 
> in struct request when a command completes. That will remove the need 
> for the mq-deadline scheduler to know the zone capacity.

I am not following... Why would the scheduler need to know the zone capacity?

If the user does stupid things like accessing sectors between zone capacity and
zone size or trying to write to a full zone, the commands will be failed by the
drive and I do not see why the scheduler needs to care about that.

> 
> Bart.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-20 22:00       ` Damien Le Moal
@ 2023-04-20 22:51         ` Bart Van Assche
  2023-04-20 23:37           ` Damien Le Moal
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-20 22:51 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Damien Le Moal, Ming Lei, Matias Bjorling

On 4/20/23 15:00, Damien Le Moal wrote:
> On 4/21/23 02:12, Bart Van Assche wrote:
>> On 4/20/23 02:23, Niklas Cassel wrote:
>>> With your change above, we would start rejecting such devices.
>>>
>>> Is this reduction of supported NVMe ZNS SSD devices really desired?
>>
>> Hi Niklas,
>>
>> This is not my intention. A possible solution is to modify the NVMe
>> driver and SCSI core such that the "zone is full" information is stored
>> in struct request when a command completes. That will remove the need
>> for the mq-deadline scheduler to know the zone capacity.
> 
> I am not following... Why would the scheduler need to know the zone capacity?
> 
> If the user does stupid things like accessing sectors between zone capacity and
> zone size or trying to write to a full zone, the commands will be failed by the
> drive and I do not see why the scheduler needs to care about that.

Hi Damien,

Restricting the number of active zones in the I/O scheduler (patch 
11/11) requires some knowledge of the zone condition.

According to ZBC-2, for sequential write preferred zones the additional 
sense code ZONE TRANSITION TO FULL must be reported if the zone 
condition changes from not full into full. There is no such requirement 
for sequential write required zones. Additionally, I'm not aware of any 
reporting mechanism in the NVMe specification for changes in the zone 
condition.

The overhead of submitting a REPORT ZONES command after every I/O 
completion would be unacceptable.

Is there any other solution for tracking the zone condition other than 
comparing the LBA at which a WRITE command finished with the zone capacity?
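
E.g. something like the following at write completion (a sketch with a
hypothetical helper name; assumes a power-of-two zone size):

static bool blk_rq_write_makes_zone_full(struct request *rq,
					 sector_t zone_capacity)
{
	sector_t zone_sectors = blk_queue_zone_sectors(rq->q);
	sector_t offset = blk_rq_pos(rq) & (zone_sectors - 1);

	return offset + blk_rq_sectors(rq) >= zone_capacity;
}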

Did I perhaps overlook something?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-20 22:51         ` Bart Van Assche
@ 2023-04-20 23:37           ` Damien Le Moal
  2023-04-20 23:44             ` Bart Van Assche
  0 siblings, 1 reply; 48+ messages in thread
From: Damien Le Moal @ 2023-04-20 23:37 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Ming Lei, Matias Bjorling

On 4/21/23 07:51, Bart Van Assche wrote:
> On 4/20/23 15:00, Damien Le Moal wrote:
>> On 4/21/23 02:12, Bart Van Assche wrote:
>>> On 4/20/23 02:23, Niklas Cassel wrote:
>>>> With your change above, we would start rejecting such devices.
>>>>
>>>> Is this reduction of supported NVMe ZNS SSD devices really desired?
>>>
>>> Hi Niklas,
>>>
>>> This is not my intention. A possible solution is to modify the NVMe
>>> driver and SCSI core such that the "zone is full" information is stored
>>> in struct request when a command completes. That will remove the need
>>> for the mq-deadline scheduler to know the zone capacity.
>>
>> I am not following... Why would the scheduler need to know the zone capacity?
>>
>> If the user does stupid things like accessing sectors between zone capacity and
>> zone size or trying to write to a full zone, the commands will be failed by the
>> drive and I do not see why the scheduler needs to care about that.
> 
> Hi Damien,
> 
> Restricting the number of active zones in the I/O scheduler (patch 
> 11/11) requires some knowledge of the zone condition.

Why would you need to handle the max active zone number restriction in the
scheduler ? That is the user responsibility. btrfs does it (that is still buggy
though, working on it).

> According to ZBC-2, for sequential write preferred zones the additional 
> sense code ZONE TRANSITION TO FULL must be reported if the zone 
> condition changes from not full into full. There is no such requirement 
> for sequential write required zones. Additionally, I'm not aware of any 
> reporting mechanism in the NVMe specification for changes in the zone 
> condition.

Sequential write preferred zones are a ZBC concept, and ZBC does not have the
concept of active zones. In general, for ZBC HDDs, ignoring the max number of
open zones is fine. There is no performance impact that can be measured, unless
the user goes full random write on the device. But in that case, the user is
already asking for bad perf anyway.

I suspect you are thinking about all this in the context of UFS devices?
My point stands though. Trying to manage the active zone limit at the scheduler
level is not a good idea as there are no guarantees that the user will
eventually issue all the write commands to make zones full, and thus turn them
inactive. It is the responsibility of the user, above the block IO scheduler,
to manage that.

> The overhead of submitting a REPORT ZONES command after every I/O 
> completion would be unacceptable.
> 
> Is there any other solution for tracking the zone condition other than 
> comparing the LBA at which a WRITE command finished with the zone capacity?

The sd driver does some minimal tracking already which is used for zone append
emulation.

> 
> Did I perhaps overlook something?
> 
> Thanks,
> 
> Bart.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-20 23:37           ` Damien Le Moal
@ 2023-04-20 23:44             ` Bart Van Assche
  2023-04-20 23:53               ` Damien Le Moal
  0 siblings, 1 reply; 48+ messages in thread
From: Bart Van Assche @ 2023-04-20 23:44 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Ming Lei, Matias Bjorling

On 4/20/23 16:37, Damien Le Moal wrote:
> Why would you need to handle the max active zone number restriction in the
> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
> though, working on it).

Hi Damien,

If the user (filesystem) restricts the number of active zones, the code 
for restricting the number of active zones will have to be duplicated 
into every filesystem that supports zoned devices. Wouldn't it be better 
if the I/O scheduler tracks the number of active zones?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-20 23:44             ` Bart Van Assche
@ 2023-04-20 23:53               ` Damien Le Moal
  2023-04-21  0:29                 ` Jaegeuk Kim
  0 siblings, 1 reply; 48+ messages in thread
From: Damien Le Moal @ 2023-04-20 23:53 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel
  Cc: Jens Axboe, linux-block, Jaegeuk Kim, Christoph Hellwig,
	Ming Lei, Matias Bjorling

On 4/21/23 08:44, Bart Van Assche wrote:
> On 4/20/23 16:37, Damien Le Moal wrote:
>> Why would you need to handle the max active zone number restriction in the
>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
>> though, working on it).
> 
> Hi Damien,
> 
> If the user (filesystem) restricts the number of active zones, the code 
> for restricting the number of active zones will have to be duplicated 
> into every filesystem that supports zoned devices. Wouldn't it be better 
> if the I/O scheduler tracks the number of active zones?

I do not think so. The reason is that for a file system, the block allocator
must be aware of any active zone limit of the underlying device to make the best
decision possible regarding where to allocate blocks for files and metadata. For
btrfs, we added "active block groups" management for that purpose. And we have
tracking of a block group active state so that the block allocator can start
using new block groups (inactive ones) when previously used ones become full. We
also have a "finish block group" for cases when there is not enough remaining
free blocks in a group/zone (this does a finish zone operation to make a
non-full zone full, that is, inactive).

Even if the block IO scheduler were to track active zones, the FS would still
need to do its own tracking (e.g. to be able to finish zones when needed). So I
do not see the point in having the block scheduler doing anything about active
zones.
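
In pseudo code, the kind of constraint the FS block allocator has to
implement looks something like this (all names made up, not actual btrfs
code):

/* Sketch: pick a block group without exceeding the active zone limit. */
static struct block_group *pick_block_group(struct fs_info *fs, u64 len)
{
	struct block_group *bg;

	/* Prefer a block group whose zone is already active. */
	bg = find_active_block_group_with_room(fs, len);
	if (bg)
		return bg;

	/* Activating a new zone: finish one first if at the limit. */
	if (fs->nr_active_zones >= fs->max_active_zones)
		finish_one_nearly_full_zone(fs);

	return activate_new_block_group(fs);
}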



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-20 23:53               ` Damien Le Moal
@ 2023-04-21  0:29                 ` Jaegeuk Kim
  2023-04-21  1:52                   ` Damien Le Moal
  0 siblings, 1 reply; 48+ messages in thread
From: Jaegeuk Kim @ 2023-04-21  0:29 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Niklas Cassel, Jens Axboe, linux-block,
	Christoph Hellwig, Ming Lei, Matias Bjorling

On 04/21, Damien Le Moal wrote:
> On 4/21/23 08:44, Bart Van Assche wrote:
> > On 4/20/23 16:37, Damien Le Moal wrote:
> >> Why would you need to handle the max active zone number restriction in the
> >> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
> >> though, working on it).
> > 
> > Hi Damien,
> > 
> > If the user (filesystem) restricts the number of active zones, the code 
> > for restricting the number of active zones will have to be duplicated 
> > into every filesystem that supports zoned devices. Wouldn't it be better 
> > if the I/O scheduler tracks the number of active zones?
> 
> I do not think so. The reason is that for a file system, the block allocator
> must be aware of any active zone limit of the underlying device to make the best
> decision possible regarding where to allocate blocks for files and metadata. For

Well, I'm wondering what kind of decision FS can make when allocating zones?

> btrfs, we added "active block groups" management for that purpose. And we have
> tracking of a block group active state so that the block allocator can start
> using new block groups (inactive ones) when previously used ones become full. We
> also have a "finish block group" for cases when there is not enough remaining
> free blocks in a group/zone (this does a finish zone operation to make a
> non-full zone full, that is, inactive).
> 
> Even if the block IO scheduler were to track active zones, the FS would still
> need to do its own tracking (e.g. to be able to finish zones when needed). So I

Why does FS also need to track the # of open zones redundantly? I have two
concerns if FS needs to force it:
1) performance - waiting for finish_zone before allocating a new zone can break
the IO pipeline and block FS operations.
2) multiple partition support - if F2FS uses two partitions, one on a
conventional partition while the other is on a zoned partition, we have to
maintain such a tracking mechanism on the zoned partition only, which adds
some code complexity.

In general, doesn't it make sense that FS (not zoned-device FS) just needs to
guarantee sequential writes per zone, while IO scheduler needs to limit the #
of open zones gracefully?

> do not see the point in having the block scheduler doing anything about active
> zones.
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-21  0:29                 ` Jaegeuk Kim
@ 2023-04-21  1:52                   ` Damien Le Moal
  2023-04-21 20:15                     ` Jaegeuk Kim
  0 siblings, 1 reply; 48+ messages in thread
From: Damien Le Moal @ 2023-04-21  1:52 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Bart Van Assche, Niklas Cassel, Jens Axboe, linux-block,
	Christoph Hellwig, Ming Lei, Matias Bjorling

On 4/21/23 09:29, Jaegeuk Kim wrote:
> On 04/21, Damien Le Moal wrote:
>> On 4/21/23 08:44, Bart Van Assche wrote:
>>> On 4/20/23 16:37, Damien Le Moal wrote:
>>>> Why would you need to handle the max active zone number restriction in the
>>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
>>>> though, working on it).
>>>
>>> Hi Damien,
>>>
>>> If the user (filesystem) restricts the number of active zones, the code 
>>> for restricting the number of active zones will have to be duplicated 
>>> into every filesystem that supports zoned devices. Wouldn't it be better 
>>> if the I/O scheduler tracks the number of active zones?
>>
>> I do not think so. The reason is that for a file system, the block allocator
>> must be aware of any active zone limit of the underlying device to make the best
>> decision possible regarding where to allocate blocks for files and metadata. For
> 
> Well, I'm wondering what kind of decision FS can make when allocating zones?

Not sure what you mean... It is very similar to the regular block device case.
The FS does block allocation based on whatever block placement policy it
wants/needs to implement. With zoned devices, the FS block management objects
are mapped to zones (btrfs: block group == zone, f2fs: section == zone) and the
active zone limit simply adds one more constraint regarding the selection of
block groups for allocating blocks. This is a resource management issue.

>> btrfs, we added "active block groups" management for that purpose. And we have
>> tracking of a block group active state so that the block allocator can start
>> using new block groups (inactive ones) when previously used ones become full. We
>> also have a "finish block group" for cases when there is not enough remaining
>> free blocks in a group/zone (this does a finish zone operation to make a
>> non-full zone full, that is, inactive).
>>
>> Even if the block IO scheduler were to track active zones, the FS would still
>> need to do its own tracking (e.g. to be able to finish zones when needed). So I
> 
> Why does FS also need to track the # of open zones redundantly? I have two

Because the FS should not be issuing writes to a zone that cannot be activated
on the device, e.g. starting to write to an empty zone when there are already N
zones being written or partly written, with N >= max active zones, will result
in IO failures. Even if you have active zone tracking in the block IO scheduler,
there is absolutely NOTHING that the scheduler can do to prevent such errors.
E.g. think of this simple case:
1) Let's take a device with max active zones = N (N != 0)
2) The FS implements a block allocation policy which results in new files being
written to empty block groups, with 1 block group == 1 zone
3) User writes N files of 4KB

After step 3, the device has N active zones. So if the user tries to write a new
file, it will fail (cannot write to an empty zone as that will result in an
error because that zone cannot be activated by the device). AND the FS cannot
write metadata for these files into a metadata block group.

There is nothing that the IO scheduler can do about all this. The FS has to be
more intelligent and do block allocation also based on the current
active/inactive state of the zones used by block groups.

> concerns if FS needs to force it:
> 1) performance - waiting for finish_zone before allocating a new zone can break
> the IO pipeline and block FS operations.

The need to perform a zone finish is purely an optimization if, for instance,
you want to reduce fragmentation. E.g., if there is only 4K of free space left
in a zone and need to write a 1MB extent, you can write the last 4K of that
zone, making it inactive and write the remaining 1MB - 4KB in another zone and
you are guaranteed that this other zone can be written since you just
deactivated one zone.

But if you do not want to fragment that 1MB extent, then you must finish that
zone with only 4KB left first, to ensure that another zone can be activated.

This of course also depends on the current number of active zones N. If N < max
active zone limit, then there is no need to finish a zone.

> 2) multiple partition support - if F2FS uses two partitions, one on a
> conventional partition while the other is on a zoned partition, we have to
> maintain such a tracking mechanism on the zoned partition only, which adds
> some code complexity.

Conventional zones have no concept of active zones. All active zone resources
can be allocated to the sequential zones. And zoned block devices do not support
partitions anyway.

> In general, doesn't it make sense that FS (not zoned-device FS) just needs to
> guarantee sequential writes per zone, while IO scheduler needs to limit the #
> of open zones gracefully?

No. That will never work. See the example above: you can end up in a situation
where the drive becomes read-only (all writes failing) if the FS does not direct
block allocation & writes to zones that are already active. No amount of
trickery in the IO scheduler can change that fact.

If you want to hide the active zone limit from the FS, then what is needed is a
device mapper that remaps blocks. That is a lot more overhead than implementing
that support in the FS, and the FS can do a much better job at optimizing block
placement.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-21  1:52                   ` Damien Le Moal
@ 2023-04-21 20:15                     ` Jaegeuk Kim
  2023-04-21 22:25                       ` Damien Le Moal
  0 siblings, 1 reply; 48+ messages in thread
From: Jaegeuk Kim @ 2023-04-21 20:15 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Niklas Cassel, Jens Axboe, linux-block,
	Christoph Hellwig, Ming Lei, Matias Bjorling

On 04/21, Damien Le Moal wrote:
> On 4/21/23 09:29, Jaegeuk Kim wrote:
> > On 04/21, Damien Le Moal wrote:
> >> On 4/21/23 08:44, Bart Van Assche wrote:
> >>> On 4/20/23 16:37, Damien Le Moal wrote:
> >>>> Why would you need to handle the max active zone number restriction in the
> >>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
> >>>> though, working on it).
> >>>
> >>> Hi Damien,
> >>>
> >>> If the user (filesystem) restricts the number of active zones, the code 
> >>> for restricting the number of active zones will have to be duplicated 
> >>> into every filesystem that supports zoned devices. Wouldn't it be better 
> >>> if the I/O scheduler tracks the number of active zones?
> >>
> >> I do not think so. The reason is that for a file system, the block allocator
> >> must be aware of any active zone limit of the underlying device to make the best
> >> decision possible regarding where to allocate blocks for files and metadata. For
> > 
> > Well, I'm wondering what kind of decision FS can make when allocating zones?
> 
> Not sure what you mean... It is very similar to the regular block device case.
> The FS does block allocation based on whatever block placement policy it
> wants/needs to implement. With zoned devices, the FS block management objects
> are mapped to zones (btrfs: block group == zone, f2fs: section == zone) and the
> active zone limit simply adds one more constraint regarding the selection of
> block groups for allocating blocks. This is a resource management issue.

Ok, so it seems I overlooked there might be something in the zone allocation
policy. So, f2fs already manages 6 open zones by design.

> 
> >> btrfs, we added "active block groups" management for that purpose. And we have
> >> tracking of a block group active state so that the block allocator can start
> >> using new block groups (inactive ones) when previously used ones become full. We
> >> also have a "finish block group" for cases when there is not enough remaining
> >> free blocks in a group/zone (this does a finish zone operation to make a
> >> non-full zone full, that is, inactive).
> >>
> >> Even if the block IO scheduler were to track active zones, the FS would still
> >> need to do its own tracking (e.g. to be able to finish zones when needed). So I
> > 
> > Why does FS also need to track the # of open zones redundantly? I have two
> 
> Because the FS should not be issuing writes to a zone that cannot be activated
> on the device, e.g. starting to write to an empty zone when there are already N
> zones being written or partly written, with N >= max active zones, will result
> in IO failures. Even if you have active zone tracking in the block IO scheduler,
> there is absolutely NOTHING that the scheduler can do to prevent such errors.
> E.g. think of this simple case:
> 1) Let's take a device with max active zones = N (N != 0)
> 2) The FS implements a block allocation policy which results in new files being
> written to empty block groups, with 1 block group == 1 zone
> 3) User writes N files of 4KB
> 
> After step 3, the device has N active zones. So if the user tries to write a new
> file, it will fail (cannot write to an empty zone as that will result in an
> error because that zone cannot be activated by the device). AND the FS cannot
> write metadata for these files into a metadata block group.

I think it needs to consider block allocation vs. data writes separately. That
being said,

1) FS zone allocation: as you described, FS needs to allocate blocks per zone,
and should finish *allocating* blocks entirely in the zone before allocating a
new one once it hits the limit. Fortunately, F2FS is doing that by design, so
I don't see any need to manage the open zone limitation.

2) data writes: IO scheduler needs to control write pipeline to get the best
performance while just checking the max open zones seamlessly.

With that, FS doesn't need to wait for IO completion when allocating a new
zone.

> 
> There is nothing that the IO scheduler can do about all this. The FS has to be
> more intelligent and do block allocation also based on the current
> active/inactive state of the zones used by block groups.

TBH, I can't find any benefit to managing such active/inactive states in the FS.
Am I missing something in btrfs especially?

> 
> > concerns if FS needs to force it:
> > 1) performance - waiting for finish_zone before allocating a new zone can break
> > the IO pipeline and block FS operations.
> 
> The need to perform a zone finish is purely an optimization if, for instance,
> you want to reduce fragmentation. E.g., if there is only 4K of free space left
> in a zone and need to write a 1MB extent, you can write the last 4K of that
> zone, making it inactive and write the remaining 1MB - 4KB in another zone and
> you are guaranteed that this other zone can be written since you just
> deactivated one zone.
> 
> But if you do not want to fragment that 1MB extent, then you must finish that
> zone with only 4KB left first, to ensure that another zone can be activated.

So, why should FS be aware of that? I was expecting that, once the FS submitted
a 1MB extent, the block layer or IO scheduler would gracefully finish the old
zone and open a new one that matches the on-disk write pointers.

> 
> This of course also depends on the current number of active zones N. If N < max
> active zone limit, then there is no need to finish a zone.
> 
> > 2) multiple partition support - if F2FS uses two partitions, one on a
> > conventional partition while the other is on a zoned partition, we have to
> > maintain such a tracking mechanism on the zoned partition only, which adds
> > some code complexity.
> 
> Conventional zones have no concept of active zones. All active zone resources
> can be allocated to the sequential zones. And zoned block devices do not support
> partitions anyway.
> 
> > In general, doesn't it make sense that FS (not zoned-device FS) just needs to
> > guarantee sequential writes per zone, while IO scheduler needs to limit the #
> > of open zones gracefully?
> 
> No. That will never work. See the example above: you can end up in a situation
> where the drive becomes read-only (all writes failing) if the FS does not direct
> block allocation & writes to zones that are already active. No amount of
> trickery in the IO scheduler can change that fact.
> 
> If you want to hide the active zone limit from the FS, then what is needed is a
> device mapper that remaps blocks. That is a lot more overhead than implementing
> that support in the FS, and the FS can do a much better job at optimizing block
> placement.

Oh, I meant an FS like f2fs supporting zoned devices.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-21 20:15                     ` Jaegeuk Kim
@ 2023-04-21 22:25                       ` Damien Le Moal
  2023-04-24  6:01                         ` Christoph Hellwig
  2023-04-24 17:48                         ` Jaegeuk Kim
  0 siblings, 2 replies; 48+ messages in thread
From: Damien Le Moal @ 2023-04-21 22:25 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Bart Van Assche, Niklas Cassel, Jens Axboe, linux-block,
	Christoph Hellwig, Ming Lei, Matias Bjorling

On 4/22/23 05:15, Jaegeuk Kim wrote:
> On 04/21, Damien Le Moal wrote:
>> On 4/21/23 09:29, Jaegeuk Kim wrote:
>>> On 04/21, Damien Le Moal wrote:
>>>> On 4/21/23 08:44, Bart Van Assche wrote:
>>>>> On 4/20/23 16:37, Damien Le Moal wrote:
>>>>>> Why would you need to handle the max active zone number restriction in the
>>>>>> scheduler ? That is the user responsibility. btrfs does it (that is still buggy
>>>>>> though, working on it).
>>>>>
>>>>> Hi Damien,
>>>>>
>>>>> If the user (filesystem) restricts the number of active zones, the code 
>>>>> for restricting the number of active zones will have to be duplicated 
>>>>> into every filesystem that supports zoned devices. Wouldn't it be better 
>>>>> if the I/O scheduler tracks the number of active zones?
>>>>
>>>> I do not think so. The reason is that for a file system, the block allocator
>>>> must be aware of any active zone limit of the underlying device to make the best
>>>> decision possible regarding where to allocate blocks for files and metadata. For
>>>
>>> Well, I'm wondering what kind of decision FS can make when allocating zones?
>>
>> Not sure what you mean... It is very similar to regular block device case. The
>> FS does block allocation based on whatever block placement policy it wants/need
>> to implement. With zoned devices, the FS block management object are mapped to
>> zones (btrfs: block group == zone, f2fs: section == zone) and the active zone
>> limits simply adds one more constraint regarding the selection of block groups
>> for allocating blocks. This is a resource management issue.
> 
> Ok, so it seems I overlooked there might be something in the zone allocation
> policy. So, f2fs already manages 6 open zones by design.

Yes, so as long as the device allows for at least 6 active zones, there are no
issues with f2fs.

>>>> btrfs, we added "active block groups" management for that purpose. And we have
>>>> tracking of a block group active state so that the block allocator can start
>>>> using new block groups (inactive ones) when previously used ones become full. We
>>>> also have a "finish block group" for cases when there is not enough remaining
>>>> free blocks in a group/zone (this does a finish zone operation to make a
>>>> non-full zone full, that is, inactive).
>>>>
>>>> Even if the block IO scheduler were to track active zones, the FS would still
>>>> need to do its own tracking (e.g. to be able to finish zones when needed). So I
>>>
>>> Why does FS also need to track the # of open zones redundantly? I have two
>>
>> Because the FS should not be issuing writes to a zone that cannot be activated
>> on the device, e.g. starting to write to an empty zone when there are already N
>> zones being written or partly written, with N >= max active zones, will result
>> in IO failures. Even if you have active zone tracking in the block IO scheduler,
>> there is absolutely NOTHING that the scheduler can do to prevent such errors.
>> E.g. think of this simple case:
>> 1) Let's take a device with max active zones = N (N != 0)
>> 2) The FS implements a block allocation policy which results in new files being
>> written to empty block groups, with 1 block group == 1 zone
>> 3) User writes N files of 4KB
>>
>> After step 3, the device has N active zones. So if the user tries to write a new
>> file, it will fail (cannot write to an empty zone as that will result in an
>> error because that zone cannot be activated by the device). AND the FS cannot
>> write metadata for these files into a metadata block group.
> 
> I think it needs to consider block allocation vs. data writes separately. That
> being said,

That mostly depends on the FS design.

> 
> 1) FS zone allocation: as you described, FS needs to allocate blocks per zone,
> and should finish *allocating* blocks entirely in the zone before allocating a
> new one once it hits the limit. Fortunately, F2FS is doing that by design, so
> I don't see any need to manage the open zone limitation.

Correct for the f2fs case. btrfs is different in that respect.

> 2) data writes: IO scheduler needs to control write pipeline to get the best
> performance while just checking the max open zones seamlessly.

There is absolutely no need for the IO scheduler to check open/active zone
state. More below.

> With that, FS doesn't need to wait for IO completion when allocating a new
> zone.

Incorrect. I showed you a simple example of why. You can also consider a more
complex scenario and think about what can happen: multiple writers doing
buffered IOs through the page cache and suddenly doing an fsync. If you have
more writers than available active zones, depending on how blocks are allocated,
you'll need to wait before issuing writes for some of them to ensure that zones
can be activated. This is *NOT* a performance impact, as this synchronization is
needed; it means that you are already pounding the drive hard. Issuing more IOs
will not make the drive go faster.

>> There is nothing that the IO scheduler can do about all this. The FS has to be
>> more intelligent and do block allocation also based on the current
>> active/inactive state of the zones used by block groups.
> 
> TBH, I can't find any benefit to managing such active/inactive states in the FS.
> Am I missing something in btrfs especially?

btrfs block management is a little more complex than f2fs. For one thing, btrfs
is 100% copy on write (unlike f2fs), which means that we absolutely MUST ensure
that we can always write metadata block groups and the super block (multiple
copies). So we need some "reserved" active zone resources for that. And for file
data, given that block allocation may work much faster than actually writing
the device, you need to control the writeback process to throttle it within the
available active zone resources. This is naturally done in f2fs given that there
are at most only 6 segments/zones used at any time for writing. But btrfs needs
additional code.

>>> concerns if FS needs to force it:
>>> 1) performance - waiting for finish_zone before allocating a new zone can break
>>> the IO pipeline and block FS operations.
>>
>> The need to perform a zone finish is purely an optimization if, for instance,
>> you want to reduce fragmentation. E.g., if there is only 4K of free space left
>> in a zone and need to write a 1MB extent, you can write the last 4K of that
>> zone, making it inactive and write the remaining 1MB - 4KB in another zone and
>> you are guaranteed that this other zone can be written since you just
>> deactivated one zone.
>>
>> But if you do not want to fragment that 1MB extent, then you must finish that
>> zone with only 4KB left first, to ensure that another zone can be activated.
> 
> So, why should FS be aware of that? I was expecting that, once the FS submitted
> a 1MB extent, the block layer or IO scheduler would gracefully finish the old
> zone and open a new one that matches the on-disk write pointers.

The block IO scheduler is just that, a scheduler. It should NEVER be the source
of a new command. You cannot have the block IO scheduler issue commands. That is
not how the block layer works.

And it seems that you are assuming that block IOs make it to the scheduler in
about the same order as issued by the FS. There are no guarantees of that
happening when considering a set of different zones as the various issuers may
block on request allocation, or due to a device mapper between FS and device,
etc. Plenty of reasons for the overall write pattern to change between FS and
device. Not per zone for regular writes though, that is preserved. But that is
not the case for zone append writes that btrfs uses.

And you are forgetting the case of applications using the drive directly. You
cannot rely on the application working correctly and have the IO scheduler do
some random things about open/active zones. That will never work.
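
On the zone finish operation discussed above: from user space this maps to the
BLKFINISHZONE ioctl. A minimal sketch follows; the file descriptor and zone
geometry are assumed to come from the caller (e.g. obtained via zone report).

#include <linux/blkzoned.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Finish one zone so it stops counting against the active zone limit.
 * zone_start and zone_len are in 512-byte sectors. */
static int finish_zone(int fd, uint64_t zone_start, uint64_t zone_len)
{
        struct blk_zone_range range = {
                .sector = zone_start,
                .nr_sectors = zone_len,
        };

        /* After this succeeds, the zone is full and no longer active. */
        return ioctl(fd, BLKFINISHZONE, &range);
}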


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-21 22:25                       ` Damien Le Moal
@ 2023-04-24  6:01                         ` Christoph Hellwig
  2023-04-24 17:58                           ` Jaegeuk Kim
                                             ` (2 more replies)
  2023-04-24 17:48                         ` Jaegeuk Kim
  1 sibling, 3 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-24  6:01 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jaegeuk Kim, Bart Van Assche, Niklas Cassel, Jens Axboe,
	linux-block, Christoph Hellwig, Ming Lei, Matias Bjorling

On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote:
> >> for allocating blocks. This is a resource management issue.
> > 
> > Ok, so it seems I overlooked there might be something in the zone allocation
> > policy. So, f2fs already manages 6 open zones by design.
> 
> Yes, so as long as the device allows for at least 6 active zones, there are no
> issues with f2fs.

I don't think it's quite as rosy, because f2fs can still schedule
I/O to the old zone after already scheduling I/O to a new zone for
any of these 6 slots.  It'll need code to wait for all I/O to the old
zone to finish first, similar to btrfs.
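
A sketch of the kind of wait this implies, using userspace primitives purely
for illustration (this is not actual f2fs code): each log slot tracks the
writes in flight to its current zone and drains them before switching, so the
old and the new zone never count against the active zone limit simultaneously.

#include <pthread.h>

struct log_slot {
        pthread_mutex_t lock;
        pthread_cond_t  idle;
        int             inflight;  /* writes issued to the current zone */
};

/* Block until every write scheduled to the slot's old zone has completed;
 * only then is it safe to activate a new zone for this slot. */
static void slot_switch_zone(struct log_slot *s)
{
        pthread_mutex_lock(&s->lock);
        while (s->inflight > 0)
                pthread_cond_wait(&s->idle, &s->lock);
        pthread_mutex_unlock(&s->lock);
}

int main(void)
{
        struct log_slot s = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .idle = PTHREAD_COND_INITIALIZER,
                .inflight = 0,
        };

        slot_switch_zone(&s);   /* returns immediately: nothing in flight */
        return 0;
}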

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes
  2023-04-20 17:00         ` Bart Van Assche
@ 2023-04-24  7:00           ` Christoph Hellwig
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2023-04-24  7:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jaegeuk Kim,
	Damien Le Moal, Ming Lei

On Thu, Apr 20, 2023 at 10:00:07AM -0700, Bart Van Assche wrote:
> I'm fine with not inserting requeued requests at the head of the queue. 
> Inserting requeued requests at the head of the queue only preserves the 
> original submission order if a single request is requeued.

Yes.

> If multiple 
> requests are requeued, inserting at the head of the queue will cause 
> inversion of the order of the requeued requests.
>
> This implies that the I/O scheduler or disk controller (if no I/O scheduler 
> is configured) will become responsible for optimizing the request order if 
> requeuing happens.

I think we need to understand why these requeues even happen.
Unfortunately blk_mq_requeue_request has quite a few callers, so they'll
need a bit of an audit.

Quite a few are about resource constraints in the hardware and/or driver.
In this case I suspect it is essential that they are prioritized over
incoming new commands in the way I suggested before.

A different case is nvme_retry_req with the CRD field set, in which
case we want to wait some time before retrying this particular command,
so having new commands bypass it makes sense.

Another one is the SCSI ALUA transition delay, in which case we want
to wait before sending commands to the LU again.  In this case we
really don't want to resend any commands until the delayed kick of
the requeue list.

So I'm really not sure what the right thing to do is here.  But I'm
pretty sure just skipping head inserts for zoned locked writes is
even more wrong than what we do right now.  And I also don't see
what it would be useful for.  All zoned writes should either be
locked by higher layers, or even better just use zone append and
just get a new location assigned when requeueing as discussed
before.
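
For what it's worth, the order inversion Bart mentions is easy to see with a
toy model of head insertion (illustration only, not block layer code): each
requeued request lands in front of the previously requeued one.

#include <stdio.h>

#define NR 3

int main(void)
{
        const char *requeued[NR] = { "W1", "W2", "W3" };
        const char *queue[NR];
        int i;

        /* Head insertion: requeue order W1, W2, W3 ... */
        for (i = 0; i < NR; i++)
                queue[NR - 1 - i] = requeued[i];

        printf("dispatch order:");
        for (i = 0; i < NR; i++)
                printf(" %s", queue[i]);        /* ... dispatches W3 W2 W1 */
        printf("\n");
        return 0;
}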

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-21 22:25                       ` Damien Le Moal
  2023-04-24  6:01                         ` Christoph Hellwig
@ 2023-04-24 17:48                         ` Jaegeuk Kim
  1 sibling, 0 replies; 48+ messages in thread
From: Jaegeuk Kim @ 2023-04-24 17:48 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Niklas Cassel, Jens Axboe, linux-block,
	Christoph Hellwig, Ming Lei, Matias Bjorling

On 04/22, Damien Le Moal wrote:
> On 4/22/23 05:15, Jaegeuk Kim wrote:
> > On 04/21, Damien Le Moal wrote:
> >> On 4/21/23 09:29, Jaegeuk Kim wrote:
> >>> On 04/21, Damien Le Moal wrote:
> >>>> On 4/21/23 08:44, Bart Van Assche wrote:
> >>>>> On 4/20/23 16:37, Damien Le Moal wrote:
> >>>>>> Why would you need to handle the max active zone number restriction in the
> >>>>>> scheduler? That is the user responsibility. btrfs does it (that is still buggy
> >>>>>> though, working on it).
> >>>>>
> >>>>> Hi Damien,
> >>>>>
> >>>>> If the user (filesystem) restricts the number of active zones, the code 
> >>>>> for restricting the number of active zones will have to be duplicated 
> >>>>> into every filesystem that supports zoned devices. Wouldn't it be better 
> >>>>> if the I/O scheduler tracks the number of active zones?
> >>>>
> >>>> I do not think so. The reason is that for a file system, the block allocator
> >>>> must be aware of any active zone limit of the underlying device to make the best
> >>>> decision possible regarding where to allocate blocks for files and metadata. For
> >>>
> >>> Well, I'm wondering what kind of decision FS can make when allocating zones?
> >>
> >> Not sure what you mean... It is very similar to the regular block device case.
> >> The FS does block allocation based on whatever block placement policy it
> >> wants/needs to implement. With zoned devices, the FS block management objects
> >> are mapped to zones (btrfs: block group == zone, f2fs: section == zone) and
> >> the active zone limits simply add one more constraint regarding the selection
> >> of block groups for allocating blocks. This is a resource management issue.
> > 
> > Ok, so it seems I overlooked there might be something in the zone allocation
> > policy. So, f2fs already manages 6 open zones by design.
> 
> Yes, so as long as the device allows for at least 6 active zones, there are no
> issues with f2fs.
> 
> >>>> btrfs, we added "active block groups" management for that purpose. And we have
> >>>> tracking of a block group active state so that the block allocator can start
> >>>> using new block groups (inactive ones) when previously used ones become full. We
> >>>> also have a "finish block group" for cases when there are not enough remaining
> >>>> free blocks in a group/zone (this does a finish zone operation to make a
> >>>> non-full zone full, that is, inactive).
> >>>>
> >>>> Even if the block IO scheduler were to track active zones, the FS would still
> >>>> need to do its own tracking (e.g. to be able to finish zones when needed). So I
> >>>
> >>> Why does FS also need to track the # of open zones redundantly? I have two
> >>
> >> Because the FS should not be issuing writes to a zone that cannot be activated
> >> on the device, e.g. starting to write to an empty zone when there are already N
> >> zones being written or partly written, with N >= max active zones, will result
> >> in IO failures. Even if you have active zone tracking in the block IO scheduler,
> >> there is absolutely NOTHING that the scheduler can do to prevent such errors.
> >> E.g. think of this simple case:
> >> 1) Let's take a device with max active zones = N (N != 0)
> >> 2) The FS implements a block allocation policy which results in new files being
> >> written to empty block groups, with 1 block group == 1 zone
> >> 3) User writes N files of 4KB
> >>
> >> After step 3, the device has N active zones. So if the user tries to write a new
> >> file, it will fail (cannot write to an empty zone as that will result in an
> >> error because that zone cannot be activated by the device). AND the FS cannot
> >> write metadata for these files into a metadata block group.
> > 
> > I think it needs to consider block allocation vs. data writes separately. That
> > being said,
> 
> That mostly depends on the FS design.
> 
> > 
> > 1) FS zone allocation: as you described, FS needs to allocate blocks per zone,
> > and should finish *allocating* blocks entirely in the zone before allocating a
> > new one if it meets the limit. Fortunately, F2FS is doing that by design, so
> > I don't see any need to manage the open zone limitation.
> 
> Correct for f2fs case. btrfs is different in that respect.
> 
> > 2) data writes: the IO scheduler needs to control the write pipeline to get
> > the best performance while seamlessly checking the max open zones limit.
> 
> There is absolutely no need for the IO scheduler to check open/active zones
> state. More below.
> 
> > With that, FS doesn't need to wait for IO completion when allocating a new
> > zone.
> 
> Incorrect. I showed you a simple example of why. You can also consider a more
> complex scenario and think about what can happen: multiple writers doing
> buffered IOs through the page cache and suddenly issuing an fsync. If you have
> more writers than the device can have active zones then, depending on how
> blocks are allocated, you'll need to wait before issuing writes for some of
> them, to ensure that zones can be activated. This is *NOT* a performance
> impact, as this synchronization is needed; hitting it means that you are
> already pounding the drive hard. Issuing more IOs will not make the drive go
> faster.

There's no issue at all in F2FS regarding the # of open zones limit now, and I
don't see any problem with a working zoned disk either. Again, what I'm talking
about is the single case where FS needs to stop block allocation on a new zone
until the submitted writes to the old zones have finished, even though FS never
violated fully sequential writes within a zone. Yes, it's about performance,
not a functional matter.

> 
> >> There is nothing that the IO scheduler can do about all this. The FS has to be
> >> more intelligent and do block allocation also based on the current
> >> active/inactive state of the zones used by block groups.
> > 
> > TBH, I can't find any benefit to managing such active/inactive states in FS.
> > Am I missing something, in btrfs especially?
> 
> btrfs block management is a little more complex than f2fs's. For one thing,
> btrfs is 100% copy on write (unlike f2fs), which means that we absolutely MUST
> ensure that we can always write metadata block groups and the super block
> (multiple copies). So we need some "reserved" active zone resources for that.
> And for file data, given that block allocation may work much faster than
> actually writing to the device, you need to control the writeback process to
> throttle it within the available active zone resources. This is naturally done
> in f2fs, given that there are at most only 6 segments/zones used at any time
> for writing. But btrfs needs additional code.
> 
> >>> concerns if FS needs to force it:
> >>> 1) performance - waiting for finish_zone before allocating a new zone can break
> >>> the IO pipeline and block FS operations.
> >>
> >> The need to perform a zone finish is purely an optimization if, for instance,
> >> you want to reduce fragmentation. E.g., if there is only 4K of free space left
> >> in a zone and you need to write a 1MB extent, you can write the last 4K of that
> >> zone, making it inactive and write the remaining 1MB - 4KB in another zone and
> >> you are guaranteed that this other zone can be written since you just
> >> deactivated one zone.
> >>
> >> But if you do not want to fragment that 1MB extent, then you must finish that
> >> zone with only 4KB left first, to ensure that another zone can be activated.
> > 
> > So, why should FS be aware of that? I was expecting that, once FS submitted a
> > 1MB extent, the block layer or IO scheduler would gracefully finish the old
> > zone and open a new one matching the on-disk write pointers.
> 
> The block IO scheduler is just that, a scheduler. It should NEVER be the source
> of a new command. You cannot have the block IO scheduler issue commands. That is
> not how the block layer works.
> 
> And it seems that you are assuming that block IOs make it to the scheduler in
> about the same order as issued by the FS. There is no guarantee of that
> happening when considering a set of different zones, as the various issuers may
> block on request allocation, or there may be a device mapper between the FS and
> the device, etc. There are plenty of reasons for the overall write pattern to
> change between the FS and the device. The per-zone order of regular writes is
> preserved, though, but that is not the case for the zone append writes that
> btrfs uses.
> 
> And you are forgetting the case of applications using the drive directly. You
> cannot rely on the application working correctly and have the IO scheduler do
> some random things about open/active zones. That will never work.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-24  6:01                         ` Christoph Hellwig
@ 2023-04-24 17:58                           ` Jaegeuk Kim
  2023-04-24 19:05                           ` Jaegeuk Kim
  2023-04-25 13:38                           ` Damien Le Moal
  2 siblings, 0 replies; 48+ messages in thread
From: Jaegeuk Kim @ 2023-04-24 17:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Bart Van Assche, Niklas Cassel, Jens Axboe,
	linux-block, Ming Lei, Matias Bjorling

On 04/24, Christoph Hellwig wrote:
> On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote:
> > >> for allocating blocks. This is a resource management issue.
> > > 
> > > Ok, so it seems I overlooked there might be something in the zone allocation
> > > policy. So, f2fs already manages 6 open zones by design.
> > 
> > Yes, so as long as the device allows for at least 6 active zones, there are no
> > issues with f2fs.
> 
> I don't think it's quite as rosy, because f2fs can still schedule
> I/O to the old zone after already scheduling I/O to a new zone for
> any of these 6 slots.  It'll need code to wait for all I/O to the old
> zone to finish first, similar to btrfs.

F2FS should serialize all the writes across its 6 active open zones. If not, I
think it's a bug. The scenario here is: 1) open zone #1 through zone #6, 2)
allocate all blocks in zone #1, 3) submit all writes in zone #1, 4) allocate
blocks in zone #7, 5) submit all writes in zone #7, and so on.

In this scenario, I'm asking why F2FS needs to wait for all the writes to
complete between 3) and 4), which would impact performance a lot since 4)
blocks syscalls.
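
To spell out the accounting in that scenario (toy model only, not f2fs code):
block allocation alone never retires a zone; a zone only stops being active
once the writes filling it have completed on the media. So between steps 3)
and 4), zone #1 can still be holding an active-zone slot.

#include <stdio.h>

int main(void)
{
        int max_active = 6;     /* assumed device limit */
        int active = 6;         /* step 1): zones #1..#6 are open */
        int zone1_done = 0;     /* step 3): writes submitted, none completed */

        /* Step 4): allocating zone #7 needs a free active-zone slot. */
        if (!zone1_done && active == max_active)
                printf("zone #7: no slot free, must wait for zone #1 I/O\n");
        return 0;
}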

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-24  6:01                         ` Christoph Hellwig
  2023-04-24 17:58                           ` Jaegeuk Kim
@ 2023-04-24 19:05                           ` Jaegeuk Kim
  2023-04-25 13:38                           ` Damien Le Moal
  2 siblings, 0 replies; 48+ messages in thread
From: Jaegeuk Kim @ 2023-04-24 19:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Bart Van Assche, Niklas Cassel, Jens Axboe,
	linux-block, Ming Lei, Matias Bjorling

On 04/24, Christoph Hellwig wrote:
> On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote:
> > >> for allocating blocks. This is a resource management issue.
> > > 
> > > Ok, so it seems I overlooked there might be something in the zone allocation
> > > policy. So, f2fs already manages 6 open zones by design.
> > 
> > Yes, so as long as the device allows for at least 6 active zones, there are no
> > issues with f2fs.
> 
> I don't think it's quite as rosy, because f2fs can still schedule
> I/O to the old zone after already scheduling I/O to a new zone for
> any of these 6 slots.  It'll need code to wait for all I/O to the old
> zone to finish first, similar to btrfs.

Hmm, I was concerned about the small zone size impacting bandwidth, but I feel
that I can reduce the # of logs in F2FS in that case. So there's some trade-off.
Let me take another look to see if I'm missing anything else. :(

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 10/11] block: Add support for the zone capacity concept
  2023-04-24  6:01                         ` Christoph Hellwig
  2023-04-24 17:58                           ` Jaegeuk Kim
  2023-04-24 19:05                           ` Jaegeuk Kim
@ 2023-04-25 13:38                           ` Damien Le Moal
  2 siblings, 0 replies; 48+ messages in thread
From: Damien Le Moal @ 2023-04-25 13:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jaegeuk Kim, Bart Van Assche, Niklas Cassel, Jens Axboe,
	linux-block, Ming Lei, Matias Bjorling

On 4/24/23 15:01, Christoph Hellwig wrote:
> On Sat, Apr 22, 2023 at 07:25:33AM +0900, Damien Le Moal wrote:
>>>> for allocating blocks. This is a resource management issue.
>>>
>>> Ok, so it seems I overlooked there might be something in the zone allocation
>>> policy. So, f2fs already manages 6 open zones by design.
>>
>> Yes, so as long as the device allows for at least 6 active zones, there are no
>> issues with f2fs.
> 
> I don't think it's quite as rosy, because f2fs can still schedule
> I/O to the old zone after already scheduling I/O to a new zone for
> any of these 6 slots.  It'll need code to wait for all I/O to the old
> zone to finish first, similar to btrfs.

Indeed. Good point.


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2023-04-25 13:38 UTC | newest]

Thread overview: 48+ messages
2023-04-18 22:39 [PATCH v2 00/11] mq-deadline: Improve support for zoned block devices Bart Van Assche
2023-04-18 22:39 ` [PATCH v2 01/11] block: Simplify blk_req_needs_zone_write_lock() Bart Van Assche
2023-04-19  4:09   ` Christoph Hellwig
2023-04-18 22:39 ` [PATCH v2 02/11] block: Micro-optimize blk_req_needs_zone_write_lock() Bart Van Assche
2023-04-19  4:11   ` Christoph Hellwig
2023-04-19 18:30     ` Bart Van Assche
2023-04-20  5:00       ` Christoph Hellwig
2023-04-18 22:39 ` [PATCH v2 03/11] block: Introduce blk_rq_is_seq_zoned_write() Bart Van Assche
2023-04-19  4:50   ` Christoph Hellwig
2023-04-19 21:12     ` Bart Van Assche
2023-04-20  1:03       ` Damien Le Moal
2023-04-20  5:01         ` Christoph Hellwig
2023-04-18 22:39 ` [PATCH v2 04/11] block: mq-deadline: Simplify deadline_skip_seq_writes() Bart Van Assche
2023-04-19  4:52   ` Christoph Hellwig
2023-04-18 22:39 ` [PATCH v2 05/11] block: mq-deadline: Improve deadline_skip_seq_writes() Bart Van Assche
2023-04-18 22:39 ` [PATCH v2 06/11] block: mq-deadline: Disable head insertion for zoned writes Bart Van Assche
2023-04-19  4:30   ` Christoph Hellwig
2023-04-19 22:43     ` Bart Van Assche
2023-04-20  5:06       ` Christoph Hellwig
2023-04-20 17:00         ` Bart Van Assche
2023-04-24  7:00           ` Christoph Hellwig
2023-04-18 22:39 ` [PATCH v2 07/11] block: mq-deadline: Preserve write streams for all device types Bart Van Assche
2023-04-18 22:39 ` [PATCH v2 08/11] block: mq-deadline: Fix a race condition related to zoned writes Bart Van Assche
2023-04-19  5:07   ` Christoph Hellwig
2023-04-19 18:46     ` Bart Van Assche
2023-04-20  1:00       ` Damien Le Moal
2023-04-18 22:40 ` [PATCH v2 09/11] block: mq-deadline: Handle requeued requests correctly Bart Van Assche
2023-04-19  5:07   ` Christoph Hellwig
2023-04-19 23:01     ` Bart Van Assche
2023-04-20  1:07       ` Damien Le Moal
2023-04-18 22:40 ` [PATCH v2 10/11] block: Add support for the zone capacity concept Bart Van Assche
2023-04-20  9:23   ` Niklas Cassel
2023-04-20 17:12     ` Bart Van Assche
2023-04-20 22:00       ` Damien Le Moal
2023-04-20 22:51         ` Bart Van Assche
2023-04-20 23:37           ` Damien Le Moal
2023-04-20 23:44             ` Bart Van Assche
2023-04-20 23:53               ` Damien Le Moal
2023-04-21  0:29                 ` Jaegeuk Kim
2023-04-21  1:52                   ` Damien Le Moal
2023-04-21 20:15                     ` Jaegeuk Kim
2023-04-21 22:25                       ` Damien Le Moal
2023-04-24  6:01                         ` Christoph Hellwig
2023-04-24 17:58                           ` Jaegeuk Kim
2023-04-24 19:05                           ` Jaegeuk Kim
2023-04-25 13:38                           ` Damien Le Moal
2023-04-24 17:48                         ` Jaegeuk Kim
2023-04-18 22:40 ` [PATCH v2 11/11] block: mq-deadline: Respect the active zone limit Bart Van Assche
