* [PATCH 00/26] Zone write plugging
From: Damien Le Moal @ 2024-02-02  7:30 UTC
  To: linux-block, Jens Axboe, linux-scsi, Martin K. Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

This patch series introduces zone write plugging (ZWP) as the new
mechanism to control the ordering of writes to zoned block devices.
ZWP replaces zone write locking (ZWL), which is today implemented only
by mq-deadline. ZWP also allows emulating zone append operations using
regular writes for zoned devices that do not natively support this
operation (e.g. SMR HDDs). This series removes the zone append
emulation from the scsi disk driver and device mapper and switches both
to the ZWP-based emulation.

Unlike ZWL which operates on requests, ZWP operates on BIOs. A zone
write plug is simply a BIO list that is atomically manipulated using a
spinlock and a kblockd submission work. A write BIO to a zone is
"plugged" to delay its execution if a write BIO for the same zone was
already issued, that is, if a write request for the same zone is being
executed. The next plugged BIO is unplugged and issued once the write
request completes.
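
For reference, the zone write plug introduced by patch 6 of this series
boils down to the following data structure and plugging decision (a
simplified sketch taken from that patch, with flag definitions and
locking details omitted):

	struct blk_zone_wplug {
		spinlock_t		lock;		/* protects flags and bio_list */
		unsigned int		flags;		/* BLK_ZONE_WPLUG_PLUGGED, ... */
		struct bio_list		bio_list;	/* plugged (delayed) write BIOs */
		struct work_struct	bio_work;	/* kblockd BIO submission work */
	};

	/* Plugging decision for a write BIO, done under zwplug->lock. */
	if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
		/* A write is already in flight for this zone: delay the BIO. */
		blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
		return true;
	}
	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
	return false;	/* no write in flight: let the BIO execute */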

This mechanism has the following advantages:
 - It untangles zone write ordering from the block IO schedulers. This
   allows removing the restriction on using only mq-deadline for zoned
   block devices. Any block IO scheduler, including "none", can be used.
 - Zone write plugging operates on BIOs instead of requests. Plugged
   BIOs waiting for execution thus do not hold scheduling tags and so
   do not prevent other BIOs from being submitted to the device (reads
   or writes to other zones). Depending on the workload, this can
   significantly improve device utilization and performance.
 - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
   device mapper) can use ZWP. It is mandatory for the former but
   optional for the latter: BIO-based drivers can use zone write
   plugging to implement write ordering guarantees, or implement their
   own mechanism if needed.
 - The code is less invasive in the block layer and in device drivers.
   ZWP implementation is mostly limited to blk-zoned.c, with some small
   changes in blk-mq.c, blk-merge.c and bio.c.

Performance evaluation results are shown below.

The series is organized as follows:

 - Patches 1 to 5 are preparatory changes for patch 6.
 - Patch 6 introduces ZWP.
 - Patches 7 and 8 add zone append emulation to ZWP.
 - Patches 9 to 16 modify zoned block device drivers to use ZWP and
   prepare for the removal of ZWL.
 - Patches 17 to 24 remove zone write locking.
 - Finally, patches 25 and 26 improve ZWP (memory usage reduction and
   debugfs attributes).

Overall, these changes do not increase the amount of code: the
diffstat shows a small net reduction, and the reduction is in fact much
larger if comments are ignored.

Many thanks must go to Christoph Hellwig for comments and suggestions
he provided on earlier versions of these patches.

Performance evaluation results
==============================

Environments:
 - Xeon 8-cores/16-threads, 128GB of RAM
 - Kernel:
   - Baseline: 6.8-rc2, Linus tree as of 2024-02-01
   - Baseline-next: Jens block/for-next branch as of 2024-02-01
   - ZWP: Jens block/for-next patched to add zone write plugging
   (all kernels were compiled with the same configuration turning off
   most heavy debug features)

Workloads:
 - seqw4K1: 4KB sequential write, qd=1
 - seqw4K16: 4KB sequential write, qd=16
 - seqw1M16: 1MB sequential write, qd=16
 - rndw4K16: 4KB random write, qd=16
 - rndw128K16: 128KB random write, qd=16
 - btrfs workload: Single fio job writing 128 MB files using 128 KB
   direct IOs at qd=16.
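
(For reference, the workloads above map to fio jobs along the following
lines. This is an illustrative sketch only: the exact fio options,
ioengine and device path used for the runs below are assumptions, not
taken from this posting.)

	; seqw4K16: 4KB sequential write at qd=16
	[seqw4K16]
	; assumed test device
	filename=/dev/nullb0
	direct=1
	; zoned block device mode, required for sequential zone writes
	zonemode=zbd
	rw=write
	bs=4k
	iodepth=16
	ioengine=libaio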

Devices:
 - nullblk (zoned): 4096 zones of 256 MB, no zone resource limits.
 - NVMe ZNS drive: 1 TB ZNS drive with 2GB zone size, 14 max open/active
   zones.
 - SMR HDD: 26 TB disk with 256MB zone size and 128 max open zones.

For ZWP, the results show the percentage of performance increase (or
decrease) relative to the current for-next branch (Baseline-next).

1) null_blk zoned device:

               +---------+----------+----------+----------+------------+
               | seqw4K1 | seqw4K16 | seqw1M16 | rndw4K16 | rndw128K16 |
               | (MB/s)  |  (MB/s)  |  (MB/s)  |  (KIOPS) |   (KIOPS)  |
+--------------+---------+----------+----------+----------+------------+
|   Baseline   | 1005    | 881      | 15600    | 564      | 217        |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
| Baseline-next| 921     | 813      | 14300    | 817      | 330        |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 946     | 826      | 15000    | 935      | 358        |
|  mq-deadline |(+2%)    | (+1%)    | (+4%)    | (+14%)   | (+8%)      |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 2937    | 1882     | 19900    | 2286     | 709        |
|     none     | (+218%) | (+131%)  | (+39%)   | (+179%)  | (+114%)    |
+--------------+---------+----------+----------+----------+------------+

The for-next mq-deadline changes and ZWP significantly increase random
write performance but slightly reduce sequential write performance
compared to ZWL. However, ZWP's ability to run fast block devices with
the "none" scheduler results in very large performance increases for
all workloads.

2) NVMe ZNS drive:

               +---------+----------+----------+----------+------------+
               | seqw4K1 | seqw4K16 | seqw1M16 | rndw4K16 | rndw128K16 |
               | (MB/s)  |  (MB/s)  |  (MB/s)  |  (KIOPS) |   (KIOPS)  |
+--------------+---------+----------+----------+----------+------------+
|   Baseline   | 183     | 798      | 1104     | 53.5     | 14.6       |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
| Baseline-next| 180     | 261      | 1113     | 51.6     | 14.9       |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 181     | 671      | 1109     | 51.7     | 14.7       |
|  mq-deadline |(+0%)    | (+157%)  | (+0%)    | (+0%)    | (-1%)      |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 190     | 660      | 1106     | 51.4     | 15.1       |
|     none     | (+5%)   | (+152%)  | (+0%)    | (-0%)    | (+1%)      |
+--------------+---------+----------+----------+----------+------------+

The current block/for-next significantly regresses sequential small
write performance at high queue depth due to lost BIO merge
opportunities. ZWP corrects this but is not as efficient as ZWL for
this workload.

3) SMR SATA HDD:

               +---------+----------+----------+----------+------------+
               | seqw4K1 | seqw4K16 | seqw1M16 | rndw4K16 | rndw128K16 |
               | (MB/s)  |  (MB/s)  |  (MB/s)  |  (IOPS)  |   (IOPS)   |
+--------------+---------+----------+----------+----------+------------+
|   Baseline   | 121     | 251      | 251      | 2471     | 664        |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
| Baseline-next| 121     | 137      | 249      | 2428     | 649        |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 118     | 137      | 251      | 2415     | 651        |
|  mq-deadline |(-2%)    | (+0%)    | (+0%)    | (+0%)    | (+0%)      |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 117     | 238      | 251      | 2400     | 666        |
|     none     | (-3%)   | (+73%)   | (+0%)    | (-1%)    | (+2%)      |
+--------------+---------+----------+----------+----------+------------+

Same observation as for ZNS: for-next regresses sequential high-QD
performance, but ZWP restores most of it, to a level still slightly
lower than with ZWL.

4) Zone append tests using btrfs:

                +-------------+-------------+-----------+-------------+
                |  null-blk   |  null_blk   |    ZNS    |     SMR     |
                |  native ZA  | emulated ZA | native ZA | emulated ZA |
                |    (MB/s)   |   (MB/s)    |   (MB/s)  |    (MB/s)   |
+---------------+-------------+-------------+-----------+-------------+
|    Baseline   | 2412        | N/A         | 1080      | 203         |
|   mq-deadline |             |             |           |             |
+---------------+-------------+-------------+-----------+-------------+
| Baseline-next | 2471        | N/A         | 1084      | 209         |
|  mq-deadline  |             |             |           |             |
+---------------+-------------+-------------+-----------+-------------+
|      ZWP      | 2397        | 3025        | 1085      | 245         |
|  mq-deadline  | (-2%)       |             | (+0%)     | (+17%)      |
+---------------+-------------+-------------+-----------+-------------+
|      ZWP      | 2614        | 3301        | 1082      | 247         |
|      none     | (+5%)       |             | (-0%)     | (+18%)      |
+---------------+-------------+-------------+-----------+-------------+

With a more realistic use of the device by the file system, ZWP
significantly improves SMR HDD performance thanks to its more efficient
zone append emulation compared to ZWL.

Damien Le Moal (26):
  block: Restore sector of flush requests
  block: Remove req_bio_endio()
  block: Introduce bio_straddle_zones() and bio_offset_from_zone_start()
  block: Introduce blk_zone_complete_request_bio()
  block: Allow using bio_attempt_back_merge() internally
  block: Introduce zone write plugging
  block: Allow zero value of max_zone_append_sectors queue limit
  block: Implement zone append emulation
  block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
  dm: Use the block layer zone append emulation
  scsi: sd: Use the block layer zone append emulation
  ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  null_blk: Introduce zone_append_max_sectors attribute
  null_blk: Introduce fua attribute
  nvmet: zns: Do not reference the gendisk conv_zones_bitmap
  block: Remove BLK_STS_ZONE_RESOURCE
  block: Simplify blk_revalidate_disk_zones() interface
  block: mq-deadline: Remove support for zone write locking
  block: Remove elevator required features
  block: Do not check zone type in blk_check_zone_append()
  block: Move zone related debugfs attribute to blk-zoned.c
  block: Remove zone write locking
  block: Do not special-case plugging of zone write operations
  block: Reduce zone write plugging memory usage
  block: Add zone_active_wplugs debugfs entry

 block/Kconfig                     |    4 -
 block/Makefile                    |    1 -
 block/bio.c                       |    7 +
 block/blk-core.c                  |   13 +-
 block/blk-flush.c                 |    1 +
 block/blk-merge.c                 |   22 +-
 block/blk-mq-debugfs-zoned.c      |   22 -
 block/blk-mq-debugfs.c            |    4 +-
 block/blk-mq-debugfs.h            |   11 +-
 block/blk-mq.c                    |  134 ++--
 block/blk-mq.h                    |   31 -
 block/blk-settings.c              |   51 +-
 block/blk-sysfs.c                 |    2 +-
 block/blk-zoned.c                 | 1143 ++++++++++++++++++++++++++---
 block/blk.h                       |   69 +-
 block/elevator.c                  |   46 +-
 block/elevator.h                  |    1 -
 block/genhd.c                     |    2 +-
 block/mq-deadline.c               |  176 +----
 drivers/block/null_blk/main.c     |   52 +-
 drivers/block/null_blk/null_blk.h |    2 +
 drivers/block/null_blk/zoned.c    |   32 +-
 drivers/block/ublk_drv.c          |    4 +-
 drivers/block/virtio_blk.c        |    2 +-
 drivers/md/dm-core.h              |   11 +-
 drivers/md/dm-zone.c              |  470 ++----------
 drivers/md/dm.c                   |   44 +-
 drivers/md/dm.h                   |    7 -
 drivers/nvme/host/zns.c           |    2 +-
 drivers/nvme/target/zns.c         |   10 +-
 drivers/scsi/scsi_lib.c           |    1 -
 drivers/scsi/sd.c                 |    8 -
 drivers/scsi/sd.h                 |   19 -
 drivers/scsi/sd_zbc.c             |  335 +--------
 include/linux/blk-mq.h            |   85 +--
 include/linux/blk_types.h         |   30 +-
 include/linux/blkdev.h            |  102 ++-
 37 files changed, 1453 insertions(+), 1503 deletions(-)
 delete mode 100644 block/blk-mq-debugfs-zoned.c

-- 
2.43.0



* [PATCH 01/26] block: Restore sector of flush requests
From: Damien Le Moal @ 2024-02-02  7:30 UTC
  To: linux-block, Jens Axboe, linux-scsi, Martin K. Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On completion of a flush sequence, blk_flush_restore_request() restores
the bio of a request to the originally submitted BIO. However, the last
use of the request in the flush sequence may have been for a POSTFLUSH
which does not have a sector. So make sure to restore the request sector
using the iter sector of the original BIO. This BIO has not yet changed
at this point, since the intermediate steps of the flush sequence
complete by requeueing the request until all steps are completed.

Restoring the request sector ensures that blk_mq_end_request() will see
a valid sector as originally set when the flush BIO was submitted.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-flush.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index b0f314f4bc14..2f58ae018464 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -130,6 +130,7 @@ static void blk_flush_restore_request(struct request *rq)
 	 * original @rq->bio.  Restore it.
 	 */
 	rq->bio = rq->biotail;
+	rq->__sector = rq->bio->bi_iter.bi_sector;
 
 	/* make @rq a normal request */
 	rq->rq_flags &= ~RQF_FLUSH_SEQ;
-- 
2.43.0



* [PATCH 02/26] block: Remove req_bio_endio()
From: Damien Le Moal @ 2024-02-02  7:30 UTC
  To: linux-block, Jens Axboe, linux-scsi, Martin K. Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Moving the req_bio_endio() code into its only caller,
blk_update_request(), allows reducing accesses to and tests of bio and
request fields. Also, given that partial completions of zone append
operations are not possible and that zone append operations cannot be
merged, the update of the BIO sector using the request sector for these
operations can be moved directly before the call to bio_endio().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq.c | 66 ++++++++++++++++++++++++--------------------------
 1 file changed, 31 insertions(+), 35 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 21cd54ad1873..bfebb8fcd248 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -763,36 +763,6 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
 }
 EXPORT_SYMBOL(blk_dump_rq_flags);
 
-static void req_bio_endio(struct request *rq, struct bio *bio,
-			  unsigned int nbytes, blk_status_t error)
-{
-	if (unlikely(error)) {
-		bio->bi_status = error;
-	} else if (req_op(rq) == REQ_OP_ZONE_APPEND) {
-		/*
-		 * Partial zone append completions cannot be supported as the
-		 * BIO fragments may end up not being written sequentially.
-		 * For such case, force the completed nbytes to be equal to
-		 * the BIO size so that bio_advance() sets the BIO remaining
-		 * size to 0 and we end up calling bio_endio() before returning.
-		 */
-		if (bio->bi_iter.bi_size != nbytes) {
-			bio->bi_status = BLK_STS_IOERR;
-			nbytes = bio->bi_iter.bi_size;
-		} else {
-			bio->bi_iter.bi_sector = rq->__sector;
-		}
-	}
-
-	bio_advance(bio, nbytes);
-
-	if (unlikely(rq->rq_flags & RQF_QUIET))
-		bio_set_flag(bio, BIO_QUIET);
-	/* don't actually finish bio if it's part of flush sequence */
-	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
-		bio_endio(bio);
-}
-
 static void blk_account_io_completion(struct request *req, unsigned int bytes)
 {
 	if (req->part && blk_do_io_stat(req)) {
@@ -896,6 +866,8 @@ static void blk_complete_request(struct request *req)
 bool blk_update_request(struct request *req, blk_status_t error,
 		unsigned int nr_bytes)
 {
+	bool is_flush = req->rq_flags & RQF_FLUSH_SEQ;
+	bool quiet = req->rq_flags & RQF_QUIET;
 	int total_bytes;
 
 	trace_block_rq_complete(req, error, nr_bytes);
@@ -916,9 +888,8 @@ bool blk_update_request(struct request *req, blk_status_t error,
 	if (blk_crypto_rq_has_keyslot(req) && nr_bytes >= blk_rq_bytes(req))
 		__blk_crypto_rq_put_keyslot(req);
 
-	if (unlikely(error && !blk_rq_is_passthrough(req) &&
-		     !(req->rq_flags & RQF_QUIET)) &&
-		     !test_bit(GD_DEAD, &req->q->disk->state)) {
+	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
+	    !test_bit(GD_DEAD, &req->q->disk->state)) {
 		blk_print_req_error(req, error);
 		trace_block_rq_error(req, error, nr_bytes);
 	}
@@ -930,12 +901,37 @@ bool blk_update_request(struct request *req, blk_status_t error,
 		struct bio *bio = req->bio;
 		unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
 
-		if (bio_bytes == bio->bi_iter.bi_size)
+		if (unlikely(error))
+			bio->bi_status = error;
+
+		if (bio_bytes == bio->bi_iter.bi_size) {
 			req->bio = bio->bi_next;
+		} else if (req_op(req) == REQ_OP_ZONE_APPEND) {
+			/*
+			 * Partial zone append completions cannot be supported
+			 * as the BIO fragments may end up not being written
+			 * sequentially. For such case, force the completed
+			 * nbytes to be equal to the BIO size so that
+			 * bio_advance() sets the BIO remaining size to 0 and
+			 * we end up calling bio_endio() before returning.
+			 */
+			bio->bi_status = BLK_STS_IOERR;
+			bio_bytes = bio->bi_iter.bi_size;
+		}
 
 		/* Completion has already been traced */
 		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
-		req_bio_endio(req, bio, bio_bytes, error);
+		if (unlikely(quiet))
+			bio_set_flag(bio, BIO_QUIET);
+
+		bio_advance(bio, bio_bytes);
+
+		/* Don't actually finish bio if it's part of flush sequence */
+		if (!bio->bi_iter.bi_size && !is_flush) {
+			if (req_op(req) == REQ_OP_ZONE_APPEND)
+				bio->bi_iter.bi_sector = req->__sector;
+			bio_endio(bio);
+		}
 
 		total_bytes += bio_bytes;
 		nr_bytes -= bio_bytes;
-- 
2.43.0



* [PATCH 03/26] block: Introduce bio_straddle_zones() and bio_offset_from_zone_start()
From: Damien Le Moal @ 2024-02-02  7:30 UTC
  To: linux-block, Jens Axboe, linux-scsi, Martin K. Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Implement the inline helper functions bio_straddle_zones() and
bio_offset_from_zone_start() to respectively test if a BIO crosses a
zone boundary (the start sector and last sector belong to different
zones) and to obtain the offset of a BIO from the start of its target
zone.
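
As an illustration of the intended use, patch 6 of this series relies
on bio_straddle_zones() to catch improperly split write BIOs before
plugging them. Condensed from that patch:

	/* Zone write plugging requires BIOs fully contained in one zone. */
	if (WARN_ON_ONCE(bio_straddle_zones(bio))) {
		bio_io_error(bio);
		return true;
	}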

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 include/linux/blkdev.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d7cac3de65b3..0bb897f0501c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -845,6 +845,12 @@ static inline unsigned int bio_zone_no(struct bio *bio)
 	return disk_zone_no(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector);
 }
 
+static inline bool bio_straddle_zones(struct bio *bio)
+{
+	return bio_zone_no(bio) !=
+		disk_zone_no(bio->bi_bdev->bd_disk, bio_end_sector(bio) - 1);
+}
+
 static inline unsigned int bio_zone_is_seq(struct bio *bio)
 {
 	return disk_zone_is_seq(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector);
@@ -1297,6 +1303,12 @@ static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
 	return sector & (bdev_zone_sectors(bdev) - 1);
 }
 
+static inline sector_t bio_offset_from_zone_start(struct bio *bio)
+{
+	return bdev_offset_from_zone_start(bio->bi_bdev,
+					   bio->bi_iter.bi_sector);
+}
+
 static inline bool bdev_is_zone_start(struct block_device *bdev,
 				      sector_t sector)
 {
-- 
2.43.0



* [PATCH 04/26] block: Introduce blk_zone_complete_request_bio()
From: Damien Le Moal @ 2024-02-02  7:30 UTC
  To: linux-block, Jens Axboe, linux-scsi, Martin K. Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On completion of a zone append request, the request sector indicates the
location of the written data. This value must be returned to the user
through the BIO iter sector. This is done in two places: in
blk_complete_request() and in blk_update_request(). Introduce the inline
helper function blk_zone_complete_request_bio() to avoid duplicating
this BIO update for zone append requests, and to compile out this
helper call when CONFIG_BLK_DEV_ZONED is not enabled.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq.c | 11 +++++------
 block/blk.h    | 19 ++++++++++++++++++-
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bfebb8fcd248..f02e486a02ae 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -822,11 +822,11 @@ static void blk_complete_request(struct request *req)
 		/* Completion has already been traced */
 		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
 
-		if (req_op(req) == REQ_OP_ZONE_APPEND)
-			bio->bi_iter.bi_sector = req->__sector;
-
-		if (!is_flush)
+		if (!is_flush) {
+			blk_zone_complete_request_bio(req, bio);
 			bio_endio(bio);
+		}
+
 		bio = next;
 	} while (bio);
 
@@ -928,8 +928,7 @@ bool blk_update_request(struct request *req, blk_status_t error,
 
 		/* Don't actually finish bio if it's part of flush sequence */
 		if (!bio->bi_iter.bi_size && !is_flush) {
-			if (req_op(req) == REQ_OP_ZONE_APPEND)
-				bio->bi_iter.bi_sector = req->__sector;
+			blk_zone_complete_request_bio(req, bio);
 			bio_endio(bio);
 		}
 
diff --git a/block/blk.h b/block/blk.h
index 913c93838a01..23f76b452e70 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -396,12 +396,29 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
 
 #ifdef CONFIG_BLK_DEV_ZONED
 void disk_free_zone_bitmaps(struct gendisk *disk);
+static inline void blk_zone_complete_request_bio(struct request *rq,
+						 struct bio *bio)
+{
+	/*
+	 * For zone append requests, the request sector indicates the location
+	 * at which the BIO data was written. Return this value to the BIO
+	 * issuer through the BIO iter sector.
+	 */
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		bio->bi_iter.bi_sector = rq->__sector;
+}
 int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 		unsigned long arg);
 int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		unsigned int cmd, unsigned long arg);
 #else /* CONFIG_BLK_DEV_ZONED */
-static inline void disk_free_zone_bitmaps(struct gendisk *disk) {}
+static inline void disk_free_zone_bitmaps(struct gendisk *disk)
+{
+}
+static inline void blk_zone_complete_request_bio(struct request *rq,
+						 struct bio *bio)
+{
+}
 static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
 		unsigned int cmd, unsigned long arg)
 {
-- 
2.43.0



* [PATCH 05/26] block: Allow using bio_attempt_back_merge() internally
From: Damien Le Moal @ 2024-02-02  7:30 UTC
  To: linux-block, Jens Axboe, linux-scsi, Martin K. Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Remove the static definition of bio_attempt_back_merge() to allow using
this function internally from other block layer files. Add the
definition of enum bio_merge_status and the declaration of
bio_attempt_back_merge() to block/blk.h.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-merge.c | 8 +-------
 block/blk.h       | 8 ++++++++
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2d470cf2173e..a1ef61b03e31 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -964,13 +964,7 @@ static void blk_account_io_merge_bio(struct request *req)
 	part_stat_unlock();
 }
 
-enum bio_merge_status {
-	BIO_MERGE_OK,
-	BIO_MERGE_NONE,
-	BIO_MERGE_FAILED,
-};
-
-static enum bio_merge_status bio_attempt_back_merge(struct request *req,
+enum bio_merge_status bio_attempt_back_merge(struct request *req,
 		struct bio *bio, unsigned int nr_segs)
 {
 	const blk_opf_t ff = bio_failfast(bio);
diff --git a/block/blk.h b/block/blk.h
index 23f76b452e70..5180e103ed9c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -256,6 +256,14 @@ static inline void bio_integrity_free(struct bio *bio)
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
 
+enum bio_merge_status {
+	BIO_MERGE_OK,
+	BIO_MERGE_NONE,
+	BIO_MERGE_FAILED,
+};
+
+enum bio_merge_status bio_attempt_back_merge(struct request *req,
+		struct bio *bio, unsigned int nr_segs);
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs);
 bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
-- 
2.43.0



* [PATCH 06/26] block: Introduce zone write plugging
From: Damien Le Moal @ 2024-02-02  7:30 UTC
  To: linux-block, Jens Axboe, linux-scsi, Martin K. Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Zone write plugging implements a per-zone "plug" for write operations to
tightly control the submission and execution order of writes to
sequential write required zones of a zoned block device. Per-zone
plugging of writes guarantees that at any time at most one write request
per zone is in flight. This mechanism is intended to replace zone write
locking which is controlled at the scheduler level and implemented only
by mq-deadline.

Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.

This mechanism has the following advantages:
 - It untangles zone write ordering from block IO schedulers. This
   allows removing the restriction on using only mq-deadline for zoned
   block devices. Any block IO scheduler, including "none", can be used.
 - Zone write plugging operates on BIOs instead of requests. Plugged
   BIOs waiting for execution thus do not hold scheduling tags and so
   do not prevent other BIOs from proceeding (reads or writes to other
   zones). Depending on the workload, this can significantly improve
   device utilization and performance.
 - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
   device mapper) can use zone write plugging. It is mandatory for the
   former but optional for the latter: BIO-based drivers can use zone
   write plugging to implement write ordering guarantees, or implement
   their own mechanism if needed.
 - The code is less invasive in the block layer and is mostly limited to
   blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
   bio.c.

Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs.

Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio() which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split, to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugging. This enables zone write plugging by default for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by explicitly calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and not
straddling zone boundaries.
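
For BIO-based drivers, the hook-up would look roughly as follows (a
sketch only: my_dev_submit_bio() and my_dev_split_bio() are made-up
names for illustration, not part of this series, and the nr_segs value
passed here is an assumption since it is only used for request merging
on the blk-mq path):

	static void my_dev_submit_bio(struct bio *bio)
	{
		/* The driver must split first: no zone boundary crossing. */
		bio = my_dev_split_bio(bio);

		/* Let ZWP delay the BIO if a write is in flight for its zone. */
		if (blk_zone_write_plug_bio(bio, 0))
			return;

		/* ... issue @bio to the underlying device ... */
	}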

Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of BIOs and requests flagged this way respectively
triggers calls to blk_zone_write_plug_bio_endio() and
blk_zone_write_plug_complete_request(). The latter function is used to
trigger the submission of the next plugged BIO using the zone plug
work. blk_zone_write_plug_bio_endio() does the same for BIO-based
devices. This ensures that at any time, at most one request (blk-mq
devices) or one BIO (BIO-based devices) is being executed for any zone.
The handling of zone write plugs using a per-zone spinlock maximizes
parallelism and device usage by allowing multiple zones to be written
simultaneously without lock contention.

Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.

Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance loss, blk_mq_submit_bio() calls the function
blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs
with the one just unplugged. Successful merging is signaled using
blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge().
Furthermore, to avoid recalculating the number of segments of plugged
BIOs to attempt merging, the number of segments of a plugged BIO is
saved using the new struct bio field __bi_nr_segments. To avoid growing
the size of struct bio, this field is added as a union with the
bio_cookie field. This is safe to do as polling is always disabled for
plugged BIOs.

When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
when the plugged BIO is unplugged and submitted again using
submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already
flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds
directly to allocating a new request for the BIO, re-using the usage
reference count taken when the BIO was plugged. This extra reference
count is dropped in blk_zone_write_plug_attempt_merge() for any plugged
BIO that is successfully merged. Given that BIO-based devices will not
take this path, the extra reference is dropped when a plugged BIO is
unplugged and submitted.

To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources().

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/bio.c               |   7 +
 block/blk-merge.c         |  11 +
 block/blk-mq.c            |  28 +++
 block/blk-zoned.c         | 408 +++++++++++++++++++++++++++++++++++++-
 block/blk.h               |  32 ++-
 block/genhd.c             |   2 +-
 include/linux/blk-mq.h    |   2 +
 include/linux/blk_types.h |   8 +-
 include/linux/blkdev.h    |   8 +
 9 files changed, 496 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b9642a41f286..c8b0f7e8c713 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1581,6 +1581,13 @@ void bio_endio(struct bio *bio)
 	if (!bio_integrity_endio(bio))
 		return;
 
+	/*
+	 * For BIOs handled through a zone write plug, signal the end of the
+	 * BIO to the zone write plug to submit the next plugged BIO.
+	 */
+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_bio_endio(bio);
+
 	rq_qos_done_bio(bio);
 
 	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index a1ef61b03e31..2b5489cd9c65 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -377,6 +377,7 @@ struct bio *__bio_split_to_limits(struct bio *bio,
 		blkcg_bio_issue_init(split);
 		bio_chain(split, bio);
 		trace_block_split(split, bio->bi_iter.bi_sector);
+		WARN_ON_ONCE(bio_zone_write_plugging(bio));
 		submit_bio_noacct(bio);
 		return split;
 	}
@@ -980,6 +981,9 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
 
 	blk_update_mixed_merge(req, bio, false);
 
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		blk_zone_write_plug_bio_merged(bio);
+
 	req->biotail->bi_next = bio;
 	req->biotail = bio;
 	req->__data_len += bio->bi_iter.bi_size;
@@ -995,6 +999,13 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
 {
 	const blk_opf_t ff = bio_failfast(bio);
 
+	/*
+	 * A front merge for zone writes can happen only if the user submitted
+	 * writes out of order. Do not attempt this to let the write fail.
+	 */
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		return BIO_MERGE_FAILED;
+
 	if (!ll_front_merge_fn(req, bio, nr_segs))
 		return BIO_MERGE_FAILED;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f02e486a02ae..aa49bebf1199 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -830,6 +830,9 @@ static void blk_complete_request(struct request *req)
 		bio = next;
 	} while (bio);
 
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		blk_zone_write_plug_complete_request(req);
+
 	/*
 	 * Reset counters so that the request stacking driver
 	 * can find how many bytes remain in the request
@@ -943,6 +946,9 @@ bool blk_update_request(struct request *req, blk_status_t error,
 	 * completely done
 	 */
 	if (!req->bio) {
+		if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+			blk_zone_write_plug_complete_request(req);
+
 		/*
 		 * Reset counters so that the request stacking driver
 		 * can find how many bytes remain in the request
@@ -2975,6 +2981,17 @@ void blk_mq_submit_bio(struct bio *bio)
 	struct request *rq;
 	blk_status_t ret;
 
+	/*
+	 * A BIO that was released from a zone write plug has already been
+	 * through the preparation in this function, already holds a reference
+	 * on the queue usage counter, and is the only write BIO in-flight for
+	 * the target zone. Go straight to allocating a request for it.
+	 */
+	if (bio_zone_write_plugging(bio)) {
+		nr_segs = bio->__bi_nr_segments;
+		goto new_request;
+	}
+
 	bio = blk_queue_bounce(bio, q);
 	bio_set_ioprio(bio);
 
@@ -3001,7 +3018,11 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
 		goto queue_exit;
 
+	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
+		goto queue_exit;
+
 	if (!rq) {
+new_request:
 		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
 		if (unlikely(!rq))
 			goto queue_exit;
@@ -3017,8 +3038,12 @@ void blk_mq_submit_bio(struct bio *bio)
 
 	ret = blk_crypto_rq_get_keyslot(rq);
 	if (ret != BLK_STS_OK) {
+		bool zwplugging = bio_zone_write_plugging(bio);
+
 		bio->bi_status = ret;
 		bio_endio(bio);
+		if (zwplugging)
+			blk_zone_write_plug_complete_request(rq);
 		blk_mq_free_request(rq);
 		return;
 	}
@@ -3026,6 +3051,9 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
 		return;
 
+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_attempt_merge(rq);
+
 	if (plug) {
 		blk_add_rq_to_plug(plug, rq);
 		return;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index d343e5756a9c..f6d4f511b664 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -7,11 +7,11 @@
  *
  * Copyright (c) 2016, Damien Le Moal
  * Copyright (c) 2016, Western Digital
+ * Copyright (c) 2024, Western Digital Corporation or its affiliates.
  */
 
 #include <linux/kernel.h>
 #include <linux/module.h>
-#include <linux/rbtree.h>
 #include <linux/blkdev.h>
 #include <linux/blk-mq.h>
 #include <linux/mm.h>
@@ -19,6 +19,7 @@
 #include <linux/sched/mm.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
 static const char *const zone_cond_name[] = {
@@ -33,6 +34,27 @@ static const char *const zone_cond_name[] = {
 };
 #undef ZONE_COND_NAME
 
+/*
+ * Per-zone write plug.
+ */
+struct blk_zone_wplug {
+	spinlock_t		lock;
+	unsigned int		flags;
+	struct bio_list		bio_list;
+	struct work_struct	bio_work;
+};
+
+/*
+ * Zone write plug flags bits:
+ *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
+ *    to these zones are never plugged.
+ *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
+ *    that is, that write BIOs are being throttled due to a write BIO already
+ *    being executed or the zone write plug bio list is not empty.
+ */
+#define BLK_ZONE_WPLUG_CONV	(1U << 0)
+#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
+
 /**
  * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
  * @zone_cond: BLK_ZONE_COND_XXX.
@@ -429,12 +451,374 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 	return ret;
 }
 
-void disk_free_zone_bitmaps(struct gendisk *disk)
+#define blk_zone_wplug_lock(zwplug, flags) \
+	spin_lock_irqsave(&zwplug->lock, flags)
+
+#define blk_zone_wplug_unlock(zwplug, flags) \
+	spin_unlock_irqrestore(&zwplug->lock, flags)
+
+static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+
+	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+	bio_io_error(bio);
+	blk_queue_exit(q);
+}
+
+static int blk_zone_wplug_abort(struct gendisk *disk,
+				struct blk_zone_wplug *zwplug)
+{
+	struct bio *bio;
+	int nr_aborted = 0;
+
+	while ((bio = bio_list_pop(&zwplug->bio_list))) {
+		blk_zone_wplug_bio_io_error(bio);
+		nr_aborted++;
+	}
+
+	return nr_aborted;
+}
+
+/*
+ * Return the zone write plug for sector in sequential write required zone.
+ * Given that conventional zones have no write ordering constraints, NULL is
+ * returned for sectors in conventional zones, to indicate that zone write
+ * plugging is not needed.
+ */
+static inline struct blk_zone_wplug *
+disk_lookup_zone_wplug(struct gendisk *disk, sector_t sector)
+{
+	struct blk_zone_wplug *zwplug;
+
+	if (WARN_ON_ONCE(!disk->zone_wplugs))
+		return NULL;
+
+	zwplug = &disk->zone_wplugs[disk_zone_no(disk, sector)];
+	if (zwplug->flags & BLK_ZONE_WPLUG_CONV)
+		return NULL;
+	return zwplug;
+}
+
+static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
+{
+	return disk_lookup_zone_wplug(bio->bi_bdev->bd_disk,
+				      bio->bi_iter.bi_sector);
+}
+
+static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
+					  struct bio *bio, unsigned int nr_segs)
+{
+	/*
+	 * Keep a reference on the BIO request queue usage. This reference will
+	 * be dropped either if the BIO is failed or after it is issued and
+	 * completes.
+	 */
+	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
+
+	/*
+	 * The BIO is being plugged and thus will have to wait for the on-going
+	 * write and for all other writes already plugged. So polling makes
+	 * no sense.
+	 */
+	bio_clear_polled(bio);
+
+	/*
+	 * Reuse the poll cookie field to store the number of segments when
+	 * split to the hardware limits.
+	 */
+	bio->__bi_nr_segments = nr_segs;
+
+	/*
+	 * We always receive BIOs after they are split and ready to be issued.
+	 * The block layer passes the parts of a split BIO in order, and the
+	 * user must also issue writes sequentially. So simply add the new BIO
+	 * at the tail of the list to preserve the sequential write order.
+	 */
+	bio_list_add(&zwplug->bio_list, bio);
+}
+
+/*
+ * Called from bio_attempt_back_merge() when a BIO was merged with a request.
+ */
+void blk_zone_write_plug_bio_merged(struct bio *bio)
+{
+	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+}
+
+/*
+ * Attempt to merge plugged BIOs with a newly formed request of a BIO that went
+ * through zone write plugging (either a new BIO or one that was unplugged).
+ */
+void blk_zone_write_plug_attempt_merge(struct request *req)
+{
+	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(req->bio);
+	sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
+	struct request_queue *q = req->q;
+	unsigned long flags;
+	struct bio *bio;
+
+	/*
+	 * Completion of this request needs to be handled with
+	 * blk_zone_write_plug_complete_request().
+	 */
+	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
+
+	if (blk_queue_nomerges(q))
+		return;
+
+	/*
+	 * Walk through the list of plugged BIOs to check if they can be merged
+	 * into the back of the request.
+	 */
+	blk_zone_wplug_lock(zwplug, flags);
+	while ((bio = bio_list_peek(&zwplug->bio_list))) {
+		if (bio->bi_iter.bi_sector != req_back_sector ||
+		    !blk_rq_merge_ok(req, bio))
+			break;
+
+		WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+			     !bio->__bi_nr_segments);
+
+		bio_list_pop(&zwplug->bio_list);
+		if (bio_attempt_back_merge(req, bio, bio->__bi_nr_segments) !=
+		    BIO_MERGE_OK) {
+			bio_list_add_head(&zwplug->bio_list, bio);
+			break;
+		}
+
+		/*
+		 * Drop the extra reference on the queue usage we got when
+		 * plugging the BIO.
+		 */
+		blk_queue_exit(q);
+
+		req_back_sector += bio_sectors(bio);
+	}
+	blk_zone_wplug_unlock(zwplug, flags);
+}
+
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+{
+	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
+
+	/*
+	 * BIOs must be fully contained within a zone so that we use the correct
+	 * zone write plug for the entire BIO. For blk-mq devices, the block
+	 * layer should already have done any splitting required to ensure this
+	 * and this BIO should thus not be straddling zone boundaries. For
+	 * BIO-based devices, it is the responsibility of the driver to split
+	 * the bio before submitting it.
+	 */
+	if (WARN_ON_ONCE(bio_straddle_zones(bio))) {
+		bio_io_error(bio);
+		return true;
+	}
+
+	zwplug = bio_lookup_zone_wplug(bio);
+	if (!zwplug)
+		return false;
+
+	blk_zone_wplug_lock(zwplug, flags);
+
+	/* Indicate that this BIO is being handled using zone write plugging. */
+	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+
+	/*
+	 * If the zone is already plugged, add the BIO to the plug BIO list.
+	 * Otherwise, plug and let the BIO execute.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
+		blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
+		blk_zone_wplug_unlock(zwplug, flags);
+		return true;
+	}
+
+	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+
+	blk_zone_wplug_unlock(zwplug, flags);
+
+	return false;
+}
+
+/**
+ * blk_zone_write_plug_bio - Handle a zone write BIO with zone write plugging
+ * @bio: The BIO being submitted
+ *
+ * Handle write and write zeroes operations using zone write plugging.
+ * Return true whenever @bio execution needs to be delayed through the zone
+ * write plug. Otherwise, return false to let the submission path process
+ * @bio normally.
+ */
+bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
+{
+	if (!bio->bi_bdev->bd_disk->zone_wplugs)
+		return false;
+
+	/*
+	 * If the BIO already has the plugging flag set, then it was already
+	 * handled through this path and this is a submission from the zone
+	 * plug bio submit work.
+	 */
+	if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
+		return false;
+
+	/*
+	 * We do not need to do anything special for empty flush BIOs, e.g.
+	 * BIOs such as issued by blkdev_issue_flush(). This is because it is
+	 * the responsibility of the user to first wait for the completion of
+	 * write operations for flush to have any effect on the persistence of
+	 * the written data.
+	 */
+	if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
+		return false;
+
+	/*
+	 * Regular writes and write zeroes need to be handled through the target
+	 * zone write plug. This includes writes with REQ_FUA | REQ_PREFLUSH
+	 * which may need to go through the flush machinery depending on the
+	 * target device capabilities. Plugging such writes is fine as the flush
+	 * machinery operates at the request level, below the plug, and
+	 * completion of the flush sequence will go through the regular BIO
+	 * completion, which will handle zone write plugging.
+	 */
+	switch (bio_op(bio)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+		return blk_zone_wplug_handle_write(bio, nr_segs);
+	default:
+		return false;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(blk_zone_write_plug_bio);
+
+static void blk_zone_write_plug_unplug_bio(struct blk_zone_wplug *zwplug)
+{
+	unsigned long flags;
+
+	blk_zone_wplug_lock(zwplug, flags);
+
+	/* Schedule submission of the next plugged BIO if we have one. */
+	if (!bio_list_empty(&zwplug->bio_list))
+		kblockd_schedule_work(&zwplug->bio_work);
+	else
+		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+
+	blk_zone_wplug_unlock(zwplug, flags);
+}
+
+void blk_zone_write_plug_bio_endio(struct bio *bio)
+{
+	/* Make sure we do not see this BIO again by clearing the plug flag. */
+	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+
+	/*
+	 * For BIO-based devices, blk_zone_write_plug_complete_request()
+	 * is not called. So we need to schedule execution of the next
+	 * plugged BIO here.
+	 */
+	if (bio->bi_bdev->bd_has_submit_bio) {
+		struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
+
+		blk_zone_write_plug_unplug_bio(zwplug);
+	}
+}
+
+void blk_zone_write_plug_complete_request(struct request *req)
+{
+	struct gendisk *disk = req->q->disk;
+	struct blk_zone_wplug *zwplug =
+		disk_lookup_zone_wplug(disk, req->__sector);
+
+	req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;
+
+	blk_zone_write_plug_unplug_bio(zwplug);
+}
+
+static void blk_zone_wplug_bio_work(struct work_struct *work)
+{
+	struct blk_zone_wplug *zwplug =
+		container_of(work, struct blk_zone_wplug, bio_work);
+	unsigned long flags;
+	struct bio *bio;
+
+	/*
+	 * Unplug and submit the next plugged BIO. If we do not have any, clear
+	 * the plugged flag.
+	 */
+	blk_zone_wplug_lock(zwplug, flags);
+
+	bio = bio_list_pop(&zwplug->bio_list);
+	if (!bio) {
+		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+		blk_zone_wplug_unlock(zwplug, flags);
+		return;
+	}
+
+	blk_zone_wplug_unlock(zwplug, flags);
+
+	/*
+	 * blk-mq devices will reuse the reference on the request queue usage
+	 * we took when the BIO was plugged, but the submission path for
+	 * BIO-based devices will not do that. So drop this reference here.
+	 */
+	if (bio->bi_bdev->bd_has_submit_bio)
+		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
+
+	submit_bio_noacct_nocheck(bio);
+}
+
+static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
+{
+	struct blk_zone_wplug *zwplugs;
+	unsigned int i;
+
+	zwplugs = kvcalloc(nr_zones, sizeof(struct blk_zone_wplug), GFP_NOIO);
+	if (!zwplugs)
+		return NULL;
+
+	for (i = 0; i < nr_zones; i++) {
+		spin_lock_init(&zwplugs[i].lock);
+		bio_list_init(&zwplugs[i].bio_list);
+		INIT_WORK(&zwplugs[i].bio_work, blk_zone_wplug_bio_work);
+	}
+
+	return zwplugs;
+}
+
+static void blk_zone_free_write_plugs(struct gendisk *disk,
+				      struct blk_zone_wplug *zwplugs,
+				      unsigned int nr_zones)
+{
+	struct blk_zone_wplug *zwplug = zwplugs;
+	unsigned int i, n;
+
+	if (!zwplug)
+		return;
+
+	/* Make sure we do not leak any plugged BIO. */
+	for (i = 0; i < nr_zones; i++, zwplug++) {
+		n = blk_zone_wplug_abort(disk, zwplug);
+		if (n)
+			pr_warn_ratelimited("%s: zone %u, %u plugged BIOs aborted\n",
+					    disk->disk_name, i, n);
+	}
+
+	kvfree(zwplugs);
+}
+
+void disk_free_zone_resources(struct gendisk *disk)
 {
 	kfree(disk->conv_zones_bitmap);
 	disk->conv_zones_bitmap = NULL;
 	kfree(disk->seq_zones_wlock);
 	disk->seq_zones_wlock = NULL;
+
+	blk_zone_free_write_plugs(disk, disk->zone_wplugs, disk->nr_zones);
+	disk->zone_wplugs = NULL;
 }
 
 struct blk_revalidate_zone_args {
@@ -442,6 +826,7 @@ struct blk_revalidate_zone_args {
 	unsigned long	*conv_zones_bitmap;
 	unsigned long	*seq_zones_wlock;
 	unsigned int	nr_zones;
+	struct blk_zone_wplug *zone_wplugs;
 	sector_t	sector;
 };
 
@@ -496,6 +881,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 				return -ENOMEM;
 		}
 		set_bit(idx, args->conv_zones_bitmap);
+		args->zone_wplugs[idx].flags |= BLK_ZONE_WPLUG_CONV;
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
 		if (!args->seq_zones_wlock) {
@@ -540,7 +926,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	sector_t capacity = get_capacity(disk);
 	struct blk_revalidate_zone_args args = { };
 	unsigned int noio_flag;
-	int ret;
+	int ret = -ENOMEM;
 
 	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
 		return -EIO;
@@ -570,9 +956,14 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	 * Ensure that all memory allocations in this context are done as if
 	 * GFP_NOIO was specified.
 	 */
+	noio_flag = memalloc_noio_save();
+
 	args.disk = disk;
 	args.nr_zones = (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
-	noio_flag = memalloc_noio_save();
+	args.zone_wplugs = blk_zone_alloc_write_plugs(args.nr_zones);
+	if (!args.zone_wplugs)
+		goto out_restore_noio;
+
 	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
 				       blk_revalidate_zone_cb, &args);
 	if (!ret) {
@@ -601,17 +992,24 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 		disk->nr_zones = args.nr_zones;
 		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
+		swap(disk->zone_wplugs, args.zone_wplugs);
 		if (update_driver_data)
 			update_driver_data(disk);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
-		disk_free_zone_bitmaps(disk);
+		disk_free_zone_resources(disk);
 	}
 	blk_mq_unfreeze_queue(q);
 
 	kfree(args.seq_zones_wlock);
 	kfree(args.conv_zones_bitmap);
+	blk_zone_free_write_plugs(disk, args.zone_wplugs, args.nr_zones);
+
+	return ret;
+
+out_restore_noio:
+	memalloc_noio_restore(noio_flag);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
diff --git a/block/blk.h b/block/blk.h
index 5180e103ed9c..d0ecd5a2002c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -403,7 +403,13 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
 }
 
 #ifdef CONFIG_BLK_DEV_ZONED
-void disk_free_zone_bitmaps(struct gendisk *disk);
+void disk_free_zone_resources(struct gendisk *disk);
+static inline bool bio_zone_write_plugging(struct bio *bio)
+{
+	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
+}
+void blk_zone_write_plug_bio_merged(struct bio *bio);
+void blk_zone_write_plug_attempt_merge(struct request *rq);
 static inline void blk_zone_complete_request_bio(struct request *rq,
 						 struct bio *bio)
 {
@@ -411,22 +417,42 @@ static inline void blk_zone_complete_request_bio(struct request *rq,
 	 * For zone append requests, the request sector indicates the location
 	 * at which the BIO data was written. Return this value to the BIO
 	 * issuer through the BIO iter sector.
+	 * For plugged zone writes, we need the original BIO sector so
+	 * that blk_zone_write_plug_bio_endio() can lookup the zone write plug.
 	 */
-	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
 		bio->bi_iter.bi_sector = rq->__sector;
 }
+void blk_zone_write_plug_bio_endio(struct bio *bio);
+void blk_zone_write_plug_complete_request(struct request *rq);
 int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 		unsigned long arg);
 int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		unsigned int cmd, unsigned long arg);
 #else /* CONFIG_BLK_DEV_ZONED */
-static inline void disk_free_zone_bitmaps(struct gendisk *disk)
+static inline void disk_free_zone_resources(struct gendisk *disk)
+{
+}
+static inline bool bio_zone_write_plugging(struct bio *bio)
+{
+	return false;
+}
+static inline void blk_zone_write_plug_bio_merged(struct bio *bio)
+{
+}
+static inline void blk_zone_write_plug_attempt_merge(struct request *rq)
 {
 }
 static inline void blk_zone_complete_request_bio(struct request *rq,
 						 struct bio *bio)
 {
 }
+static inline void blk_zone_write_plug_bio_endio(struct bio *bio)
+{
+}
+static inline void blk_zone_write_plug_complete_request(struct request *rq)
+{
+}
 static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
 		unsigned int cmd, unsigned long arg)
 {
diff --git a/block/genhd.c b/block/genhd.c
index d74fb5b4ae68..fe45d4713b28 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1182,7 +1182,7 @@ static void disk_release(struct device *dev)
 
 	disk_release_events(disk);
 	kfree(disk->random);
-	disk_free_zone_bitmaps(disk);
+	disk_free_zone_resources(disk);
 	xa_destroy(&disk->part_tbl);
 
 	disk->queue->disk = NULL;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7a8150a5f051..bc74f904b5a1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -55,6 +55,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
+/* The request completion needs to be signaled to zone write plugging. */
+#define RQF_ZONE_WRITE_PLUGGING	((__force req_flags_t)(1 << 20))
 /* ->timeout has been called, don't expire again */
 #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
 #define RQF_RESV		((__force req_flags_t)(1 << 23))
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1c07848dea7e..19839d303289 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -232,7 +232,12 @@ struct bio {
 
 	struct bvec_iter	bi_iter;
 
-	blk_qc_t		bi_cookie;
+	union {
+		/* for polled bios: */
+		blk_qc_t		bi_cookie;
+		/* for plugged zoned writes only: */
+		unsigned int		__bi_nr_segments;
+	};
 	bio_end_io_t		*bi_end_io;
 	void			*bi_private;
 #ifdef CONFIG_BLK_CGROUP
@@ -303,6 +308,7 @@ enum {
 	BIO_QOS_MERGED,		/* but went through rq_qos merge path */
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
+	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
 	BIO_FLAG_LAST
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0bb897f0501c..d58aaed6dc24 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -39,6 +39,7 @@ struct rq_qos;
 struct blk_queue_stats;
 struct blk_stat_callback;
 struct blk_crypto_profile;
+struct blk_zone_wplug;
 
 extern const struct device_type disk_type;
 extern const struct device_type part_type;
@@ -193,6 +194,7 @@ struct gendisk {
 	unsigned int		max_active_zones;
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
+	struct blk_zone_wplug	*zone_wplugs;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 #if IS_ENABLED(CONFIG_CDROM)
@@ -658,6 +660,7 @@ static inline unsigned int bdev_max_active_zones(struct block_device *bdev)
 	return bdev->bd_disk->max_active_zones;
 }
 
+bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline unsigned int bdev_nr_zones(struct block_device *bdev)
 {
@@ -685,6 +688,11 @@ static inline unsigned int bdev_max_active_zones(struct block_device *bdev)
 {
 	return 0;
 }
+static inline bool blk_zone_write_plug_bio(struct bio *bio,
+					   unsigned int nr_segs)
+{
+	return false;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline unsigned int blk_queue_depth(struct request_queue *q)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 07/26] block: Allow zero value of max_zone_append_sectors queue limit
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (5 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 06/26] block: Introduce zone write plugging Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:15   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 08/26] block: Implement zone append emulation Damien Le Moal
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

In preparation for adding a generic zone append emulation using zone
write plugging, allow device drivers supporting zoned block device to
set the max_zone_append_sectors queue limit of a device to 0 to
indicate the lack of native support for zone append operations and that
the block layer should emulate these operations using regular write
operations.

blk_queue_max_zone_append_sectors() is modified to allow passing 0 as
the max_zone_append_sectors argument. The function
queue_max_zone_append_sectors() is also modified to ensure that the
minimum of the max_sectors and chunk_sectors limit is used whenever the
max_zone_append_sectors limit is 0.

The helper functions queue_emulates_zone_append() and
bdev_emulates_zone_append() are added to test if a queue (or block
device) emulates zone append operations.
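
To make the intended semantics concrete, here is a minimal userspace C
model of the new accessor and predicate (a sketch only: the struct
below is a stand-in for the few queue_limits fields involved, not the
kernel definition, and the zoned-queue check is omitted):

  #include <stdio.h>
  #include <stdbool.h>

  struct limits {
      unsigned int max_sectors;
      unsigned int chunk_sectors;            /* zone size */
      unsigned int max_zone_append_sectors;  /* 0 means "emulated" */
  };

  static unsigned int min_u(unsigned int a, unsigned int b)
  {
      return a < b ? a : b;
  }

  /* Like the kernel's min_not_zero(): a zero value is ignored. */
  static unsigned int min_not_zero(unsigned int a, unsigned int b)
  {
      return a ? (b ? min_u(a, b) : a) : b;
  }

  /* Mirrors queue_limits_max_zone_append_sectors(). */
  static unsigned int eff_zone_append_sectors(const struct limits *l)
  {
      unsigned int max_sectors = min_u(l->chunk_sectors, l->max_sectors);

      return min_not_zero(l->max_zone_append_sectors, max_sectors);
  }

  /* Mirrors queue_emulates_zone_append(), minus blk_queue_is_zoned(). */
  static bool emulates_zone_append(const struct limits *l)
  {
      return !l->max_zone_append_sectors;
  }

  int main(void)
  {
      struct limits native   = { 2048, 524288, 1024 };
      struct limits emulated = { 2048, 524288, 0 };

      /* native: 1024; emulated: 0 falls back to min(2048, 524288) */
      printf("native:   limit %u, emulates %d\n",
             eff_zone_append_sectors(&native),
             emulates_zone_append(&native));
      printf("emulated: limit %u, emulates %d\n",
             eff_zone_append_sectors(&emulated),
             emulates_zone_append(&emulated));
      return 0;
  }

Either way, the issuer sees a non-zero zone append limit, which is what
blk_revalidate_disk_zones() and the sysfs attribute now report.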

In order for blk_revalidate_disk_zones() to accept zoned block devices
relying on zone append emulation, the direct check to the
max_zone_append_sectors queue limit of the disk is replaced by a check
using the value returned by queue_max_zone_append_sectors(). Similarly,
queue_zone_append_max_show() is modified to use the same accessor so
that the sysfs attribute advertises the non-zero limit that will be
used, regardless of whether it is for native or emulated commands.

For stacking drivers, a top device should not need to care if the
underlying devices have native or emulated zone append operations.
blk_stack_limits() is thus modified to set the top device
max_zone_append_sectors limit using the new accessor
queue_limits_max_zone_append_sectors(). queue_max_zone_append_sectors()
is modified to use this function as well. Stacking drivers that require
zone append emulation, e.g. dm-crypt, can still request this feature by
calling blk_queue_max_zone_append_sectors() with a 0 limit.
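
As a standalone sketch of that stacking rule (same modeling assumptions
as above, with hypothetical example values): the top limit becomes the
minimum of the effective per-device values, so stacking a native bottom
device never leaves the top device with a zero limit:

  #include <stdio.h>

  struct limits {
      unsigned int max_sectors, chunk_sectors, max_zone_append_sectors;
  };

  /* Effective limit, as in queue_limits_max_zone_append_sectors(). */
  static unsigned int eff(const struct limits *l)
  {
      unsigned int m = l->chunk_sectors < l->max_sectors ?
                       l->chunk_sectors : l->max_sectors;

      if (!l->max_zone_append_sectors)
          return m;  /* emulation: fall back to the write path limit */
      return l->max_zone_append_sectors < m ?
             l->max_zone_append_sectors : m;
  }

  int main(void)
  {
      struct limits t = { 4096, 524288, 0 };     /* top, nothing stacked */
      struct limits b = { 2048, 524288, 1024 };  /* bottom, native append */
      unsigned int et = eff(&t), eb = eff(&b);

      /* Mirrors the blk_stack_limits() change: min of effective values. */
      t.max_zone_append_sectors = et < eb ? et : eb;
      printf("stacked max_zone_append_sectors = %u\n",
             t.max_zone_append_sectors);
      return 0;
  }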

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-core.c       |  2 +-
 block/blk-settings.c   | 35 +++++++++++++++++++++++------------
 block/blk-sysfs.c      |  2 +-
 block/blk-zoned.c      |  2 +-
 include/linux/blkdev.h | 20 +++++++++++++++++---
 5 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 71c6614a97fe..3945cfcc4d9b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -590,7 +590,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_IOERR;
 
 	/* Make sure the BIO is small enough and will not get split */
-	if (nr_sectors > q->limits.max_zone_append_sectors)
+	if (nr_sectors > queue_max_zone_append_sectors(q))
 		return BLK_STS_IOERR;
 
 	bio->bi_opf |= REQ_NOMERGE;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 06ea91e51b8b..f00bcb595444 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -211,24 +211,32 @@ EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
  * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
  * @q:  the request queue for the device
  * @max_zone_append_sectors: maximum number of sectors to write per command
+ *
+ * Sets the maximum number of sectors allowed for zone append commands.
+ * Specifying 0 for @max_zone_append_sectors indicates that the queue does
+ * not natively support zone append operations and that the block layer must
+ * emulate these operations using regular writes.
  **/
 void blk_queue_max_zone_append_sectors(struct request_queue *q,
 		unsigned int max_zone_append_sectors)
 {
-	unsigned int max_sectors;
+	unsigned int max_sectors = 0;
 
 	if (WARN_ON(!blk_queue_is_zoned(q)))
 		return;
 
-	max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors);
-	max_sectors = min(q->limits.chunk_sectors, max_sectors);
-
-	/*
-	 * Signal eventual driver bugs resulting in the max_zone_append sectors limit
-	 * being 0 due to a 0 argument, the chunk_sectors limit (zone size) not set,
-	 * or the max_hw_sectors limit not set.
-	 */
-	WARN_ON(!max_sectors);
+	if (max_zone_append_sectors) {
+		max_sectors = min(q->limits.max_hw_sectors,
+				  max_zone_append_sectors);
+		max_sectors = min(q->limits.chunk_sectors, max_sectors);
+
+		/*
+		 * Signal possible driver bugs resulting in the max_zone_append
+		 * sectors limit being 0 due to the chunk_sectors limit (zone
+		 * size) not set or the max_hw_sectors limit not set.
+		 */
+		WARN_ON(!max_sectors);
+	}
 
 	q->limits.max_zone_append_sectors = max_sectors;
 }
@@ -563,8 +571,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
-	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
-					b->max_zone_append_sectors);
+	t->max_zone_append_sectors = min(queue_limits_max_zone_append_sectors(t),
+					 queue_limits_max_zone_append_sectors(b));
 	t->bounce = max(t->bounce, b->bounce);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
@@ -689,6 +697,9 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->zone_write_granularity = max(t->zone_write_granularity,
 					b->zone_write_granularity);
 	t->zoned = max(t->zoned, b->zoned);
+	if (!t->zoned)
+		t->max_zone_append_sectors = 0;
+
 	return ret;
 }
 EXPORT_SYMBOL(blk_stack_limits);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6b2429cad81a..89c41dcd8bc4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -218,7 +218,7 @@ static ssize_t queue_zone_write_granularity_show(struct request_queue *q,
 
 static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
 {
-	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+	unsigned long long max_sectors = queue_max_zone_append_sectors(q);
 
 	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
 }
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index f6d4f511b664..661ef61ca3b1 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -946,7 +946,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 		return -ENODEV;
 	}
 
-	if (!q->limits.max_zone_append_sectors) {
+	if (!queue_max_zone_append_sectors(q)) {
 		pr_warn("%s: Invalid 0 maximum zone append limit\n",
 			disk->disk_name);
 		return -ENODEV;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d58aaed6dc24..87fba5af34ba 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1137,12 +1137,26 @@ static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
-static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+static inline unsigned int queue_limits_max_zone_append_sectors(struct queue_limits *l)
 {
+	unsigned int max_sectors = min(l->chunk_sectors, l->max_sectors);
 
-	const struct queue_limits *l = &q->limits;
+	return min_not_zero(l->max_zone_append_sectors, max_sectors);
+}
+
+static inline unsigned int queue_max_zone_append_sectors(struct request_queue *q)
+{
+	return queue_limits_max_zone_append_sectors(&q->limits);
+}
 
-	return min(l->max_zone_append_sectors, l->max_sectors);
+static inline bool queue_emulates_zone_append(struct request_queue *q)
+{
+	return blk_queue_is_zoned(q) && !q->limits.max_zone_append_sectors;
+}
+
+static inline bool bdev_emulates_zone_append(struct block_device *bdev)
+{
+	return queue_emulates_zone_append(bdev_get_queue(bdev));
 }
 
 static inline unsigned int
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 08/26] block: Implement zone append emulation
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (6 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 07/26] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:24   ` Hannes Reinecke
  2024-02-05 17:58   ` Bart Van Assche
  2024-02-02  7:30 ` [PATCH 09/26] block: Allow BIO-based drivers to use blk_revalidate_disk_zones() Damien Le Moal
                   ` (20 subsequent siblings)
  28 siblings, 2 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Given that zone write plugging manages all writes to zones of a zoned
block device, we can track the write pointer position of all zones in
order to implement zone append emulation using regular write operations.
This is needed for devices that do not natively support the zone append
command, e.g. SMR hard-disks.

This commit adds zone write pointer tracking similarly to how the SCSI
disk driver (sd) does, that is, in the form of a 32-bit number of
sectors equal to the offset within the zone of the zone write pointer.
The wp_offset field is added to struct blk_zone_wplug for this. Write
pointer tracking is only enabled for zoned devices that requested
zone append emulation by setting the max_zone_append_sectors queue
limit of the disk to 0.
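
As a reference for how that offset is derived from a reported zone,
here is a minimal userspace model of the blk_zone_wp_offset() helper
added below (the enum is an illustrative stand-in for the UAPI zone
conditions):

  #include <stdio.h>
  #include <limits.h>

  enum zcond { Z_EMPTY, Z_IMP_OPEN, Z_EXP_OPEN, Z_CLOSED, Z_FULL, Z_OTHER };

  struct zone {
      unsigned long long start, wp, len;  /* sectors */
      enum zcond cond;
  };

  /* Write pointer offset within the zone, in sectors. */
  static unsigned int zone_wp_offset(const struct zone *z)
  {
      switch (z->cond) {
      case Z_IMP_OPEN:
      case Z_EXP_OPEN:
      case Z_CLOSED:
          return z->wp - z->start;
      case Z_FULL:
          return z->len;
      case Z_EMPTY:
          return 0;
      default:
          /* Conventional, offline, read-only: no valid write pointer. */
          return UINT_MAX;
      }
  }

  int main(void)
  {
      struct zone z = { .start = 524288, .wp = 524288 + 4096,
                        .len = 524288, .cond = Z_IMP_OPEN };

      printf("wp_offset = %u sectors\n", zone_wp_offset(&z));  /* 4096 */
      return 0;
  }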

For zoned devices that requested zone append emulation, wp_offset is
managed as follows:
 - It is incremented when a write BIO is prepared for submission or
   merged into a new request. This is done in
   blk_zone_wplug_prepare_bio() when a BIO is unplugged, in
   blk_zone_write_plug_bio_merged() when a new unplugged BIO is merged
   before zone write plugging and in blk_zone_write_plug_attempt_merge()
   when plugged BIOs are merged into a new request.
 - The helper functions blk_zone_handle_reset() and
   blk_zone_handle_reset_all() are added to set the write pointer
   offset to 0 for the targeted zones of REQ_OP_ZONE_RESET and
   REQ_OP_ZONE_RESETALL operations.
 - The helper function blk_zone_handle_finish() is added to set the
   write pointer offset to the zone size for the target zone of a
   REQ_OP_ZONE_FINISH operation.

The function blk_zone_wplug_prepare_bio() also checks and prepares a BIO
for submission. Preparation involves changing zone append BIOs into
non-mergeable regular write BIOs for devices that require zone append
emulation. Modified zone append BIOs are flagged with the new BIO flag
BIO_EMULATES_ZONE_APPEND. This flag is checked on completion of the
BIO in blk_zone_complete_request_bio() to restore the original
REQ_OP_ZONE_APPEND operation code of the BIO.
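
The effect of that conversion can be modeled in a few lines of
userspace C (a sketch with hypothetical names: the plug struct stands
in for struct blk_zone_wplug and only the sector arithmetic is shown):

  #include <stdio.h>

  struct zwplug {
      unsigned long long zone_start;  /* zone start sector */
      unsigned int wp_offset;         /* write pointer offset in the zone */
      unsigned int capacity;          /* zone capacity in sectors */
  };

  /*
   * Turn a zone append of nr_sectors into a regular write: return the
   * absolute sector the write is issued at (which is also what the
   * issuer sees on completion), or -1 if the zone is already full.
   */
  static long long emulate_zone_append(struct zwplug *zw,
                                       unsigned int nr_sectors)
  {
      long long sector;

      if (zw->wp_offset >= zw->capacity)
          return -1;  /* would spill into the next zone */

      sector = zw->zone_start + zw->wp_offset;
      zw->wp_offset += nr_sectors;  /* advance, as done at BIO preparation */
      return sector;
  }

  int main(void)
  {
      struct zwplug zw = { .zone_start = 524288, .wp_offset = 0,
                           .capacity = 524288 };

      printf("append #1 lands at sector %lld\n",
             emulate_zone_append(&zw, 8));
      printf("append #2 lands at sector %lld\n",
             emulate_zone_append(&zw, 8));
      return 0;
  }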

If a write error happens, the wp_offset value may become incorrect and
out of sync with the device managed write pointer. This is handled using
the new zone write plug flag BLK_ZONE_WPLUG_ERROR. The function
blk_zone_wplug_handle_error() is called from the new disk zone write
plug work when this flag is set. This function executes a report zone to
update the zone write pointer offset to the current value as indicated
by the device. The disk zone write plug work is scheduled whenever a BIO
flagged with BIO_ZONE_WRITE_PLUGGING completes with an error or when
blk_zone_wplug_prepare_bio() detects an unaligned write. Once scheduled,
the disk zone write plug work keeps running until all zone errors are
handled.
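
The recovery step itself amounts to re-reading the zone and trusting
the device-reported write pointer over the cached one; a userspace
sketch of that resynchronization (the fake report_zone() below stands
in for the driver report_zones method and returns made-up values):

  #include <stdio.h>

  struct zwplug {
      unsigned int wp_offset;  /* cached, possibly stale after an error */
      unsigned int capacity;
      int error;
  };

  struct zinfo { unsigned int wp_offset, capacity; };

  /* Stand-in for ->report_zones(): pretend 96 sectors were written. */
  static int report_zone(struct zinfo *zi)
  {
      zi->wp_offset = 96;
      zi->capacity = 524288;
      return 1;  /* one zone reported */
  }

  /* Mirrors the resync part of blk_zone_wplug_handle_error(). */
  static void handle_error(struct zwplug *zw)
  {
      struct zinfo zi;

      if (!zw->error)
          return;
      zw->error = 0;

      if (report_zone(&zi) != 1)
          return;  /* kernel: abort all plugged BIOs in this case */

      zw->wp_offset = zi.wp_offset;
      zw->capacity = zi.capacity;
      /* Kernel: now abort plugged BIOs that no longer align with
       * wp_offset and restart submission of the remaining ones. */
  }

  int main(void)
  {
      struct zwplug zw = { .wp_offset = 128, .capacity = 524288,
                           .error = 1 };

      handle_error(&zw);
      printf("resynced wp_offset = %u\n", zw.wp_offset);  /* 96 */
      return 0;
  }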

The block layer internal inline helper function bio_is_zone_append() is
added to test if a BIO is either a native zone append operation
(REQ_OP_ZONE_APPEND operation code) or if it is flagged with
BIO_EMULATES_ZONE_APPEND. Given that both native and emulated zone
append BIO completion handling should be similar, the functions
blk_update_request() and blk_zone_complete_request_bio() are modified to
use bio_is_zone_append() to execute blk_zone_complete_request_bio() for
both native and emulated zone append operations.

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq.c            |   2 +-
 block/blk-zoned.c         | 457 ++++++++++++++++++++++++++++++++++++--
 block/blk.h               |  14 +-
 include/linux/blk_types.h |   1 +
 include/linux/blkdev.h    |   3 +
 5 files changed, 452 insertions(+), 25 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index aa49bebf1199..a112298a6541 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -909,7 +909,7 @@ bool blk_update_request(struct request *req, blk_status_t error,
 
 		if (bio_bytes == bio->bi_iter.bi_size) {
 			req->bio = bio->bi_next;
-		} else if (req_op(req) == REQ_OP_ZONE_APPEND) {
+		} else if (bio_is_zone_append(bio)) {
 			/*
 			 * Partial zone append completions cannot be supported
 			 * as the BIO fragments may end up not being written
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 661ef61ca3b1..929c28796c41 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -42,6 +42,8 @@ struct blk_zone_wplug {
 	unsigned int		flags;
 	struct bio_list		bio_list;
 	struct work_struct	bio_work;
+	unsigned int		wp_offset;
+	unsigned int		capacity;
 };
 
 /*
@@ -51,9 +53,12 @@ struct blk_zone_wplug {
  *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
  *    that is, that write BIOs are being throttled due to a write BIO already
  *    being executed or the zone write plug bio list is not empty.
+ *  - BLK_ZONE_WPLUG_ERROR: Indicate that a write error happened which will be
+ *    recovered with a report zone to update the zone write pointer offset.
  */
 #define BLK_ZONE_WPLUG_CONV	(1U << 0)
 #define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
+#define BLK_ZONE_WPLUG_ERROR	(1U << 2)
 
 /**
  * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
@@ -480,6 +485,28 @@ static int blk_zone_wplug_abort(struct gendisk *disk,
 	return nr_aborted;
 }
 
+static void blk_zone_wplug_abort_unaligned(struct gendisk *disk,
+					   struct blk_zone_wplug *zwplug)
+{
+	unsigned int wp_offset = zwplug->wp_offset;
+	struct bio_list bl = BIO_EMPTY_LIST;
+	struct bio *bio;
+
+	while ((bio = bio_list_pop(&zwplug->bio_list))) {
+		if (wp_offset >= zwplug->capacity ||
+		    (bio_op(bio) != REQ_OP_ZONE_APPEND &&
+		     bio_offset_from_zone_start(bio) != wp_offset)) {
+			blk_zone_wplug_bio_io_error(bio);
+			continue;
+		}
+
+		wp_offset += bio_sectors(bio);
+		bio_list_add(&bl, bio);
+	}
+
+	bio_list_merge(&zwplug->bio_list, &bl);
+}
+
 /*
  * Return the zone write plug for sector in sequential write required zone.
  * Given that conventional zones have no write ordering constraints, NULL is
@@ -506,6 +533,87 @@ static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
 				      bio->bi_iter.bi_sector);
 }
 
+/*
+ * Set a zone write plug write pointer offset to either 0 (zone reset case)
+ * or to the zone size (zone finish case). This aborts all plugged BIOs,
+ * which is fine since issuing a zone reset or finish while writes are
+ * in-flight is a user error that will most likely cause the plugged BIOs
+ * to fail anyway.
+ */
+static void blk_zone_wplug_set_wp_offset(struct gendisk *disk,
+					 struct blk_zone_wplug *zwplug,
+					 unsigned int wp_offset)
+{
+	/*
+	 * Updating the write pointer offset puts back the zone
+	 * in a good state. So clear the error flag and decrement the
+	 * error count if we were in error state.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
+		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
+		atomic_dec(&disk->zone_nr_wplugs_with_error);
+	}
+
+	/* Update the zone write pointer and abort all plugged BIOs. */
+	zwplug->wp_offset = wp_offset;
+	blk_zone_wplug_abort(disk, zwplug);
+}
+
+static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
+						  unsigned int wp_offset)
+{
+	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
+	unsigned long flags;
+
+	/* Conventional zones cannot be reset nor finished. */
+	if (!zwplug) {
+		bio_io_error(bio);
+		return true;
+	}
+
+	if (!bdev_emulates_zone_append(bio->bi_bdev))
+		return false;
+
+	/*
+	 * Set the zone write pointer offset to 0 (reset case) or to the
+	 * zone size (finish case). This will abort all BIOs plugged for the
+	 * target zone. It is fine as resetting or finishing zones while writes
+	 * are still in-flight will result in the writes failing anyway.
+	 */
+	blk_zone_wplug_lock(zwplug, flags);
+	blk_zone_wplug_set_wp_offset(bio->bi_bdev->bd_disk, zwplug, wp_offset);
+	blk_zone_wplug_unlock(zwplug, flags);
+
+	return false;
+}
+
+static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct blk_zone_wplug *zwplug = &disk->zone_wplugs[0];
+	unsigned long flags;
+	unsigned int i;
+
+	if (!bdev_emulates_zone_append(bio->bi_bdev))
+		return false;
+
+	/*
+	 * Set the write pointer offset of all zones to 0. This will abort all
+	 * plugged BIOs. It is fine as resetting zones while writes are still
+	 * in-flight will result in the writes failing anyway.
+	 */
+	for (i = 0; i < disk->nr_zones; i++, zwplug++) {
+		/* Ignore conventional zones. */
+		if (zwplug->flags & BLK_ZONE_WPLUG_CONV)
+			continue;
+		blk_zone_wplug_lock(zwplug, flags);
+		blk_zone_wplug_set_wp_offset(disk, zwplug, 0);
+		blk_zone_wplug_unlock(zwplug, flags);
+	}
+
+	return false;
+}
+
 static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
 					  struct bio *bio, unsigned int nr_segs)
 {
@@ -543,7 +651,26 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
  */
 void blk_zone_write_plug_bio_merged(struct bio *bio)
 {
+	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
+	unsigned long flags;
+
+	/*
+	 * If the BIO was already plugged, then we were called through
+	 * blk_zone_write_plug_attempt_merge() -> blk_attempt_bio_merge().
+	 * For this case, blk_zone_write_plug_attempt_merge() will handle the
+	 * zone write pointer offset update.
+	 */
+	if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
+		return;
+
+	blk_zone_wplug_lock(zwplug, flags);
+
 	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+
+	/* Advance the zone write pointer offset. */
+	zwplug->wp_offset += bio_sectors(bio);
+
+	blk_zone_wplug_unlock(zwplug, flags);
 }
 
 /*
@@ -572,7 +699,8 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 	 * into the back of the request.
 	 */
 	blk_zone_wplug_lock(zwplug, flags);
-	while ((bio = bio_list_peek(&zwplug->bio_list))) {
+	while (zwplug->wp_offset < zwplug->capacity &&
+	       (bio = bio_list_peek(&zwplug->bio_list))) {
 		if (bio->bi_iter.bi_sector != req_back_sector ||
 		    !blk_rq_merge_ok(req, bio))
 			break;
@@ -589,15 +717,86 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 
 		/*
 		 * Drop the extra reference on the queue usage we got when
-		 * plugging the BIO.
+		 * plugging the BIO and advance the write pointer offset.
 		 */
 		blk_queue_exit(q);
+		zwplug->wp_offset += bio_sectors(bio);
 
 		req_back_sector += bio_sectors(bio);
 	}
 	blk_zone_wplug_unlock(zwplug, flags);
 }
 
+static inline void blk_zone_wplug_set_error(struct gendisk *disk,
+					    struct blk_zone_wplug *zwplug)
+{
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR)) {
+		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
+		atomic_inc(&disk->zone_nr_wplugs_with_error);
+	}
+}
+
+/*
+ * Prepare a zone write bio for submission by incrementing the write pointer and
+ * setting up the zone append emulation if needed.
+ */
+static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
+				       struct bio *bio)
+{
+	/*
+	 * If we do not need to emulate zone append, zone write pointer offset
+	 * tracking is not necessary and we have nothing to do.
+	 */
+	if (!bdev_emulates_zone_append(bio->bi_bdev))
+		return true;
+
+	/*
+	 * Check that the user is not attempting to write to a full zone.
+	 * We know such BIO will fail, and that would potentially overflow our
+	 * write pointer offset, causing zone append BIOs for one zone to be
+	 * directed at the following zone.
+	 */
+	if (zwplug->wp_offset >= zwplug->capacity)
+		goto err;
+
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		/*
+		 * Use a regular write starting at the current write pointer.
+		 * Similarly to native zone append operations, do not allow
+		 * merging.
+		 */
+		bio->bi_opf &= ~REQ_OP_MASK;
+		bio->bi_opf |= REQ_OP_WRITE | REQ_NOMERGE;
+		bio->bi_iter.bi_sector += zwplug->wp_offset;
+
+		/*
+		 * Remember that this BIO is in fact a zone append operation
+		 * so that we can restore its operation code on completion.
+		 */
+		bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
+	} else {
+		/*
+		 * Check for non-sequential writes early because we avoid a
+		 * whole lot of error handling trouble if we don't send it off
+		 * to the driver.
+		 */
+		if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
+			goto err;
+	}
+
+	/* Advance the zone write pointer offset. */
+	zwplug->wp_offset += bio_sectors(bio);
+
+	return true;
+
+err:
+	/* We detected an invalid write BIO: schedule error recovery. */
+	blk_zone_wplug_set_error(bio->bi_bdev->bd_disk, zwplug);
+	kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND,
+				&bio->bi_bdev->bd_disk->zone_wplugs_work, 0);
+	return false;
+}
+
 static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 {
 	struct blk_zone_wplug *zwplug;
@@ -617,8 +816,17 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 	}
 
 	zwplug = bio_lookup_zone_wplug(bio);
-	if (!zwplug)
+	if (!zwplug) {
+		/*
+		 * Zone append operations to conventional zones are not
+		 * allowed.
+		 */
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			bio_io_error(bio);
+			return true;
+		}
 		return false;
+	}
 
 	blk_zone_wplug_lock(zwplug, flags);
 
@@ -626,34 +834,48 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
 
 	/*
-	 * If the zone is already plugged, add the BIO to the plug BIO list.
-	 * Otherwise, plug and let the BIO execute.
+	 * If the zone is already plugged or has a pending error, add the BIO
+	 * to the plug BIO list. Otherwise, plug and let the BIO execute.
 	 */
-	if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
-		blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
-		blk_zone_wplug_unlock(zwplug, flags);
-		return true;
-	}
+	if (zwplug->flags & (BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR))
+		goto plug;
+
+	/*
+	 * If an error is detected when preparing the BIO, add it to the BIO
+	 * list so that error recovery can deal with it.
+	 */
+	if (!blk_zone_wplug_prepare_bio(zwplug, bio))
+		goto plug;
 
 	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
 
 	blk_zone_wplug_unlock(zwplug, flags);
 
 	return false;
+
+plug:
+	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+	blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
+
+	blk_zone_wplug_unlock(zwplug, flags);
+
+	return true;
 }
 
 /**
  * blk_zone_write_plug_bio - Handle a zone write BIO with zone write plugging
  * @bio: The BIO being submitted
  *
- * Handle write and write zeroes operations using zone write plugging.
- * Return true whenever @bio execution needs to be delayed through the zone
- * write plug. Otherwise, return false to let the submission path process
- * @bio normally.
+ * Handle write, write zeroes and zone append operations requiring emulation
+ * using zone write plugging. Return true whenever @bio execution needs to be
+ * delayed through the zone write plug. Otherwise, return false to let the
+ * submission path process @bio normally.
  */
 bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
 {
-	if (!bio->bi_bdev->bd_disk->zone_wplugs)
+	struct block_device *bdev = bio->bi_bdev;
+
+	if (!bdev->bd_disk->zone_wplugs)
 		return false;
 
 	/*
@@ -682,11 +904,30 @@ bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
 	 * machinery operates at the request level, below the plug, and
 	 * completion of the flush sequence will go through the regular BIO
 	 * completion, which will handle zone write plugging.
+	 * Zone append operations that need emulation must also be plugged so
+	 * that these operations can be changed into regular writes.
+	 * Zone reset, reset all and finish commands need special treatment
+	 * to correctly track the write pointer offset of zones when zone
+	 * append emulation is needed. These commands are not plugged as we do
+	 * not need serialization with write and append operations. It is the
+	 * responsibility of the user to not issue reset and finish commands
+	 * when write operations are in flight.
 	 */
 	switch (bio_op(bio)) {
+	case REQ_OP_ZONE_APPEND:
+		if (!bdev_emulates_zone_append(bdev))
+			return false;
+		fallthrough;
 	case REQ_OP_WRITE:
 	case REQ_OP_WRITE_ZEROES:
 		return blk_zone_wplug_handle_write(bio, nr_segs);
+	case REQ_OP_ZONE_RESET:
+		return blk_zone_wplug_handle_reset_or_finish(bio, 0);
+	case REQ_OP_ZONE_FINISH:
+		return blk_zone_wplug_handle_reset_or_finish(bio,
+						bdev_zone_sectors(bdev));
+	case REQ_OP_ZONE_RESET_ALL:
+		return blk_zone_wplug_handle_reset_all(bio);
 	default:
 		return false;
 	}
@@ -695,12 +936,24 @@ bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
 }
 EXPORT_SYMBOL_GPL(blk_zone_write_plug_bio);
 
-static void blk_zone_write_plug_unplug_bio(struct blk_zone_wplug *zwplug)
+static void blk_zone_write_plug_unplug_bio(struct gendisk *disk,
+					   struct blk_zone_wplug *zwplug)
 {
 	unsigned long flags;
 
 	blk_zone_wplug_lock(zwplug, flags);
 
+	/*
+	 * If we had an error, schedule error recovery. The recovery work
+	 * will restart submission of plugged BIOs.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
+		blk_zone_wplug_unlock(zwplug, flags);
+		kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND,
+					    &disk->zone_wplugs_work, 0);
+		return;
+	}
+
 	/* Schedule submission of the next plugged BIO if we have one. */
 	if (!bio_list_empty(&zwplug->bio_list))
 		kblockd_schedule_work(&zwplug->bio_work);
@@ -712,19 +965,35 @@ static void blk_zone_write_plug_unplug_bio(struct blk_zone_wplug *zwplug)
 
 void blk_zone_write_plug_bio_endio(struct bio *bio)
 {
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
+
 	/* Make sure we do not see this BIO again by clearing the plug flag. */
 	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
 
+	/*
+	 * If this is a regular write emulating a zone append operation,
+	 * restore the original operation code.
+	 */
+	if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
+		bio->bi_opf &= ~REQ_OP_MASK;
+		bio->bi_opf |= REQ_OP_ZONE_APPEND;
+	}
+
+	/*
+	 * If the BIO failed, mark the plug as having an error to trigger
+	 * recovery.
+	 */
+	if (bio->bi_status != BLK_STS_OK)
+		blk_zone_wplug_set_error(disk, zwplug);
+
 	/*
 	 * For BIO-based devices, blk_zone_write_plug_complete_request()
 	 * is not called. So we need to schedule execution of the next
 	 * plugged BIO here.
 	 */
-	if (bio->bi_bdev->bd_has_submit_bio) {
-		struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
-
-		blk_zone_write_plug_unplug_bio(zwplug);
-	}
+	if (bio->bi_bdev->bd_has_submit_bio)
+		blk_zone_write_plug_unplug_bio(disk, zwplug);
 }
 
 void blk_zone_write_plug_complete_request(struct request *req)
@@ -735,7 +1004,7 @@ void blk_zone_write_plug_complete_request(struct request *req)
 
 	req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;
 
-	blk_zone_write_plug_unplug_bio(zwplug);
+	blk_zone_write_plug_unplug_bio(disk, zwplug);
 }
 
 static void blk_zone_wplug_bio_work(struct work_struct *work)
@@ -758,6 +1027,13 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 		return;
 	}
 
+	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
+		/* Error recovery will decide what to do with the BIO. */
+		bio_list_add_head(&zwplug->bio_list, bio);
+		blk_zone_wplug_unlock(zwplug, flags);
+		return;
+	}
+
 	blk_zone_wplug_unlock(zwplug, flags);
 
 	/*
@@ -771,6 +1047,120 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 	submit_bio_noacct_nocheck(bio);
 }
 
+static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
+{
+	switch (zone->cond) {
+	case BLK_ZONE_COND_IMP_OPEN:
+	case BLK_ZONE_COND_EXP_OPEN:
+	case BLK_ZONE_COND_CLOSED:
+		return zone->wp - zone->start;
+	case BLK_ZONE_COND_FULL:
+		return zone->len;
+	case BLK_ZONE_COND_EMPTY:
+		return 0;
+	case BLK_ZONE_COND_NOT_WP:
+	case BLK_ZONE_COND_OFFLINE:
+	case BLK_ZONE_COND_READONLY:
+	default:
+		/*
+		 * Conventional, offline and read-only zones do not have a valid
+		 * write pointer.
+		 */
+		return UINT_MAX;
+	}
+}
+
+static int blk_zone_wplug_get_zone_cb(struct blk_zone *zone,
+				      unsigned int idx, void *data)
+{
+	struct blk_zone *zonep = data;
+
+	*zonep = *zone;
+	return 0;
+}
+
+static void blk_zone_wplug_handle_error(struct gendisk *disk,
+					struct blk_zone_wplug *zwplug)
+{
+	unsigned int zno = zwplug - disk->zone_wplugs;
+	sector_t zone_start_sector = bdev_zone_sectors(disk->part0) * zno;
+	unsigned int noio_flag;
+	struct blk_zone zone;
+	unsigned long flags;
+	int ret;
+
+	/* Check if we have an error and clear it if we do. */
+	blk_zone_wplug_lock(zwplug, flags);
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
+		goto unlock;
+	zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
+	atomic_dec(&disk->zone_nr_wplugs_with_error);
+	blk_zone_wplug_unlock(zwplug, flags);
+
+	/* Get the current zone information from the device. */
+	noio_flag = memalloc_noio_save();
+	ret = disk->fops->report_zones(disk, zone_start_sector, 1,
+				       blk_zone_wplug_get_zone_cb, &zone);
+	memalloc_noio_restore(noio_flag);
+
+	blk_zone_wplug_lock(zwplug, flags);
+
+	if (ret != 1) {
+		/*
+		 * We failed to get the zone information, likely meaning that
+		 * something is really wrong with the device. Abort all
+		 * remaining plugged BIOs as otherwise we could end up waiting
+		 * forever on plugged BIOs to complete if there is a revalidate
+		 * or queue freeze on-going.
+		 */
+		blk_zone_wplug_abort(disk, zwplug);
+		goto unplug;
+	}
+
+	/* Update the zone capacity and write pointer offset. */
+	zwplug->wp_offset = blk_zone_wp_offset(&zone);
+	zwplug->capacity = zone.capacity;
+
+	blk_zone_wplug_abort_unaligned(disk, zwplug);
+
+	/* Restart BIO submission if we still have any BIO left. */
+	if (!bio_list_empty(&zwplug->bio_list)) {
+		WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
+		kblockd_schedule_work(&zwplug->bio_work);
+		goto unlock;
+	}
+
+unplug:
+	zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+
+unlock:
+	blk_zone_wplug_unlock(zwplug, flags);
+}
+
+static void disk_zone_wplugs_work(struct work_struct *work)
+{
+	struct gendisk *disk =
+		container_of(work, struct gendisk, zone_wplugs_work.work);
+	struct blk_zone_wplug *zwplug;
+	unsigned int i;
+
+	while (atomic_read(&disk->zone_nr_wplugs_with_error)) {
+		/* Serialize against revalidate. */
+		mutex_lock(&disk->zone_wplugs_mutex);
+
+		zwplug = disk->zone_wplugs;
+		if (!zwplug) {
+			mutex_unlock(&disk->zone_wplugs_mutex);
+			return;
+		}
+
+		for (i = 0; i < disk->nr_zones; i++, zwplug++)
+			blk_zone_wplug_handle_error(disk, zwplug);
+
+		mutex_unlock(&disk->zone_wplugs_mutex);
+	}
+}
+
 static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
 {
 	struct blk_zone_wplug *zwplugs;
@@ -794,6 +1184,7 @@ static void blk_zone_free_write_plugs(struct gendisk *disk,
 				      unsigned int nr_zones)
 {
 	struct blk_zone_wplug *zwplug = zwplugs;
+	unsigned long flags;
 	unsigned int i, n;
 
 	if (!zwplug)
@@ -801,7 +1192,13 @@ static void blk_zone_free_write_plugs(struct gendisk *disk,
 
 	/* Make sure we do not leak any plugged BIO. */
 	for (i = 0; i < nr_zones; i++, zwplug++) {
+		blk_zone_wplug_lock(zwplug, flags);
+		if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
+			atomic_dec(&disk->zone_nr_wplugs_with_error);
+			zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
+		}
 		n = blk_zone_wplug_abort(disk, zwplug);
+		blk_zone_wplug_unlock(zwplug, flags);
 		if (n)
 			pr_warn_ratelimited("%s: zone %u, %u plugged BIOs aborted\n",
 					    disk->disk_name, i, n);
@@ -812,6 +1209,9 @@ static void blk_zone_free_write_plugs(struct gendisk *disk,
 
 void disk_free_zone_resources(struct gendisk *disk)
 {
+	if (disk->zone_wplugs)
+		cancel_delayed_work_sync(&disk->zone_wplugs_work);
+
 	kfree(disk->conv_zones_bitmap);
 	disk->conv_zones_bitmap = NULL;
 	kfree(disk->seq_zones_wlock);
@@ -819,6 +1219,8 @@ void disk_free_zone_resources(struct gendisk *disk)
 
 	blk_zone_free_write_plugs(disk, disk->zone_wplugs, disk->nr_zones);
 	disk->zone_wplugs = NULL;
+
+	mutex_destroy(&disk->zone_wplugs_mutex);
 }
 
 struct blk_revalidate_zone_args {
@@ -890,6 +1292,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 			if (!args->seq_zones_wlock)
 				return -ENOMEM;
 		}
+		args->zone_wplugs[idx].capacity = zone->capacity;
+		args->zone_wplugs[idx].wp_offset = blk_zone_wp_offset(zone);
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
 	default:
@@ -964,6 +1368,13 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	if (!args.zone_wplugs)
 		goto out_restore_noio;
 
+	if (!disk->zone_wplugs) {
+		mutex_init(&disk->zone_wplugs_mutex);
+		atomic_set(&disk->zone_nr_wplugs_with_error, 0);
+		INIT_DELAYED_WORK(&disk->zone_wplugs_work,
+				  disk_zone_wplugs_work);
+	}
+
 	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
 				       blk_revalidate_zone_cb, &args);
 	if (!ret) {
@@ -989,12 +1400,14 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	 */
 	blk_mq_freeze_queue(q);
 	if (ret > 0) {
+		mutex_lock(&disk->zone_wplugs_mutex);
 		disk->nr_zones = args.nr_zones;
 		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
 		swap(disk->zone_wplugs, args.zone_wplugs);
 		if (update_driver_data)
 			update_driver_data(disk);
+		mutex_unlock(&disk->zone_wplugs_mutex);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
diff --git a/block/blk.h b/block/blk.h
index d0ecd5a2002c..7fbef6bb1aee 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -408,6 +408,11 @@ static inline bool bio_zone_write_plugging(struct bio *bio)
 {
 	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
 }
+static inline bool bio_is_zone_append(struct bio *bio)
+{
+	return bio_op(bio) == REQ_OP_ZONE_APPEND ||
+		bio_flagged(bio, BIO_EMULATES_ZONE_APPEND);
+}
 void blk_zone_write_plug_bio_merged(struct bio *bio);
 void blk_zone_write_plug_attempt_merge(struct request *rq);
 static inline void blk_zone_complete_request_bio(struct request *rq,
@@ -417,8 +422,9 @@ static inline void blk_zone_complete_request_bio(struct request *rq,
 	 * For zone append requests, the request sector indicates the location
 	 * at which the BIO data was written. Return this value to the BIO
 	 * issuer through the BIO iter sector.
-	 * For plugged zone writes, we need the original BIO sector so
-	 * that blk_zone_write_plug_bio_endio() can lookup the zone write plug.
+	 * For plugged zone writes, which include emulated zone append, we need
+	 * the original BIO sector so that blk_zone_write_plug_bio_endio() can
+	 * lookup the zone write plug.
 	 */
 	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
 		bio->bi_iter.bi_sector = rq->__sector;
@@ -437,6 +443,10 @@ static inline bool bio_zone_write_plugging(struct bio *bio)
 {
 	return false;
 }
+static inline bool bio_is_zone_append(struct bio *bio)
+{
+	return false;
+}
 static inline void blk_zone_write_plug_bio_merged(struct bio *bio)
 {
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 19839d303289..5c5343099800 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -309,6 +309,7 @@ enum {
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
 	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
+	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
 	BIO_FLAG_LAST
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 87fba5af34ba..e619e10847bd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -195,6 +195,9 @@ struct gendisk {
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
 	struct blk_zone_wplug	*zone_wplugs;
+	struct mutex		zone_wplugs_mutex;
+	atomic_t		zone_nr_wplugs_with_error;
+	struct delayed_work	zone_wplugs_work;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 #if IS_ENABLED(CONFIG_CDROM)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 09/26] block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (7 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 08/26] block: Implement zone append emulation Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:26   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 10/26] dm: Use the block layer zone append emulation Damien Le Moal
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Remove the check in blk_revalidate_disk_zones() restricting the use of
this function to mq request-based drivers so that BIO-based drivers
can use it as well. This is safe as long as the BIO-based block
device queue is already set up and usable, as it should be, and can be
safely frozen.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 929c28796c41..8bf6821735f3 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1316,8 +1316,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
  * be called within the disk ->revalidate method for blk-mq based drivers.
  * Before calling this function, the device driver must already have set the
  * device zone size (chunk_sector limit) and the max zone append limit.
- * For BIO based drivers, this function cannot be used. BIO based device drivers
- * only need to set disk->nr_zones so that the sysfs exposed value is correct.
+ * BIO-based drivers can also use this function as long as the device queue
+ * can be safely frozen.
  * If the @update_driver_data callback function is not NULL, the callback is
  * executed with the device request queue frozen after all zones have been
  * checked.
@@ -1334,8 +1334,6 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 
 	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
 		return -EIO;
-	if (WARN_ON_ONCE(!queue_is_mq(q)))
-		return -EIO;
 
 	if (!capacity)
 		return -ENODEV;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 10/26] dm: Use the block layer zone append emulation
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (8 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 09/26] block: Allow BIO-based drivers to use blk_revalidate_disk_zones() Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-03 17:58   ` Mike Snitzer
  2024-02-04 12:30   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 11/26] scsi: sd: " Damien Le Moal
                   ` (18 subsequent siblings)
  28 siblings, 2 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

For targets requiring zone append operation emulation with regular
writes (e.g. dm-crypt), we can use the block layer emulation provided by
zone write plugging. Remove DM implemented zone append emulation and
enable the block layer one.

This is done by setting the max_zone_append_sectors limit of the
mapped device queue to 0 for mapped devices that have a target table
that cannot support native zone append operations. These includes
mixed zoned and non-zoned targets, or targets that explicitly requested
emulation of zone append (e.g. dm-crypt). For these mapped devices, the
new field emulate_zone_append is set to true. dm_split_and_process_bio()
is modified to call blk_zone_write_plug_bio() for such device to let the
block layer transform zone append operations into regular writes. This
is done after ensuring that the submitted BIO is split if it straddles
zone boundaries.

dm_revalidate_zones() is also modified to use the block layer provided
function blk_revalidate_disk_zones() so that all zone resources needed
for zone append emulation are allocated and initialized by the block
layer without DM core needing to do anything. Since the device table is
not yet live when dm_revalidate_zones() is executed, enabling the use of
blk_revalidate_disk_zones() requires adding a pointer to the device
table in struct mapped_device. This avoids errors in
dm_blk_report_zones() trying to get the table with dm_get_live_table().
The mapped device table pointer is set to the table passed as argument
to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and
reset to NULL after this function returns to restore the live table
handling for user calls of report zones.

All the code related to zone append emulation is removed from
dm-zone.c. This leads to simplifications of the functions __map_bio()
and dm_zone_endio(). The latter function now only needs to deal with
completions of real zone append operations for targets that support it.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/md/dm-core.h |  11 +-
 drivers/md/dm-zone.c | 470 ++++---------------------------------------
 drivers/md/dm.c      |  44 ++--
 drivers/md/dm.h      |   7 -
 4 files changed, 68 insertions(+), 464 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 095b9b49aa82..42d6b21e2395 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -138,7 +138,8 @@ struct mapped_device {
 
 #ifdef CONFIG_BLK_DEV_ZONED
 	unsigned int nr_zones;
-	unsigned int *zwp_offset;
+	bool emulate_zone_append;
+	void *zone_revalidate_map;
 #endif
 
 #ifdef CONFIG_IMA
@@ -158,7 +159,6 @@ struct mapped_device {
 #define DMF_DEFERRED_REMOVE 6
 #define DMF_SUSPENDED_INTERNALLY 7
 #define DMF_POST_SUSPENDING 8
-#define DMF_EMULATE_ZONE_APPEND 9
 
 void disable_discard(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
@@ -177,13 +177,6 @@ DECLARE_STATIC_KEY_FALSE(stats_enabled);
 DECLARE_STATIC_KEY_FALSE(swap_bios_enabled);
 DECLARE_STATIC_KEY_FALSE(zoned_enabled);
 
-static inline bool dm_emulate_zone_append(struct mapped_device *md)
-{
-	if (blk_queue_is_zoned(md->queue))
-		return test_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
-	return false;
-}
-
 #define DM_TABLE_MAX_DEPTH 16
 
 struct dm_table {
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index eb9832b22b14..570b44b924b8 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -60,16 +60,23 @@ int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 	struct dm_table *map;
 	int srcu_idx, ret;
 
-	if (dm_suspended_md(md))
-		return -EAGAIN;
+	if (!md->zone_revalidate_map) {
+		/* Regular user context */
+		if (dm_suspended_md(md))
+			return -EAGAIN;
 
-	map = dm_get_live_table(md, &srcu_idx);
-	if (!map)
-		return -EIO;
+		map = dm_get_live_table(md, &srcu_idx);
+		if (!map)
+			return -EIO;
+	} else {
+		/* Zone revalidation during __bind() */
+		map = md->zone_revalidate_map;
+	}
 
 	ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
 
-	dm_put_live_table(md, srcu_idx);
+	if (!md->zone_revalidate_map)
+		dm_put_live_table(md, srcu_idx);
 
 	return ret;
 }
@@ -138,85 +145,6 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
 	}
 }
 
-void dm_cleanup_zoned_dev(struct mapped_device *md)
-{
-	if (md->disk) {
-		bitmap_free(md->disk->conv_zones_bitmap);
-		md->disk->conv_zones_bitmap = NULL;
-		bitmap_free(md->disk->seq_zones_wlock);
-		md->disk->seq_zones_wlock = NULL;
-	}
-
-	kvfree(md->zwp_offset);
-	md->zwp_offset = NULL;
-	md->nr_zones = 0;
-}
-
-static unsigned int dm_get_zone_wp_offset(struct blk_zone *zone)
-{
-	switch (zone->cond) {
-	case BLK_ZONE_COND_IMP_OPEN:
-	case BLK_ZONE_COND_EXP_OPEN:
-	case BLK_ZONE_COND_CLOSED:
-		return zone->wp - zone->start;
-	case BLK_ZONE_COND_FULL:
-		return zone->len;
-	case BLK_ZONE_COND_EMPTY:
-	case BLK_ZONE_COND_NOT_WP:
-	case BLK_ZONE_COND_OFFLINE:
-	case BLK_ZONE_COND_READONLY:
-	default:
-		/*
-		 * Conventional, offline and read-only zones do not have a valid
-		 * write pointer. Use 0 as for an empty zone.
-		 */
-		return 0;
-	}
-}
-
-static int dm_zone_revalidate_cb(struct blk_zone *zone, unsigned int idx,
-				 void *data)
-{
-	struct mapped_device *md = data;
-	struct gendisk *disk = md->disk;
-
-	switch (zone->type) {
-	case BLK_ZONE_TYPE_CONVENTIONAL:
-		if (!disk->conv_zones_bitmap) {
-			disk->conv_zones_bitmap = bitmap_zalloc(disk->nr_zones,
-								GFP_NOIO);
-			if (!disk->conv_zones_bitmap)
-				return -ENOMEM;
-		}
-		set_bit(idx, disk->conv_zones_bitmap);
-		break;
-	case BLK_ZONE_TYPE_SEQWRITE_REQ:
-	case BLK_ZONE_TYPE_SEQWRITE_PREF:
-		if (!disk->seq_zones_wlock) {
-			disk->seq_zones_wlock = bitmap_zalloc(disk->nr_zones,
-							      GFP_NOIO);
-			if (!disk->seq_zones_wlock)
-				return -ENOMEM;
-		}
-		if (!md->zwp_offset) {
-			md->zwp_offset =
-				kvcalloc(disk->nr_zones, sizeof(unsigned int),
-					 GFP_KERNEL);
-			if (!md->zwp_offset)
-				return -ENOMEM;
-		}
-		md->zwp_offset[idx] = dm_get_zone_wp_offset(zone);
-
-		break;
-	default:
-		DMERR("Invalid zone type 0x%x at sectors %llu",
-		      (int)zone->type, zone->start);
-		return -ENODEV;
-	}
-
-	return 0;
-}
-
 /*
  * Revalidate the zones of a mapped device to initialize resource necessary
  * for zone append emulation. Note that we cannot simply use the block layer
@@ -226,41 +154,32 @@ static int dm_zone_revalidate_cb(struct blk_zone *zone, unsigned int idx,
 static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
 {
 	struct gendisk *disk = md->disk;
-	unsigned int noio_flag;
 	int ret;
 
-	/*
-	 * Check if something changed. If yes, cleanup the current resources
-	 * and reallocate everything.
-	 */
+	/* Revalidate only if something changed. */
 	if (!disk->nr_zones || disk->nr_zones != md->nr_zones)
-		dm_cleanup_zoned_dev(md);
+		md->nr_zones = 0;
+
 	if (md->nr_zones)
 		return 0;
 
 	/*
-	 * Scan all zones to initialize everything. Ensure that all vmalloc
-	 * operations in this context are done as if GFP_NOIO was specified.
+	 * Our table is not live yet, so the call to dm_get_live_table()
+	 * in dm_blk_report_zones() would fail. Set a temporary pointer to
+	 * our table for dm_blk_report_zones() to use directly.
 	 */
-	noio_flag = memalloc_noio_save();
-	ret = dm_blk_do_report_zones(md, t, 0, disk->nr_zones,
-				     dm_zone_revalidate_cb, md);
-	memalloc_noio_restore(noio_flag);
-	if (ret < 0)
-		goto err;
-	if (ret != disk->nr_zones) {
-		ret = -EIO;
-		goto err;
+	md->zone_revalidate_map = t;
+	ret = blk_revalidate_disk_zones(disk, NULL);
+	md->zone_revalidate_map = NULL;
+
+	if (ret) {
+		DMERR("Revalidate zones failed %d", ret);
+		return ret;
 	}
 
 	md->nr_zones = disk->nr_zones;
 
 	return 0;
-
-err:
-	DMERR("Revalidate zones failed %d", ret);
-	dm_cleanup_zoned_dev(md);
-	return ret;
 }
 
 static int device_not_zone_append_capable(struct dm_target *ti,
@@ -289,6 +208,7 @@ static bool dm_table_supports_zone_append(struct dm_table *t)
 int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 {
 	struct mapped_device *md = t->md;
+	int ret;
 
 	/*
 	 * For a zoned target, the number of zones should be updated for the
@@ -298,287 +218,29 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 	md->disk->nr_zones = bdev_nr_zones(md->disk->part0);
 
 	/* Check if zone append is natively supported */
-	if (dm_table_supports_zone_append(t)) {
-		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
-		dm_cleanup_zoned_dev(md);
+	if (dm_table_supports_zone_append(t))
 		return 0;
-	}
 
 	/*
-	 * Mark the mapped device as needing zone append emulation and
+	 * Set the mapped device queue as needing zone append emulation and
 	 * initialize the emulation resources once the capacity is set.
 	 */
-	set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+	md->emulate_zone_append = true;
+	blk_queue_max_zone_append_sectors(q, 0);
 	if (!get_capacity(md->disk))
 		return 0;
 
-	return dm_revalidate_zones(md, t);
-}
+	ret = dm_revalidate_zones(md, t);
+	if (ret)
+		return ret;
 
-static int dm_update_zone_wp_offset_cb(struct blk_zone *zone, unsigned int idx,
-				       void *data)
-{
-	unsigned int *wp_offset = data;
-
-	*wp_offset = dm_get_zone_wp_offset(zone);
+	DMINFO("%s using %s zone append",
+	       md->disk->disk_name,
+	       queue_emulates_zone_append(q) ? "emulated" : "native");
 
 	return 0;
 }
 
-static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
-				    unsigned int *wp_ofst)
-{
-	sector_t sector = zno * bdev_zone_sectors(md->disk->part0);
-	unsigned int noio_flag;
-	struct dm_table *t;
-	int srcu_idx, ret;
-
-	t = dm_get_live_table(md, &srcu_idx);
-	if (!t)
-		return -EIO;
-
-	/*
-	 * Ensure that all memory allocations in this context are done as if
-	 * GFP_NOIO was specified.
-	 */
-	noio_flag = memalloc_noio_save();
-	ret = dm_blk_do_report_zones(md, t, sector, 1,
-				     dm_update_zone_wp_offset_cb, wp_ofst);
-	memalloc_noio_restore(noio_flag);
-
-	dm_put_live_table(md, srcu_idx);
-
-	if (ret != 1)
-		return -EIO;
-
-	return 0;
-}
-
-struct orig_bio_details {
-	enum req_op op;
-	unsigned int nr_sectors;
-};
-
-/*
- * First phase of BIO mapping for targets with zone append emulation:
- * check all BIO that change a zone writer pointer and change zone
- * append operations into regular write operations.
- */
-static bool dm_zone_map_bio_begin(struct mapped_device *md,
-				  unsigned int zno, struct bio *clone)
-{
-	sector_t zsectors = bdev_zone_sectors(md->disk->part0);
-	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
-
-	/*
-	 * If the target zone is in an error state, recover by inspecting the
-	 * zone to get its current write pointer position. Note that since the
-	 * target zone is already locked, a BIO issuing context should never
-	 * see the zone write in the DM_ZONE_UPDATING_WP_OFST state.
-	 */
-	if (zwp_offset == DM_ZONE_INVALID_WP_OFST) {
-		if (dm_update_zone_wp_offset(md, zno, &zwp_offset))
-			return false;
-		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
-	}
-
-	switch (bio_op(clone)) {
-	case REQ_OP_ZONE_RESET:
-	case REQ_OP_ZONE_FINISH:
-		return true;
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-		/* Writes must be aligned to the zone write pointer */
-		if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
-			return false;
-		break;
-	case REQ_OP_ZONE_APPEND:
-		/*
-		 * Change zone append operations into a non-mergeable regular
-		 * writes directed at the current write pointer position of the
-		 * target zone.
-		 */
-		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
-			(clone->bi_opf & (~REQ_OP_MASK));
-		clone->bi_iter.bi_sector += zwp_offset;
-		break;
-	default:
-		DMWARN_LIMIT("Invalid BIO operation");
-		return false;
-	}
-
-	/* Cannot write to a full zone */
-	if (zwp_offset >= zsectors)
-		return false;
-
-	return true;
-}
-
-/*
- * Second phase of BIO mapping for targets with zone append emulation:
- * update the zone write pointer offset array to account for the additional
- * data written to a zone. Note that at this point, the remapped clone BIO
- * may already have completed, so we do not touch it.
- */
-static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
-					struct orig_bio_details *orig_bio_details,
-					unsigned int nr_sectors)
-{
-	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
-
-	/* The clone BIO may already have been completed and failed */
-	if (zwp_offset == DM_ZONE_INVALID_WP_OFST)
-		return BLK_STS_IOERR;
-
-	/* Update the zone wp offset */
-	switch (orig_bio_details->op) {
-	case REQ_OP_ZONE_RESET:
-		WRITE_ONCE(md->zwp_offset[zno], 0);
-		return BLK_STS_OK;
-	case REQ_OP_ZONE_FINISH:
-		WRITE_ONCE(md->zwp_offset[zno],
-			   bdev_zone_sectors(md->disk->part0));
-		return BLK_STS_OK;
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-		WRITE_ONCE(md->zwp_offset[zno], zwp_offset + nr_sectors);
-		return BLK_STS_OK;
-	case REQ_OP_ZONE_APPEND:
-		/*
-		 * Check that the target did not truncate the write operation
-		 * emulating a zone append.
-		 */
-		if (nr_sectors != orig_bio_details->nr_sectors) {
-			DMWARN_LIMIT("Truncated write for zone append");
-			return BLK_STS_IOERR;
-		}
-		WRITE_ONCE(md->zwp_offset[zno], zwp_offset + nr_sectors);
-		return BLK_STS_OK;
-	default:
-		DMWARN_LIMIT("Invalid BIO operation");
-		return BLK_STS_IOERR;
-	}
-}
-
-static inline void dm_zone_lock(struct gendisk *disk, unsigned int zno,
-				struct bio *clone)
-{
-	if (WARN_ON_ONCE(bio_flagged(clone, BIO_ZONE_WRITE_LOCKED)))
-		return;
-
-	wait_on_bit_lock_io(disk->seq_zones_wlock, zno, TASK_UNINTERRUPTIBLE);
-	bio_set_flag(clone, BIO_ZONE_WRITE_LOCKED);
-}
-
-static inline void dm_zone_unlock(struct gendisk *disk, unsigned int zno,
-				  struct bio *clone)
-{
-	if (!bio_flagged(clone, BIO_ZONE_WRITE_LOCKED))
-		return;
-
-	WARN_ON_ONCE(!test_bit(zno, disk->seq_zones_wlock));
-	clear_bit_unlock(zno, disk->seq_zones_wlock);
-	smp_mb__after_atomic();
-	wake_up_bit(disk->seq_zones_wlock, zno);
-
-	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
-}
-
-static bool dm_need_zone_wp_tracking(struct bio *bio)
-{
-	/*
-	 * Special processing is not needed for operations that do not need the
-	 * zone write lock, that is, all operations that target conventional
-	 * zones and all operations that do not modify directly a sequential
-	 * zone write pointer.
-	 */
-	if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
-		return false;
-	switch (bio_op(bio)) {
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-	case REQ_OP_ZONE_RESET:
-	case REQ_OP_ZONE_FINISH:
-	case REQ_OP_ZONE_APPEND:
-		return bio_zone_is_seq(bio);
-	default:
-		return false;
-	}
-}
-
-/*
- * Special IO mapping for targets needing zone append emulation.
- */
-int dm_zone_map_bio(struct dm_target_io *tio)
-{
-	struct dm_io *io = tio->io;
-	struct dm_target *ti = tio->ti;
-	struct mapped_device *md = io->md;
-	struct bio *clone = &tio->clone;
-	struct orig_bio_details orig_bio_details;
-	unsigned int zno;
-	blk_status_t sts;
-	int r;
-
-	/*
-	 * IOs that do not change a zone write pointer do not need
-	 * any additional special processing.
-	 */
-	if (!dm_need_zone_wp_tracking(clone))
-		return ti->type->map(ti, clone);
-
-	/* Lock the target zone */
-	zno = bio_zone_no(clone);
-	dm_zone_lock(md->disk, zno, clone);
-
-	orig_bio_details.nr_sectors = bio_sectors(clone);
-	orig_bio_details.op = bio_op(clone);
-
-	/*
-	 * Check that the bio and the target zone write pointer offset are
-	 * both valid, and if the bio is a zone append, remap it to a write.
-	 */
-	if (!dm_zone_map_bio_begin(md, zno, clone)) {
-		dm_zone_unlock(md->disk, zno, clone);
-		return DM_MAPIO_KILL;
-	}
-
-	/* Let the target do its work */
-	r = ti->type->map(ti, clone);
-	switch (r) {
-	case DM_MAPIO_SUBMITTED:
-		/*
-		 * The target submitted the clone BIO. The target zone will
-		 * be unlocked on completion of the clone.
-		 */
-		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
-					  *tio->len_ptr);
-		break;
-	case DM_MAPIO_REMAPPED:
-		/*
-		 * The target only remapped the clone BIO. In case of error,
-		 * unlock the target zone here as the clone will not be
-		 * submitted.
-		 */
-		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
-					  *tio->len_ptr);
-		if (sts != BLK_STS_OK)
-			dm_zone_unlock(md->disk, zno, clone);
-		break;
-	case DM_MAPIO_REQUEUE:
-	case DM_MAPIO_KILL:
-	default:
-		dm_zone_unlock(md->disk, zno, clone);
-		sts = BLK_STS_IOERR;
-		break;
-	}
-
-	if (sts != BLK_STS_OK)
-		return DM_MAPIO_KILL;
-
-	return r;
-}
-
 /*
  * IO completion callback called from clone_endio().
  */
@@ -587,61 +249,17 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
 	struct mapped_device *md = io->md;
 	struct gendisk *disk = md->disk;
 	struct bio *orig_bio = io->orig_bio;
-	unsigned int zwp_offset;
-	unsigned int zno;
 
 	/*
-	 * For targets that do not emulate zone append, we only need to
-	 * handle native zone-append bios.
+	 * Get the offset within the zone of the written sector
+	 * and add that to the original bio sector position.
 	 */
-	if (!dm_emulate_zone_append(md)) {
-		/*
-		 * Get the offset within the zone of the written sector
-		 * and add that to the original bio sector position.
-		 */
-		if (clone->bi_status == BLK_STS_OK &&
-		    bio_op(clone) == REQ_OP_ZONE_APPEND) {
-			sector_t mask =
-				(sector_t)bdev_zone_sectors(disk->part0) - 1;
-
-			orig_bio->bi_iter.bi_sector +=
-				clone->bi_iter.bi_sector & mask;
-		}
-
-		return;
-	}
+	if (clone->bi_status == BLK_STS_OK &&
+	    bio_op(clone) == REQ_OP_ZONE_APPEND) {
+		sector_t mask = bdev_zone_sectors(disk->part0) - 1;
 
-	/*
-	 * For targets that do emulate zone append, if the clone BIO does not
-	 * own the target zone write lock, we have nothing to do.
-	 */
-	if (!bio_flagged(clone, BIO_ZONE_WRITE_LOCKED))
-		return;
-
-	zno = bio_zone_no(orig_bio);
-
-	if (clone->bi_status != BLK_STS_OK) {
-		/*
-		 * BIOs that modify a zone write pointer may leave the zone
-		 * in an unknown state in case of failure (e.g. the write
-		 * pointer was only partially advanced). In this case, set
-		 * the target zone write pointer as invalid unless it is
-		 * already being updated.
-		 */
-		WRITE_ONCE(md->zwp_offset[zno], DM_ZONE_INVALID_WP_OFST);
-	} else if (bio_op(orig_bio) == REQ_OP_ZONE_APPEND) {
-		/*
-		 * Get the written sector for zone append operation that were
-		 * emulated using regular write operations.
-		 */
-		zwp_offset = READ_ONCE(md->zwp_offset[zno]);
-		if (WARN_ON_ONCE(zwp_offset < bio_sectors(orig_bio)))
-			WRITE_ONCE(md->zwp_offset[zno],
-				   DM_ZONE_INVALID_WP_OFST);
-		else
-			orig_bio->bi_iter.bi_sector +=
-				zwp_offset - bio_sectors(orig_bio);
+		orig_bio->bi_iter.bi_sector += clone->bi_iter.bi_sector & mask;
 	}
 
-	dm_zone_unlock(disk, zno, clone);
+	return;
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8dcabf84d866..92ce3b2eb4ae 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1419,25 +1419,12 @@ static void __map_bio(struct bio *clone)
 		down(&md->swap_bios_semaphore);
 	}
 
-	if (static_branch_unlikely(&zoned_enabled)) {
-		/*
-		 * Check if the IO needs a special mapping due to zone append
-		 * emulation on zoned target. In this case, dm_zone_map_bio()
-		 * calls the target map operation.
-		 */
-		if (unlikely(dm_emulate_zone_append(md)))
-			r = dm_zone_map_bio(tio);
-		else
-			goto do_map;
-	} else {
-do_map:
-		if (likely(ti->type->map == linear_map))
-			r = linear_map(ti, clone);
-		else if (ti->type->map == stripe_map)
-			r = stripe_map(ti, clone);
-		else
-			r = ti->type->map(ti, clone);
-	}
+	if (likely(ti->type->map == linear_map))
+		r = linear_map(ti, clone);
+	else if (ti->type->map == stripe_map)
+		r = stripe_map(ti, clone);
+	else
+		r = ti->type->map(ti, clone);
 
 	switch (r) {
 	case DM_MAPIO_SUBMITTED:
@@ -1774,19 +1761,33 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	struct clone_info ci;
 	struct dm_io *io;
 	blk_status_t error = BLK_STS_OK;
-	bool is_abnormal;
+	bool is_abnormal, need_split;
 
 	is_abnormal = is_abnormal_io(bio);
-	if (unlikely(is_abnormal)) {
+	if (likely(!md->emulate_zone_append))
+		need_split = is_abnormal;
+	else
+		need_split = is_abnormal || bio_straddle_zones(bio);
+	if (unlikely(need_split)) {
 		/*
 		 * Use bio_split_to_limits() for abnormal IO (e.g. discard, etc)
 		 * otherwise associated queue_limits won't be imposed.
+		 * Also split the BIO for mapped devices needing zone append
+		 * emulation to ensure that the BIO does not cross zone
+		 * boundaries.
 		 */
 		bio = bio_split_to_limits(bio);
 		if (!bio)
 			return;
 	}
 
+	/*
+	 * Use the block layer zone write plugging for mapped devices that
+	 * need zone append emulation (e.g. dm-crypt).
+	 */
+	if (md->emulate_zone_append && blk_zone_write_plug_bio(bio, 0))
+		return;
+
 	/* Only support nowait for normal IO */
 	if (unlikely(bio->bi_opf & REQ_NOWAIT) && !is_abnormal) {
 		io = alloc_io(md, bio, GFP_NOWAIT);
@@ -2007,7 +2008,6 @@ static void cleanup_mapped_device(struct mapped_device *md)
 		md->dax_dev = NULL;
 	}
 
-	dm_cleanup_zoned_dev(md);
 	if (md->disk) {
 		spin_lock(&_minor_lock);
 		md->disk->private_data = NULL;
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 7f1acbf6bd9e..08a7c34eeca0 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -104,22 +104,15 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
 int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q);
 void dm_zone_endio(struct dm_io *io, struct bio *clone);
 #ifdef CONFIG_BLK_DEV_ZONED
-void dm_cleanup_zoned_dev(struct mapped_device *md);
 int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
 bool dm_is_zone_write(struct mapped_device *md, struct bio *bio);
-int dm_zone_map_bio(struct dm_target_io *io);
 #else
-static inline void dm_cleanup_zoned_dev(struct mapped_device *md) {}
 #define dm_blk_report_zones	NULL
 static inline bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
 {
 	return false;
 }
-static inline int dm_zone_map_bio(struct dm_target_io *tio)
-{
-	return DM_MAPIO_KILL;
-}
 #endif
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 11/26] scsi: sd: Use the block layer zone append emulation
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (9 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 10/26] dm: Use the block layer zone append emulation Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:29   ` Hannes Reinecke
  2024-02-06  1:55   ` Martin K. Petersen
  2024-02-02  7:30 ` [PATCH 12/26] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
                   ` (17 subsequent siblings)
  28 siblings, 2 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Mark the request queue of a TYPE_ZBC device as needing zone append
emulation by setting the device queue max_zone_append_sectors limit to
0. This enables the generic block layer emulation provided by zone
write plugging. With this, the sd driver will never see a
REQ_OP_ZONE_APPEND request and the zone append emulation code
implemented in sd_zbc.c can be removed.
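
For reference, the opt-in boils down to the pattern below (a minimal
sketch with error handling omitted; q, disk and zone_sectors are
placeholders, and the sd_zbc.c hunk below is the authoritative change):

	/*
	 * Sketch: a zoned driver that cannot execute REQ_OP_ZONE_APPEND
	 * natively advertises a zero limit, which this series defines to
	 * mean "emulate zone append using regular writes and zone write
	 * plugs".
	 */
	blk_queue_chunk_sectors(q, zone_sectors);	/* zone size */
	blk_queue_max_zone_append_sectors(q, 0);	/* 0 == emulated */
	ret = blk_revalidate_disk_zones(disk, NULL);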

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/scsi/sd.c     |   8 -
 drivers/scsi/sd.h     |  19 ---
 drivers/scsi/sd_zbc.c | 335 ++----------------------------------------
 3 files changed, 10 insertions(+), 352 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0833b3e6aa6e..38e4c7cb9e3d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1233,12 +1233,6 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 		}
 	}
 
-	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
-		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
-		if (ret)
-			goto fail;
-	}
-
 	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
 	dix = scsi_prot_sg_count(cmd);
 	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
@@ -1321,7 +1315,6 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
 		return sd_setup_flush_cmnd(cmd);
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
-	case REQ_OP_ZONE_APPEND:
 		return sd_setup_read_write_cmnd(cmd);
 	case REQ_OP_ZONE_RESET:
 		return sd_zbc_setup_zone_mgmt_cmnd(cmd, ZO_RESET_WRITE_POINTER,
@@ -3792,7 +3785,6 @@ static void scsi_disk_release(struct device *dev)
 	struct scsi_disk *sdkp = to_scsi_disk(dev);
 
 	ida_free(&sd_index_ida, sdkp->index);
-	sd_zbc_free_zone_info(sdkp);
 	put_device(&sdkp->device->sdev_gendev);
 	free_opal_dev(sdkp->opal_dev);
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 409dda5350d1..bba7ad04d1c4 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -104,12 +104,6 @@ struct scsi_disk {
 	 * between zone starting LBAs is constant.
 	 */
 	u32		zone_starting_lba_gran;
-	u32		*zones_wp_offset;
-	spinlock_t	zones_wp_offset_lock;
-	u32		*rev_wp_offset;
-	struct mutex	rev_mutex;
-	struct work_struct zone_wp_offset_work;
-	char		*zone_wp_update_buf;
 #endif
 	atomic_t	openers;
 	sector_t	capacity;	/* size in logical blocks */
@@ -242,7 +236,6 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 
 #ifdef CONFIG_BLK_DEV_ZONED
 
-void sd_zbc_free_zone_info(struct scsi_disk *sdkp);
 int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE]);
 int sd_zbc_revalidate_zones(struct scsi_disk *sdkp);
 blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
@@ -252,13 +245,8 @@ unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
 
-blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
-				        unsigned int nr_blocks);
-
 #else /* CONFIG_BLK_DEV_ZONED */
 
-static inline void sd_zbc_free_zone_info(struct scsi_disk *sdkp) {}
-
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE])
 {
 	return 0;
@@ -282,13 +270,6 @@ static inline unsigned int sd_zbc_complete(struct scsi_cmnd *cmd,
 	return good_bytes;
 }
 
-static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
-						      sector_t *lba,
-						      unsigned int nr_blocks)
-{
-	return BLK_STS_TARGET;
-}
-
 #define sd_zbc_report_zones NULL
 
 #endif /* CONFIG_BLK_DEV_ZONED */
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 26af5ab7d7c1..d0ead9858954 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -23,36 +23,6 @@
 #define CREATE_TRACE_POINTS
 #include "sd_trace.h"
 
-/**
- * sd_zbc_get_zone_wp_offset - Get zone write pointer offset.
- * @zone: Zone for which to return the write pointer offset.
- *
- * Return: offset of the write pointer from the start of the zone.
- */
-static unsigned int sd_zbc_get_zone_wp_offset(struct blk_zone *zone)
-{
-	if (zone->type == ZBC_ZONE_TYPE_CONV)
-		return 0;
-
-	switch (zone->cond) {
-	case BLK_ZONE_COND_IMP_OPEN:
-	case BLK_ZONE_COND_EXP_OPEN:
-	case BLK_ZONE_COND_CLOSED:
-		return zone->wp - zone->start;
-	case BLK_ZONE_COND_FULL:
-		return zone->len;
-	case BLK_ZONE_COND_EMPTY:
-	case BLK_ZONE_COND_OFFLINE:
-	case BLK_ZONE_COND_READONLY:
-	default:
-		/*
-		 * Offline and read-only zones do not have a valid
-		 * write pointer. Use 0 as for an empty zone.
-		 */
-		return 0;
-	}
-}
-
 /* Whether or not a SCSI zone descriptor describes a gap zone. */
 static bool sd_zbc_is_gap_zone(const u8 buf[64])
 {
@@ -121,9 +91,6 @@ static int sd_zbc_parse_report(struct scsi_disk *sdkp, const u8 buf[64],
 	if (ret)
 		return ret;
 
-	if (sdkp->rev_wp_offset)
-		sdkp->rev_wp_offset[idx] = sd_zbc_get_zone_wp_offset(&zone);
-
 	return 0;
 }
 
@@ -347,123 +314,6 @@ static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
 	return BLK_STS_OK;
 }
 
-#define SD_ZBC_INVALID_WP_OFST	(~0u)
-#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)
-
-static int sd_zbc_update_wp_offset_cb(struct blk_zone *zone, unsigned int idx,
-				    void *data)
-{
-	struct scsi_disk *sdkp = data;
-
-	lockdep_assert_held(&sdkp->zones_wp_offset_lock);
-
-	sdkp->zones_wp_offset[idx] = sd_zbc_get_zone_wp_offset(zone);
-
-	return 0;
-}
-
-/*
- * An attempt to append a zone triggered an invalid write pointer error.
- * Reread the write pointer of the zone(s) in which the append failed.
- */
-static void sd_zbc_update_wp_offset_workfn(struct work_struct *work)
-{
-	struct scsi_disk *sdkp;
-	unsigned long flags;
-	sector_t zno;
-	int ret;
-
-	sdkp = container_of(work, struct scsi_disk, zone_wp_offset_work);
-
-	spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
-	for (zno = 0; zno < sdkp->zone_info.nr_zones; zno++) {
-		if (sdkp->zones_wp_offset[zno] != SD_ZBC_UPDATING_WP_OFST)
-			continue;
-
-		spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);
-		ret = sd_zbc_do_report_zones(sdkp, sdkp->zone_wp_update_buf,
-					     SD_BUF_SIZE,
-					     zno * sdkp->zone_info.zone_blocks, true);
-		spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
-		if (!ret)
-			sd_zbc_parse_report(sdkp, sdkp->zone_wp_update_buf + 64,
-					    zno, sd_zbc_update_wp_offset_cb,
-					    sdkp);
-	}
-	spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);
-
-	scsi_device_put(sdkp->device);
-}
-
-/**
- * sd_zbc_prepare_zone_append() - Prepare an emulated ZONE_APPEND command.
- * @cmd: the command to setup
- * @lba: the LBA to patch
- * @nr_blocks: the number of LBAs to be written
- *
- * Called from sd_setup_read_write_cmnd() for REQ_OP_ZONE_APPEND.
- * @sd_zbc_prepare_zone_append() handles the necessary zone wrote locking and
- * patching of the lba for an emulated ZONE_APPEND command.
- *
- * In case the cached write pointer offset is %SD_ZBC_INVALID_WP_OFST it will
- * schedule a REPORT ZONES command and return BLK_STS_IOERR.
- */
-blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
-					unsigned int nr_blocks)
-{
-	struct request *rq = scsi_cmd_to_rq(cmd);
-	struct scsi_disk *sdkp = scsi_disk(rq->q->disk);
-	unsigned int wp_offset, zno = blk_rq_zone_no(rq);
-	unsigned long flags;
-	blk_status_t ret;
-
-	ret = sd_zbc_cmnd_checks(cmd);
-	if (ret != BLK_STS_OK)
-		return ret;
-
-	if (!blk_rq_zone_is_seq(rq))
-		return BLK_STS_IOERR;
-
-	/* Unlock of the write lock will happen in sd_zbc_complete() */
-	if (!blk_req_zone_write_trylock(rq))
-		return BLK_STS_ZONE_RESOURCE;
-
-	spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
-	wp_offset = sdkp->zones_wp_offset[zno];
-	switch (wp_offset) {
-	case SD_ZBC_INVALID_WP_OFST:
-		/*
-		 * We are about to schedule work to update a zone write pointer
-		 * offset, which will cause the zone append command to be
-		 * requeued. So make sure that the scsi device does not go away
-		 * while the work is being processed.
-		 */
-		if (scsi_device_get(sdkp->device)) {
-			ret = BLK_STS_IOERR;
-			break;
-		}
-		sdkp->zones_wp_offset[zno] = SD_ZBC_UPDATING_WP_OFST;
-		schedule_work(&sdkp->zone_wp_offset_work);
-		fallthrough;
-	case SD_ZBC_UPDATING_WP_OFST:
-		ret = BLK_STS_DEV_RESOURCE;
-		break;
-	default:
-		wp_offset = sectors_to_logical(sdkp->device, wp_offset);
-		if (wp_offset + nr_blocks > sdkp->zone_info.zone_blocks) {
-			ret = BLK_STS_IOERR;
-			break;
-		}
-
-		trace_scsi_prepare_zone_append(cmd, *lba, wp_offset);
-		*lba += wp_offset;
-	}
-	spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);
-	if (ret)
-		blk_req_zone_write_unlock(rq);
-	return ret;
-}
-
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -504,96 +354,6 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	return BLK_STS_OK;
 }
 
-static bool sd_zbc_need_zone_wp_update(struct request *rq)
-{
-	switch (req_op(rq)) {
-	case REQ_OP_ZONE_APPEND:
-	case REQ_OP_ZONE_FINISH:
-	case REQ_OP_ZONE_RESET:
-	case REQ_OP_ZONE_RESET_ALL:
-		return true;
-	case REQ_OP_WRITE:
-	case REQ_OP_WRITE_ZEROES:
-		return blk_rq_zone_is_seq(rq);
-	default:
-		return false;
-	}
-}
-
-/**
- * sd_zbc_zone_wp_update - Update cached zone write pointer upon cmd completion
- * @cmd: Completed command
- * @good_bytes: Command reply bytes
- *
- * Called from sd_zbc_complete() to handle the update of the cached zone write
- * pointer value in case an update is needed.
- */
-static unsigned int sd_zbc_zone_wp_update(struct scsi_cmnd *cmd,
-					  unsigned int good_bytes)
-{
-	int result = cmd->result;
-	struct request *rq = scsi_cmd_to_rq(cmd);
-	struct scsi_disk *sdkp = scsi_disk(rq->q->disk);
-	unsigned int zno = blk_rq_zone_no(rq);
-	enum req_op op = req_op(rq);
-	unsigned long flags;
-
-	/*
-	 * If we got an error for a command that needs updating the write
-	 * pointer offset cache, we must mark the zone wp offset entry as
-	 * invalid to force an update from disk the next time a zone append
-	 * command is issued.
-	 */
-	spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
-
-	if (result && op != REQ_OP_ZONE_RESET_ALL) {
-		if (op == REQ_OP_ZONE_APPEND) {
-			/* Force complete completion (no retry) */
-			good_bytes = 0;
-			scsi_set_resid(cmd, blk_rq_bytes(rq));
-		}
-
-		/*
-		 * Force an update of the zone write pointer offset on
-		 * the next zone append access.
-		 */
-		if (sdkp->zones_wp_offset[zno] != SD_ZBC_UPDATING_WP_OFST)
-			sdkp->zones_wp_offset[zno] = SD_ZBC_INVALID_WP_OFST;
-		goto unlock_wp_offset;
-	}
-
-	switch (op) {
-	case REQ_OP_ZONE_APPEND:
-		trace_scsi_zone_wp_update(cmd, rq->__sector,
-				  sdkp->zones_wp_offset[zno], good_bytes);
-		rq->__sector += sdkp->zones_wp_offset[zno];
-		fallthrough;
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-		if (sdkp->zones_wp_offset[zno] < sd_zbc_zone_sectors(sdkp))
-			sdkp->zones_wp_offset[zno] +=
-						good_bytes >> SECTOR_SHIFT;
-		break;
-	case REQ_OP_ZONE_RESET:
-		sdkp->zones_wp_offset[zno] = 0;
-		break;
-	case REQ_OP_ZONE_FINISH:
-		sdkp->zones_wp_offset[zno] = sd_zbc_zone_sectors(sdkp);
-		break;
-	case REQ_OP_ZONE_RESET_ALL:
-		memset(sdkp->zones_wp_offset, 0,
-		       sdkp->zone_info.nr_zones * sizeof(unsigned int));
-		break;
-	default:
-		break;
-	}
-
-unlock_wp_offset:
-	spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);
-
-	return good_bytes;
-}
-
 /**
  * sd_zbc_complete - ZBC command post processing.
  * @cmd: Completed command
@@ -619,11 +379,7 @@ unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		 * so be quiet about the error.
 		 */
 		rq->rq_flags |= RQF_QUIET;
-	} else if (sd_zbc_need_zone_wp_update(rq))
-		good_bytes = sd_zbc_zone_wp_update(cmd, good_bytes);
-
-	if (req_op(rq) == REQ_OP_ZONE_APPEND)
-		blk_req_zone_write_unlock(rq);
+	}
 
 	return good_bytes;
 }
@@ -780,46 +536,6 @@ static void sd_zbc_print_zones(struct scsi_disk *sdkp)
 			  sdkp->zone_info.zone_blocks);
 }
 
-static int sd_zbc_init_disk(struct scsi_disk *sdkp)
-{
-	sdkp->zones_wp_offset = NULL;
-	spin_lock_init(&sdkp->zones_wp_offset_lock);
-	sdkp->rev_wp_offset = NULL;
-	mutex_init(&sdkp->rev_mutex);
-	INIT_WORK(&sdkp->zone_wp_offset_work, sd_zbc_update_wp_offset_workfn);
-	sdkp->zone_wp_update_buf = kzalloc(SD_BUF_SIZE, GFP_KERNEL);
-	if (!sdkp->zone_wp_update_buf)
-		return -ENOMEM;
-
-	return 0;
-}
-
-void sd_zbc_free_zone_info(struct scsi_disk *sdkp)
-{
-	if (!sdkp->zone_wp_update_buf)
-		return;
-
-	/* Serialize against revalidate zones */
-	mutex_lock(&sdkp->rev_mutex);
-
-	kvfree(sdkp->zones_wp_offset);
-	sdkp->zones_wp_offset = NULL;
-	kfree(sdkp->zone_wp_update_buf);
-	sdkp->zone_wp_update_buf = NULL;
-
-	sdkp->early_zone_info = (struct zoned_disk_info){ };
-	sdkp->zone_info = (struct zoned_disk_info){ };
-
-	mutex_unlock(&sdkp->rev_mutex);
-}
-
-static void sd_zbc_revalidate_zones_cb(struct gendisk *disk)
-{
-	struct scsi_disk *sdkp = scsi_disk(disk);
-
-	swap(sdkp->zones_wp_offset, sdkp->rev_wp_offset);
-}
-
 /*
  * Call blk_revalidate_disk_zones() if any of the zoned disk properties have
  * changed that make it necessary to call that function. Called by
@@ -831,18 +547,8 @@ int sd_zbc_revalidate_zones(struct scsi_disk *sdkp)
 	struct request_queue *q = disk->queue;
 	u32 zone_blocks = sdkp->early_zone_info.zone_blocks;
 	unsigned int nr_zones = sdkp->early_zone_info.nr_zones;
-	int ret = 0;
 	unsigned int flags;
-
-	/*
-	 * For all zoned disks, initialize zone append emulation data if not
-	 * already done.
-	 */
-	if (sd_is_zoned(sdkp) && !sdkp->zone_wp_update_buf) {
-		ret = sd_zbc_init_disk(sdkp);
-		if (ret)
-			return ret;
-	}
+	int ret;
 
 	/*
 	 * There is nothing to do for regular disks, including host-aware disks
@@ -851,50 +557,32 @@ int sd_zbc_revalidate_zones(struct scsi_disk *sdkp)
 	if (!blk_queue_is_zoned(q))
 		return 0;
 
-	/*
-	 * Make sure revalidate zones are serialized to ensure exclusive
-	 * updates of the scsi disk data.
-	 */
-	mutex_lock(&sdkp->rev_mutex);
-
 	if (sdkp->zone_info.zone_blocks == zone_blocks &&
 	    sdkp->zone_info.nr_zones == nr_zones &&
 	    disk->nr_zones == nr_zones)
-		goto unlock;
+		return 0;
 
-	flags = memalloc_noio_save();
 	sdkp->zone_info.zone_blocks = zone_blocks;
 	sdkp->zone_info.nr_zones = nr_zones;
-	sdkp->rev_wp_offset = kvcalloc(nr_zones, sizeof(u32), GFP_KERNEL);
-	if (!sdkp->rev_wp_offset) {
-		ret = -ENOMEM;
-		memalloc_noio_restore(flags);
-		goto unlock;
-	}
 
 	blk_queue_chunk_sectors(q,
 			logical_to_sectors(sdkp->device, zone_blocks));
-	blk_queue_max_zone_append_sectors(q,
-			q->limits.max_segments << PAGE_SECTORS_SHIFT);
 
-	ret = blk_revalidate_disk_zones(disk, sd_zbc_revalidate_zones_cb);
+	/* Enable block layer zone append emulation */
+	blk_queue_max_zone_append_sectors(q, 0);
 
+	flags = memalloc_noio_save();
+	ret = blk_revalidate_disk_zones(disk, NULL);
 	memalloc_noio_restore(flags);
-	kvfree(sdkp->rev_wp_offset);
-	sdkp->rev_wp_offset = NULL;
-
 	if (ret) {
 		sdkp->zone_info = (struct zoned_disk_info){ };
 		sdkp->capacity = 0;
-		goto unlock;
+		return ret;
 	}
 
 	sd_zbc_print_zones(sdkp);
 
-unlock:
-	mutex_unlock(&sdkp->rev_mutex);
-
-	return ret;
+	return 0;
 }
 
 /**
@@ -917,10 +605,8 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE])
 	if (!sd_is_zoned(sdkp)) {
 		/*
 		 * Device managed or normal SCSI disk, no special handling
-		 * required. Nevertheless, free the disk zone information in
-		 * case the device type changed.
+		 * required.
 		 */
-		sd_zbc_free_zone_info(sdkp);
 		return 0;
 	}
 
@@ -941,7 +627,6 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE])
 
 	/* The drive satisfies the kernel restrictions: set it up */
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
-	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
 	if (sdkp->zones_max_open == U32_MAX)
 		disk_set_max_open_zones(disk, 0);
 	else
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 12/26] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (10 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 11/26] scsi: sd: " Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:31   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 13/26] null_blk: " Damien Le Moal
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

With zone write plugging enabled at the block layer level, any zoned
device can only ever see at most a single write operation per zone.
There is thus no need to request a block scheduler with strict per-zone
sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
feature. Removing this requirement allows using a zoned ublk device
with any scheduler, including "none".

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/block/ublk_drv.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 1dfb2e77898b..35fb9cc739eb 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -252,8 +252,6 @@ static int ublk_dev_param_zoned_apply(struct ublk_device *ub)
 
 	disk_set_zoned(ub->ub_disk);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, ub->ub_disk->queue);
-	blk_queue_required_elevator_features(ub->ub_disk->queue,
-					     ELEVATOR_F_ZBD_SEQ_WRITE);
 	disk_set_max_active_zones(ub->ub_disk, p->max_active_zones);
 	disk_set_max_open_zones(ub->ub_disk, p->max_open_zones);
 	blk_queue_max_zone_append_sectors(ub->ub_disk->queue, p->max_zone_append_sectors);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 13/26] null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (11 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 12/26] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:31   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 14/26] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
                   ` (15 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

With zone write plugging enabled at the block layer level, any zoned
device can only ever see at most a single write operation per zone.
There is thus no need to request a block scheduler with strict per-zone
sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
feature. Removing this requirement allows using a zoned null_blk device
with any scheduler, including "none".

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/block/null_blk/zoned.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 6f5e0994862e..f2cb6da0dd0d 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -161,7 +161,6 @@ int null_register_zoned_dev(struct nullb *nullb)
 
 	disk_set_zoned(nullb->disk);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
-	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
 	blk_queue_chunk_sectors(q, dev->zone_size_sects);
 	nullb->disk->nr_zones = bdev_nr_zones(nullb->disk->part0);
 	blk_queue_max_zone_append_sectors(q, dev->zone_size_sects);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 14/26] null_blk: Introduce zone_append_max_sectors attribute
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (12 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 13/26] null_blk: " Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:32   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 15/26] null_blk: Introduce fua attribute Damien Le Moal
                   ` (14 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Add the zone_append_max_sectors configfs attribute and module parameter
to allow configuring the maximum size, in 512B sectors, of zone append
operations. This attribute is meaningful only for zoned null block
devices.

If not specified, the default behavior is unchanged: the zoned device
max append sectors limit is set to the device max sectors limit. Any
non-zero value for this attribute (the default) enables native support
for zone append operations. A value of 0 disables native zone append
support, in which case the block layer emulation is used instead.

null_submit_bio() is modified to use blk_zone_write_plug_bio() to
handle zone append emulation when it is enabled.
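
For BIO-based drivers, the opt-in is the pattern sketched below (this
mirrors the null_submit_bio() hunk; queue_emulates_zone_append() is a
helper introduced earlier in this series, and the BIO is assumed to
have already been split to the queue limits):

	/*
	 * Sketch: hand write BIOs over to zone write plugging. A true
	 * return means the block layer took ownership of the BIO, so
	 * the driver must not process it any further.
	 */
	if (queue_emulates_zone_append(disk->queue) &&
	    blk_zone_write_plug_bio(bio, 0))
		return;
	/* ... normal BIO processing ... */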

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/block/null_blk/main.c     | 40 +++++++++++++++++++++----------
 drivers/block/null_blk/null_blk.h |  1 +
 drivers/block/null_blk/zoned.c    | 31 ++++++++++++++++++------
 3 files changed, 52 insertions(+), 20 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 514c2592046a..c294792fc451 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -241,6 +241,11 @@ static unsigned int g_zone_max_active;
 module_param_named(zone_max_active, g_zone_max_active, uint, 0444);
 MODULE_PARM_DESC(zone_max_active, "Maximum number of active zones when block device is zoned. Default: 0 (no limit)");
 
+static int g_zone_append_max_sectors = INT_MAX;
+module_param_named(zone_append_max_sectors, g_zone_append_max_sectors, int, 0444);
+MODULE_PARM_DESC(zone_append_max_sectors,
+		 "Maximum size of a zone append command (in 512B sectors). Specify 0 for zone append emulation");
+
 static struct nullb_device *null_alloc_dev(void);
 static void null_free_dev(struct nullb_device *dev);
 static void null_del_dev(struct nullb *nullb);
@@ -424,6 +429,7 @@ NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
 NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
+NULLB_DEVICE_ATTR(zone_append_max_sectors, uint, NULL);
 NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
 NULLB_DEVICE_ATTR(no_sched, bool, NULL);
 NULLB_DEVICE_ATTR(shared_tag_bitmap, bool, NULL);
@@ -567,6 +573,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_zone_nr_conv,
 	&nullb_device_attr_zone_max_open,
 	&nullb_device_attr_zone_max_active,
+	&nullb_device_attr_zone_append_max_sectors,
 	&nullb_device_attr_zone_readonly,
 	&nullb_device_attr_zone_offline,
 	&nullb_device_attr_virt_boundary,
@@ -656,7 +663,8 @@ static ssize_t memb_group_features_show(struct config_item *item, char *page)
 			"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
 			"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
 			"zone_capacity,zone_max_active,zone_max_open,"
-			"zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+			"zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+			"zone_append_max_sectors\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -736,6 +744,7 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->zone_nr_conv = g_zone_nr_conv;
 	dev->zone_max_open = g_zone_max_open;
 	dev->zone_max_active = g_zone_max_active;
+	dev->zone_append_max_sectors = g_zone_append_max_sectors;
 	dev->virt_boundary = g_virt_boundary;
 	dev->no_sched = g_no_sched;
 	dev->shared_tag_bitmap = g_shared_tag_bitmap;
@@ -1528,14 +1537,19 @@ static struct nullb_queue *nullb_to_queue(struct nullb *nullb)
 
 static void null_submit_bio(struct bio *bio)
 {
-	struct nullb_queue *nq =
-		nullb_to_queue(bio->bi_bdev->bd_disk->private_data);
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct nullb_queue *nq = nullb_to_queue(disk->private_data);
 
 	/* Respect the queue limits */
 	bio = bio_split_to_limits(bio);
 	if (!bio)
 		return;
 
+	/* Use zone write plugging to emulate zone append. */
+	if (queue_emulates_zone_append(disk->queue) &&
+	    blk_zone_write_plug_bio(bio, 0))
+		return;
+
 	null_handle_cmd(alloc_cmd(nq, bio), bio->bi_iter.bi_sector,
 			bio_sectors(bio), bio_op(bio));
 }
@@ -2168,12 +2182,6 @@ static int null_add_dev(struct nullb_device *dev)
 		blk_queue_write_cache(nullb->q, true, true);
 	}
 
-	if (dev->zoned) {
-		rv = null_init_zoned_dev(dev, nullb->q);
-		if (rv)
-			goto out_cleanup_disk;
-	}
-
 	nullb->q->queuedata = nullb;
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, nullb->q);
 
@@ -2181,7 +2189,7 @@ static int null_add_dev(struct nullb_device *dev)
 	rv = ida_alloc(&nullb_indexes, GFP_KERNEL);
 	if (rv < 0) {
 		mutex_unlock(&lock);
-		goto out_cleanup_zone;
+		goto out_cleanup_disk;
 	}
 	nullb->index = rv;
 	dev->index = rv;
@@ -2195,6 +2203,12 @@ static int null_add_dev(struct nullb_device *dev)
 	if (dev->virt_boundary)
 		blk_queue_virt_boundary(nullb->q, PAGE_SIZE - 1);
 
+	if (dev->zoned) {
+		rv = null_init_zoned_dev(dev, nullb->q);
+		if (rv)
+			goto out_ida_free;
+	}
+
 	null_config_discard(nullb);
 
 	if (config_item_name(&dev->group.cg_item)) {
@@ -2207,7 +2221,7 @@ static int null_add_dev(struct nullb_device *dev)
 
 	rv = null_gendisk_register(nullb);
 	if (rv)
-		goto out_ida_free;
+		goto out_cleanup_zone;
 
 	mutex_lock(&lock);
 	list_add_tail(&nullb->list, &nullb_list);
@@ -2217,10 +2231,10 @@ static int null_add_dev(struct nullb_device *dev)
 
 	return 0;
 
-out_ida_free:
-	ida_free(&nullb_indexes, nullb->index);
 out_cleanup_zone:
 	null_free_zoned_dev(dev);
+out_ida_free:
+	ida_free(&nullb_indexes, nullb->index);
 out_cleanup_disk:
 	put_disk(nullb->disk);
 out_cleanup_tags:
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 929f659dd255..8001e398a016 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -99,6 +99,7 @@ struct nullb_device {
 	unsigned int zone_nr_conv; /* number of conventional zones */
 	unsigned int zone_max_open; /* max number of open zones */
 	unsigned int zone_max_active; /* max number of active zones */
+	unsigned int zone_append_max_sectors; /* Max sectors per zone append command */
 	unsigned int submit_queues; /* number of submission queues */
 	unsigned int prev_submit_queues; /* number of submission queues before change */
 	unsigned int poll_queues; /* number of IOPOLL submission queues */
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index f2cb6da0dd0d..dd418b174e03 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -61,6 +61,7 @@ static inline void null_unlock_zone(struct nullb_device *dev,
 int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 {
 	sector_t dev_capacity_sects, zone_capacity_sects;
+	sector_t zone_append_max_bytes;
 	struct nullb_zone *zone;
 	sector_t sector = 0;
 	unsigned int i;
@@ -102,6 +103,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 			dev->zone_nr_conv);
 	}
 
+	dev->zone_append_max_sectors =
+		min(dev->zone_append_max_sectors, queue_max_sectors(q));
+	zone_append_max_bytes =
+		ALIGN_DOWN(dev->zone_append_max_sectors << SECTOR_SHIFT,
+			   dev->blocksize);
+	dev->zone_append_max_sectors =
+		min(zone_append_max_bytes >> SECTOR_SHIFT, zone_capacity_sects);
+
 	/* Max active zones has to be < nbr of seq zones in order to be enforceable */
 	if (dev->zone_max_active >= dev->nr_zones - dev->zone_nr_conv) {
 		dev->zone_max_active = 0;
@@ -158,17 +167,22 @@ int null_register_zoned_dev(struct nullb *nullb)
 {
 	struct nullb_device *dev = nullb->dev;
 	struct request_queue *q = nullb->q;
+	struct gendisk *disk = nullb->disk;
 
-	disk_set_zoned(nullb->disk);
+	disk_set_zoned(disk);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
 	blk_queue_chunk_sectors(q, dev->zone_size_sects);
-	nullb->disk->nr_zones = bdev_nr_zones(nullb->disk->part0);
-	blk_queue_max_zone_append_sectors(q, dev->zone_size_sects);
-	disk_set_max_open_zones(nullb->disk, dev->zone_max_open);
-	disk_set_max_active_zones(nullb->disk, dev->zone_max_active);
+	disk->nr_zones = bdev_nr_zones(disk->part0);
+	blk_queue_max_zone_append_sectors(q, dev->zone_append_max_sectors);
+	disk_set_max_open_zones(disk, dev->zone_max_open);
+	disk_set_max_active_zones(disk, dev->zone_max_active);
+
+	pr_info("%s: using %s zone append\n",
+		disk->disk_name,
+		queue_emulates_zone_append(q) ? "emulated" : "native");
 
-	if (queue_is_mq(q))
-		return blk_revalidate_disk_zones(nullb->disk, NULL);
+	if (queue_is_mq(q) || queue_emulates_zone_append(q))
+		return blk_revalidate_disk_zones(disk, NULL);
 
 	return 0;
 }
@@ -369,6 +383,9 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 
 	trace_nullb_zone_op(cmd, zno, zone->cond);
 
+	if (WARN_ON_ONCE(append && !dev->zone_append_max_sectors))
+		return BLK_STS_IOERR;
+
 	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) {
 		if (append)
 			return BLK_STS_IOERR;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 15/26] null_blk: Introduce fua attribute
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (13 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 14/26] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:33   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 16/26] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Add the fua configfs attribute and module parameter to allow
configuring whether or not the device supports FUA. This attribute has
an effect on the null_blk device only if memory backing is enabled
together with a write cache (cache_size option).

This new attribute allows configuring a null_blk device with a write
cache but without FUA support, which is convenient for testing the
block layer flush machinery.
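
The device-facing difference reduces to the write cache registration
sketched below (the actual change is in the null_add_dev() hunk). With
fua set to false, the block layer flush machinery completes a REQ_FUA
write using a regular write followed by a post-flush:

	/*
	 * Sketch: register a volatile write cache, with FUA support
	 * only if the new attribute is set. Without FUA, REQ_FUA
	 * writes are emulated with a post-flush.
	 */
	blk_queue_write_cache(q, true /* write cache */, dev->fua);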

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/block/null_blk/main.c     | 12 ++++++++++--
 drivers/block/null_blk/null_blk.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index c294792fc451..08ff8af67d76 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -213,6 +213,10 @@ static unsigned long g_cache_size;
 module_param_named(cache_size, g_cache_size, ulong, 0444);
 MODULE_PARM_DESC(mbps, "Cache size in MiB for memory-backed device. Default: 0 (none)");
 
+static bool g_fua = true;
+module_param_named(fua, g_fua, bool, 0444);
+MODULE_PARM_DESC(fua, "Enable/disable FUA support when cache_size is used. Default: true");
+
 static unsigned int g_mbps;
 module_param_named(mbps, g_mbps, uint, 0444);
 MODULE_PARM_DESC(mbps, "Limit maximum bandwidth (in MiB/s). Default: 0 (no limit)");
@@ -433,6 +437,7 @@ NULLB_DEVICE_ATTR(zone_append_max_sectors, uint, NULL);
 NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
 NULLB_DEVICE_ATTR(no_sched, bool, NULL);
 NULLB_DEVICE_ATTR(shared_tag_bitmap, bool, NULL);
+NULLB_DEVICE_ATTR(fua, bool, NULL);
 
 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
 {
@@ -579,6 +584,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_virt_boundary,
 	&nullb_device_attr_no_sched,
 	&nullb_device_attr_shared_tag_bitmap,
+	&nullb_device_attr_fua,
 	NULL,
 };
 
@@ -657,7 +663,7 @@ nullb_group_drop_item(struct config_group *group, struct config_item *item)
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
 	return snprintf(page, PAGE_SIZE,
-			"badblocks,blocking,blocksize,cache_size,"
+			"badblocks,blocking,blocksize,cache_size,fua,"
 			"completion_nsec,discard,home_node,hw_queue_depth,"
 			"irqmode,max_sectors,mbps,memory_backed,no_sched,"
 			"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
@@ -748,6 +754,8 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->virt_boundary = g_virt_boundary;
 	dev->no_sched = g_no_sched;
 	dev->shared_tag_bitmap = g_shared_tag_bitmap;
+	dev->fua = g_fua;
+
 	return dev;
 }
 
@@ -2179,7 +2187,7 @@ static int null_add_dev(struct nullb_device *dev)
 
 	if (dev->cache_size > 0) {
 		set_bit(NULLB_DEV_FL_CACHE, &nullb->dev->flags);
-		blk_queue_write_cache(nullb->q, true, true);
+		blk_queue_write_cache(nullb->q, true, dev->fua);
 	}
 
 	nullb->q->queuedata = nullb;
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 8001e398a016..5d58b204944e 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -121,6 +121,7 @@ struct nullb_device {
 	bool virt_boundary; /* virtual boundary on/off for the device */
 	bool no_sched; /* no IO scheduler for the device */
 	bool shared_tag_bitmap; /* use hostwide shared tags */
+	bool fua; /* Support FUA */
 };
 
 struct nullb {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 16/26] nvmet: zns: Do not reference the gendisk conv_zones_bitmap
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (14 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 15/26] null_blk: Introduce fua attribute Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:34   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 17/26] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
                   ` (12 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

The gendisk conventional zone bitmap is going away. So, to check for
the presence of conventional zones on a zoned target device, always use
report zones.
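
For context, the check relies on a report zones callback of the form
sketched below (validate_conv_zones_cb already exists in zns.c and is
not part of this hunk; the error code here is illustrative):

	/* Sketch: fail the zone report if a conventional zone is found. */
	static int validate_conv_zones_cb(struct blk_zone *z,
					  unsigned int idx, void *data)
	{
		if (z->type == BLK_ZONE_TYPE_CONVENTIONAL)
			return -EOPNOTSUPP;	/* illustrative errno */
		return 0;
	}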

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/nvme/target/zns.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index 5b5c1e481722..f95f0040e108 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -52,14 +52,10 @@ bool nvmet_bdev_zns_enable(struct nvmet_ns *ns)
 	if (get_capacity(bd_disk) & (bdev_zone_sectors(ns->bdev) - 1))
 		return false;
 	/*
-	 * ZNS does not define a conventional zone type. If the underlying
-	 * device has a bitmap set indicating the existence of conventional
-	 * zones, reject the device. Otherwise, use report zones to detect if
-	 * the device has conventional zones.
+	 * ZNS does not define a conventional zone type. Use report zones
+	 * to detect if the device has conventional zones and reject it if
+	 * it does.
 	 */
-	if (ns->bdev->bd_disk->conv_zones_bitmap)
-		return false;
-
 	ret = blkdev_report_zones(ns->bdev, 0, bdev_nr_zones(ns->bdev),
 				  validate_conv_zones_cb, NULL);
 	if (ret < 0)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 17/26] block: Remove BLK_STS_ZONE_RESOURCE
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (15 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 16/26] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:34   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 18/26] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
                   ` (11 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

The zone append emulation of the scsi disk driver was the only user of
BLK_STS_ZONE_RESOURCE. With this code removed, BLK_STS_ZONE_RESOURCE is
now unused. Remove this macro definition and simplify
blk_mq_dispatch_rq_list() where this status code was handled. The
BLK_STS_* values defined after it are renumbered to fill the gap; this
is safe as blk_status_t values are internal to the kernel and not part
of the user API.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq.c            | 26 --------------------------
 drivers/scsi/scsi_lib.c   |  1 -
 include/linux/blk_types.h | 20 ++++----------------
 3 files changed, 4 insertions(+), 43 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a112298a6541..8576940f8674 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1940,19 +1940,6 @@ static void blk_mq_handle_dev_resource(struct request *rq,
 	__blk_mq_requeue_request(rq);
 }
 
-static void blk_mq_handle_zone_resource(struct request *rq,
-					struct list_head *zone_list)
-{
-	/*
-	 * If we end up here it is because we cannot dispatch a request to a
-	 * specific zone due to LLD level zone-write locking or other zone
-	 * related resource not being available. In this case, set the request
-	 * aside in zone_list for retrying it later.
-	 */
-	list_add(&rq->queuelist, zone_list);
-	__blk_mq_requeue_request(rq);
-}
-
 enum prep_dispatch {
 	PREP_DISPATCH_OK,
 	PREP_DISPATCH_NO_TAG,
@@ -2038,7 +2025,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 	struct request *rq;
 	int queued;
 	blk_status_t ret = BLK_STS_OK;
-	LIST_HEAD(zone_list);
 	bool needs_resource = false;
 
 	if (list_empty(list))
@@ -2080,23 +2066,11 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 		case BLK_STS_DEV_RESOURCE:
 			blk_mq_handle_dev_resource(rq, list);
 			goto out;
-		case BLK_STS_ZONE_RESOURCE:
-			/*
-			 * Move the request to zone_list and keep going through
-			 * the dispatch list to find more requests the drive can
-			 * accept.
-			 */
-			blk_mq_handle_zone_resource(rq, &zone_list);
-			needs_resource = true;
-			break;
 		default:
 			blk_mq_end_request(rq, ret);
 		}
 	} while (!list_empty(list));
 out:
-	if (!list_empty(&zone_list))
-		list_splice_tail_init(&zone_list, list);
-
 	/* If we didn't flush the entire list, we could have told the driver
 	 * there was more coming, but that turned out to be a lie.
 	 */
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index cf3864f72093..80e692149e2c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1776,7 +1776,6 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-	case BLK_STS_ZONE_RESOURCE:
 		if (scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 5c5343099800..fd0dc2d08924 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -135,18 +135,6 @@ typedef u16 blk_short_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
-/*
- * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
- * related resources are unavailable, but the driver can guarantee the queue
- * will be rerun in the future once the resources become available again.
- *
- * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
- * a zone specific resource and IO to a different zone on the same device could
- * still be served. Examples of that are zones that are write-locked, but a read
- * to the same zone could be served.
- */
-#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
-
 /*
  * BLK_STS_ZONE_OPEN_RESOURCE is returned from the driver in the completion
  * path if the device returns a status indicating that too many zone resources
@@ -154,7 +142,7 @@ typedef u16 blk_short_t;
  * after the number of open zones decreases below the device's limits, which is
  * reported in the request_queue's max_open_zones.
  */
-#define BLK_STS_ZONE_OPEN_RESOURCE	((__force blk_status_t)15)
+#define BLK_STS_ZONE_OPEN_RESOURCE	((__force blk_status_t)14)
 
 /*
  * BLK_STS_ZONE_ACTIVE_RESOURCE is returned from the driver in the completion
@@ -163,20 +151,20 @@ typedef u16 blk_short_t;
  * after the number of active zones decreases below the device's limits, which
  * is reported in the request_queue's max_active_zones.
  */
-#define BLK_STS_ZONE_ACTIVE_RESOURCE	((__force blk_status_t)16)
+#define BLK_STS_ZONE_ACTIVE_RESOURCE	((__force blk_status_t)15)
 
 /*
  * BLK_STS_OFFLINE is returned from the driver when the target device is offline
  * or is being taken offline. This could help differentiate the case where a
  * device is intentionally being shut down from a real I/O error.
  */
-#define BLK_STS_OFFLINE		((__force blk_status_t)17)
+#define BLK_STS_OFFLINE		((__force blk_status_t)16)
 
 /*
  * BLK_STS_DURATION_LIMIT is returned from the driver when the target device
  * aborted the command because it exceeded one of its Command Duration Limits.
  */
-#define BLK_STS_DURATION_LIMIT	((__force blk_status_t)18)
+#define BLK_STS_DURATION_LIMIT	((__force blk_status_t)17)
 
 /**
  * blk_path_error - returns true if error may be path related
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 18/26] block: Simplify blk_revalidate_disk_zones() interface
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (16 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 17/26] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:35   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 19/26] block: mq-deadline: Remove support for zone write locking Damien Le Moal
                   ` (10 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

The only user of the second argument of blk_revalidate_disk_zones()
was the SCSI disk driver (sd). Now that this driver no longer requires
the update_driver_data callback, remove the argument to simplify the
interface of blk_revalidate_disk_zones(). Also update the function
kernel-doc comment to be more accurate (i.e. there is no gendisk
->revalidate method).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c              | 16 +++++-----------
 drivers/block/null_blk/zoned.c |  2 +-
 drivers/block/ublk_drv.c       |  2 +-
 drivers/block/virtio_blk.c     |  2 +-
 drivers/md/dm-zone.c           |  2 +-
 drivers/nvme/host/zns.c        |  2 +-
 drivers/scsi/sd_zbc.c          |  2 +-
 include/linux/blkdev.h         |  3 +--
 8 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 8bf6821735f3..3dadf37ad787 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1309,21 +1309,17 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 /**
  * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
  * @disk:	Target disk
- * @update_driver_data:	Callback to update driver data on the frozen disk
  *
- * Helper function for low-level device drivers to check and (re) allocate and
- * initialize a disk request queue zone bitmaps. This functions should normally
- * be called within the disk ->revalidate method for blk-mq based drivers.
+ * Helper function for low-level device drivers to check, (re) allocate and
+ * initialize resources used for managing zoned disks. This function should
+ * normally be called by blk-mq based drivers when a zoned gendisk is probed
+ * and when the zone configuration of the gendisk changes (e.g. after a format).
  * Before calling this function, the device driver must already have set the
  * device zone size (chunk_sector limit) and the max zone append limit.
  * BIO based drivers can also use this function as long as the device queue
  * can be safely frozen.
- * If the @update_driver_data callback function is not NULL, the callback is
- * executed with the device request queue frozen after all zones have been
- * checked.
  */
-int blk_revalidate_disk_zones(struct gendisk *disk,
-			      void (*update_driver_data)(struct gendisk *disk))
+int blk_revalidate_disk_zones(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 	sector_t zone_sectors = q->limits.chunk_sectors;
@@ -1403,8 +1399,6 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
 		swap(disk->zone_wplugs, args.zone_wplugs);
-		if (update_driver_data)
-			update_driver_data(disk);
 		mutex_unlock(&disk->zone_wplugs_mutex);
 		ret = 0;
 	} else {
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index dd418b174e03..15fc325d8134 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -182,7 +182,7 @@ int null_register_zoned_dev(struct nullb *nullb)
 		queue_emulates_zone_append(q) ? "emulated" : "native");
 
 	if (queue_is_mq(q) || queue_emulates_zone_append(q))
-		return blk_revalidate_disk_zones(disk, NULL);
+		return blk_revalidate_disk_zones(disk);
 
 	return 0;
 }
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 35fb9cc739eb..daa0b5f5788c 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -221,7 +221,7 @@ static int ublk_get_nr_zones(const struct ublk_device *ub)
 
 static int ublk_revalidate_disk_zones(struct ublk_device *ub)
 {
-	return blk_revalidate_disk_zones(ub->ub_disk, NULL);
+	return blk_revalidate_disk_zones(ub->ub_disk);
 }
 
 static int ublk_dev_param_zoned_validate(const struct ublk_device *ub)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5bf98fd6a651..6c2b167ca136 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -788,7 +788,7 @@ static int virtblk_probe_zoned_device(struct virtio_device *vdev,
 	blk_queue_max_zone_append_sectors(q, v);
 	dev_dbg(&vdev->dev, "max append sectors = %u\n", v);
 
-	return blk_revalidate_disk_zones(vblk->disk, NULL);
+	return blk_revalidate_disk_zones(vblk->disk);
 }
 
 #else
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 570b44b924b8..b2ce19a78bbb 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -169,7 +169,7 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
 	 * our table for dm_blk_report_zones() to use directly.
 	 */
 	md->zone_revalidate_map = t;
-	ret = blk_revalidate_disk_zones(disk, NULL);
+	ret = blk_revalidate_disk_zones(disk);
 	md->zone_revalidate_map = NULL;
 
 	if (ret) {
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 499bbb0eee8d..c02658af6c34 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -14,7 +14,7 @@ int nvme_revalidate_zones(struct nvme_ns *ns)
 	blk_queue_chunk_sectors(q, ns->head->zsze);
 	blk_queue_max_zone_append_sectors(q, ns->ctrl->max_zone_append);
 
-	return blk_revalidate_disk_zones(ns->disk, NULL);
+	return blk_revalidate_disk_zones(ns->disk);
 }
 
 static int nvme_set_max_append(struct nvme_ctrl *ctrl)
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d0ead9858954..806036e48abe 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -572,7 +572,7 @@ int sd_zbc_revalidate_zones(struct scsi_disk *sdkp)
 	blk_queue_max_zone_append_sectors(q, 0);
 
 	flags = memalloc_noio_save();
-	ret = blk_revalidate_disk_zones(disk, NULL);
+	ret = blk_revalidate_disk_zones(disk);
 	memalloc_noio_restore(flags);
 	if (ret) {
 		sdkp->zone_info = (struct zoned_disk_info){ };
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e619e10847bd..a39ebf075d26 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -331,8 +331,7 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
 int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
 		sector_t sectors, sector_t nr_sectors, gfp_t gfp_mask);
-int blk_revalidate_disk_zones(struct gendisk *disk,
-		void (*update_driver_data)(struct gendisk *disk));
+int blk_revalidate_disk_zones(struct gendisk *disk);
 
 /*
  * Independent access ranges: struct blk_independent_access_range describes
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 19/26] block: mq-deadline: Remove support for zone write locking
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (17 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 18/26] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:36   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 20/26] block: Remove elevator required features Damien Le Moal
                   ` (9 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

With the block layer now generically plugging write operations for zoned
block devices, mq-deadline, or any other scheduler, can see at most one
write operation per zone at any time. There are thus no sequentiality
requirements for these writes and no need to tightly control the
dispatching of write requests using zone write locking.

Remove all the code that implements this control in the mq-deadline
scheduler and stop advertising support for the
ELEVATOR_F_ZBD_SEQ_WRITE elevator feature.
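
For context, here is a minimal sketch of the submission-side rule that
establishes this invariant, using the zone write plug flag and BIO list
introduced earlier in this series (the helper name is illustrative, not
the actual implementation):

  /*
   * Sketch: a write BIO is either issued, marking its zone plugged, or
   * queued behind the write already in flight for that zone. A
   * scheduler thus never sees more than one write per zone.
   */
  static bool blk_zone_wplug_submit_or_delay(struct blk_zone_wplug *zwplug,
                                             struct bio *bio)
  {
          if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
                  /* A write is in flight: delay this BIO. */
                  bio_list_add(&zwplug->bio_list, bio);
                  return false;
          }
          /* No write in flight: plug the zone and issue now. */
          zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
          return true;
  }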

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/mq-deadline.c | 176 ++------------------------------------------
 1 file changed, 6 insertions(+), 170 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 1b0de4fc3958..42d967849bec 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -90,7 +90,6 @@ struct deadline_data {
 	struct {
 		spinlock_t lock;
 		spinlock_t insert_lock;
-		spinlock_t zone_lock;
 	} ____cacheline_aligned_in_smp;
 
 	unsigned long run_state;
@@ -171,8 +170,7 @@ deadline_latter_request(struct request *rq)
 }
 
 /*
- * Return the first request for which blk_rq_pos() >= @pos. For zoned devices,
- * return the first request after the start of the zone containing @pos.
+ * Return the first request for which blk_rq_pos() >= @pos.
  */
 static inline struct request *deadline_from_pos(struct dd_per_prio *per_prio,
 				enum dd_data_dir data_dir, sector_t pos)
@@ -184,14 +182,6 @@ static inline struct request *deadline_from_pos(struct dd_per_prio *per_prio,
 		return NULL;
 
 	rq = rb_entry_rq(node);
-	/*
-	 * A zoned write may have been requeued with a starting position that
-	 * is below that of the most recently dispatched request. Hence, for
-	 * zoned writes, start searching from the start of a zone.
-	 */
-	if (blk_rq_is_seq_zoned_write(rq))
-		pos = round_down(pos, rq->q->limits.chunk_sectors);
-
 	while (node) {
 		rq = rb_entry_rq(node);
 		if (blk_rq_pos(rq) >= pos) {
@@ -322,36 +312,6 @@ static inline bool deadline_check_fifo(struct dd_per_prio *per_prio,
 	return time_is_before_eq_jiffies((unsigned long)rq->fifo_time);
 }
 
-/*
- * Check if rq has a sequential request preceding it.
- */
-static bool deadline_is_seq_write(struct deadline_data *dd, struct request *rq)
-{
-	struct request *prev = deadline_earlier_request(rq);
-
-	if (!prev)
-		return false;
-
-	return blk_rq_pos(prev) + blk_rq_sectors(prev) == blk_rq_pos(rq);
-}
-
-/*
- * Skip all write requests that are sequential from @rq, even if we cross
- * a zone boundary.
- */
-static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
-						struct request *rq)
-{
-	sector_t pos = blk_rq_pos(rq);
-
-	do {
-		pos += blk_rq_sectors(rq);
-		rq = deadline_latter_request(rq);
-	} while (rq && blk_rq_pos(rq) == pos);
-
-	return rq;
-}
-
 /*
  * For the specified data direction, return the next request to
  * dispatch using arrival ordered lists.
@@ -360,40 +320,10 @@ static struct request *
 deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		      enum dd_data_dir data_dir)
 {
-	struct request *rq, *rb_rq, *next;
-	unsigned long flags;
-
 	if (list_empty(&per_prio->fifo_list[data_dir]))
 		return NULL;
 
-	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
-		return rq;
-
-	/*
-	 * Look for a write request that can be dispatched, that is one with
-	 * an unlocked target zone. For some HDDs, breaking a sequential
-	 * write stream can lead to lower throughput, so make sure to preserve
-	 * sequential write streams, even if that stream crosses into the next
-	 * zones and these zones are unlocked.
-	 */
-	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry_safe(rq, next, &per_prio->fifo_list[DD_WRITE],
-				 queuelist) {
-		/* Check whether a prior request exists for the same zone. */
-		rb_rq = deadline_from_pos(per_prio, data_dir, blk_rq_pos(rq));
-		if (rb_rq && blk_rq_pos(rb_rq) < blk_rq_pos(rq))
-			rq = rb_rq;
-		if (blk_req_can_dispatch_to_zone(rq) &&
-		    (blk_queue_nonrot(rq->q) ||
-		     !deadline_is_seq_write(dd, rq)))
-			goto out;
-	}
-	rq = NULL;
-out:
-	spin_unlock_irqrestore(&dd->zone_lock, flags);
-
-	return rq;
+	return rq_entry_fifo(per_prio->fifo_list[data_dir].next);
 }
 
 /*
@@ -404,36 +334,8 @@ static struct request *
 deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		      enum dd_data_dir data_dir)
 {
-	struct request *rq;
-	unsigned long flags;
-
-	rq = deadline_from_pos(per_prio, data_dir,
-			       per_prio->latest_pos[data_dir]);
-	if (!rq)
-		return NULL;
-
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
-		return rq;
-
-	/*
-	 * Look for a write request that can be dispatched, that is one with
-	 * an unlocked target zone. For some HDDs, breaking a sequential
-	 * write stream can lead to lower throughput, so make sure to preserve
-	 * sequential write streams, even if that stream crosses into the next
-	 * zones and these zones are unlocked.
-	 */
-	spin_lock_irqsave(&dd->zone_lock, flags);
-	while (rq) {
-		if (blk_req_can_dispatch_to_zone(rq))
-			break;
-		if (blk_queue_nonrot(rq->q))
-			rq = deadline_latter_request(rq);
-		else
-			rq = deadline_skip_seq_writes(dd, rq);
-	}
-	spin_unlock_irqrestore(&dd->zone_lock, flags);
-
-	return rq;
+	return deadline_from_pos(per_prio, data_dir,
+				 per_prio->latest_pos[data_dir]);
 }
 
 /*
@@ -539,10 +441,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 		rq = next_rq;
 	}
 
-	/*
-	 * For a zoned block device, if we only have writes queued and none of
-	 * them can be dispatched, rq will be NULL.
-	 */
 	if (!rq)
 		return NULL;
 
@@ -563,10 +461,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd->per_prio[prio].latest_pos[data_dir] = blk_rq_pos(rq);
 	dd->per_prio[prio].stats.dispatched++;
-	/*
-	 * If the request needs its target zone locked, do it.
-	 */
-	blk_req_zone_write_lock(rq);
 	rq->rq_flags |= RQF_STARTED;
 	return rq;
 }
@@ -766,7 +660,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->insert_lock);
-	spin_lock_init(&dd->zone_lock);
 
 	INIT_LIST_HEAD(&dd->at_head);
 	INIT_LIST_HEAD(&dd->at_tail);
@@ -879,12 +772,6 @@ static void dd_insert_request(struct request_queue *q, struct request *rq,
 
 	lockdep_assert_held(&dd->lock);
 
-	/*
-	 * This may be a requeue of a write request that has locked its
-	 * target zone. If it is the case, this releases the zone lock.
-	 */
-	blk_req_zone_write_unlock(rq);
-
 	prio = ioprio_class_to_prio[ioprio_class];
 	per_prio = &dd->per_prio[prio];
 	if (!rq->elv.priv[0]) {
@@ -916,18 +803,6 @@ static void dd_insert_request(struct request_queue *q, struct request *rq,
 		 */
 		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
 		insert_before = &per_prio->fifo_list[data_dir];
-#ifdef CONFIG_BLK_DEV_ZONED
-		/*
-		 * Insert zoned writes such that requests are sorted by
-		 * position per zone.
-		 */
-		if (blk_rq_is_seq_zoned_write(rq)) {
-			struct request *rq2 = deadline_latter_request(rq);
-
-			if (rq2 && blk_rq_zone_no(rq2) == blk_rq_zone_no(rq))
-				insert_before = &rq2->queuelist;
-		}
-#endif
 		list_add_tail(&rq->queuelist, insert_before);
 	}
 }
@@ -956,33 +831,8 @@ static void dd_prepare_request(struct request *rq)
 	rq->elv.priv[0] = NULL;
 }
 
-static bool dd_has_write_work(struct blk_mq_hw_ctx *hctx)
-{
-	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
-	enum dd_prio p;
-
-	for (p = 0; p <= DD_PRIO_MAX; p++)
-		if (!list_empty_careful(&dd->per_prio[p].fifo_list[DD_WRITE]))
-			return true;
-
-	return false;
-}
-
 /*
  * Callback from inside blk_mq_free_request().
- *
- * For zoned block devices, write unlock the target zone of
- * completed write requests. Do this while holding the zone lock
- * spinlock so that the zone is never unlocked while deadline_fifo_request()
- * or deadline_next_request() are executing. This function is called for
- * all requests, whether or not these requests complete successfully.
- *
- * For a zoned block device, __dd_dispatch_request() may have stopped
- * dispatching requests if all the queued requests are write requests directed
- * at zones that are already locked due to on-going write requests. To ensure
- * write request dispatch progress in this case, mark the queue as needing a
- * restart to ensure that the queue is run again after completion of the
- * request and zones being unlocked.
  */
 static void dd_finish_request(struct request *rq)
 {
@@ -997,21 +847,8 @@ static void dd_finish_request(struct request *rq)
 	 * called dd_insert_requests(). Skip requests that bypassed I/O
 	 * scheduling. See also blk_mq_request_bypass_insert().
 	 */
-	if (!rq->elv.priv[0])
-		return;
-
-	atomic_inc(&per_prio->stats.completed);
-
-	if (blk_queue_is_zoned(q)) {
-		unsigned long flags;
-
-		spin_lock_irqsave(&dd->zone_lock, flags);
-		blk_req_zone_write_unlock(rq);
-		spin_unlock_irqrestore(&dd->zone_lock, flags);
-
-		if (dd_has_write_work(rq->mq_hctx))
-			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
-	}
+	if (rq->elv.priv[0])
+		atomic_inc(&per_prio->stats.completed);
 }
 
 static bool dd_has_work_for_prio(struct dd_per_prio *per_prio)
@@ -1339,7 +1176,6 @@ static struct elevator_type mq_deadline = {
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "mq-deadline",
 	.elevator_alias = "deadline",
-	.elevator_features = ELEVATOR_F_ZBD_SEQ_WRITE,
 	.elevator_owner = THIS_MODULE,
 };
 MODULE_ALIAS("mq-deadline-iosched");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 20/26] block: Remove elevator required features
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (18 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 19/26] block: mq-deadline: Remove support for zone write locking Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:36   ` Hannes Reinecke
  2024-02-02  7:30 ` [PATCH 21/26] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
                   ` (8 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

The only elevator feature ever implemented is ELEVATOR_F_ZBD_SEQ_WRITE
for signaling that a scheduler implements zone write locking to tightly
control the dispatching order of write operations to zoned block
devices. With zone write locking support removed from mq-deadline and
all block device drivers now relying on block layer zone write plugging
to control the ordering of write operations to zones, the elevator
feature ELEVATOR_F_ZBD_SEQ_WRITE is completely unused.
Remove it, and also remove the now unused code for filtering the
possible schedulers for a block device based on required features.
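
For reference, the filtering being removed was a simple bitmask
inclusion test; a worked example using the only feature bit ever
defined:

  /* Values taken from the code being removed: */
  #define ELEVATOR_F_ZBD_SEQ_WRITE        (1U << 0)

  /*
   * A zoned queue set required_elevator_features = 0x1, so
   * elv_support_features() evaluated to:
   *
   *   mq-deadline (features = 0x1): (0x1 & 0x1) == 0x1 -> selectable
   *   bfq, kyber  (features = 0x0): (0x1 & 0x0) != 0x1 -> filtered out
   */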

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-settings.c   | 16 ---------------
 block/elevator.c       | 46 +++++-------------------------------------
 block/elevator.h       |  1 -
 include/linux/blkdev.h | 10 ---------
 4 files changed, 5 insertions(+), 68 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index f00bcb595444..b4537237fc5f 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -857,22 +857,6 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
-/**
- * blk_queue_required_elevator_features - Set a queue required elevator features
- * @q:		the request queue for the target device
- * @features:	Required elevator features OR'ed together
- *
- * Tell the block layer that for the device controlled through @q, only the
- * only elevators that can be used are those that implement at least the set of
- * features specified by @features.
- */
-void blk_queue_required_elevator_features(struct request_queue *q,
-					  unsigned int features)
-{
-	q->required_elevator_features = features;
-}
-EXPORT_SYMBOL_GPL(blk_queue_required_elevator_features);
-
 /**
  * blk_queue_can_use_dma_map_merging - configure queue for merging segments.
  * @q:		the request queue for the device
diff --git a/block/elevator.c b/block/elevator.c
index 5ff093cb3cf8..f64ebd726e58 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -83,13 +83,6 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio)
 }
 EXPORT_SYMBOL(elv_bio_merge_ok);
 
-static inline bool elv_support_features(struct request_queue *q,
-		const struct elevator_type *e)
-{
-	return (q->required_elevator_features & e->elevator_features) ==
-		q->required_elevator_features;
-}
-
 /**
  * elevator_match - Check whether @e's name or alias matches @name
  * @e: Scheduler to test
@@ -120,7 +113,7 @@ static struct elevator_type *elevator_find_get(struct request_queue *q,
 
 	spin_lock(&elv_list_lock);
 	e = __elevator_find(name);
-	if (e && (!elv_support_features(q, e) || !elevator_tryget(e)))
+	if (e && (!elevator_tryget(e)))
 		e = NULL;
 	spin_unlock(&elv_list_lock);
 	return e;
@@ -580,34 +573,8 @@ static struct elevator_type *elevator_get_default(struct request_queue *q)
 }
 
 /*
- * Get the first elevator providing the features required by the request queue.
- * Default to "none" if no matching elevator is found.
- */
-static struct elevator_type *elevator_get_by_features(struct request_queue *q)
-{
-	struct elevator_type *e, *found = NULL;
-
-	spin_lock(&elv_list_lock);
-
-	list_for_each_entry(e, &elv_list, list) {
-		if (elv_support_features(q, e)) {
-			found = e;
-			break;
-		}
-	}
-
-	if (found && !elevator_tryget(found))
-		found = NULL;
-
-	spin_unlock(&elv_list_lock);
-	return found;
-}
-
-/*
- * For a device queue that has no required features, use the default elevator
- * settings. Otherwise, use the first elevator available matching the required
- * features. If no suitable elevator is find or if the chosen elevator
- * initialization fails, fall back to the "none" elevator (no elevator).
+ * Use the default elevator settings. If the chosen elevator initialization
+ * fails, fall back to the "none" elevator (no elevator).
  */
 void elevator_init_mq(struct request_queue *q)
 {
@@ -622,10 +589,7 @@ void elevator_init_mq(struct request_queue *q)
 	if (unlikely(q->elevator))
 		return;
 
-	if (!q->required_elevator_features)
-		e = elevator_get_default(q);
-	else
-		e = elevator_get_by_features(q);
+	e = elevator_get_default(q);
 	if (!e)
 		return;
 
@@ -781,7 +745,7 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
 	list_for_each_entry(e, &elv_list, list) {
 		if (e == cur)
 			len += sprintf(name+len, "[%s] ", e->elevator_name);
-		else if (elv_support_features(q, e))
+		else
 			len += sprintf(name+len, "%s ", e->elevator_name);
 	}
 	spin_unlock(&elv_list_lock);
diff --git a/block/elevator.h b/block/elevator.h
index 7ca3d7b6ed82..e9a050a96e53 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -74,7 +74,6 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	const char *elevator_name;
 	const char *elevator_alias;
-	const unsigned int elevator_features;
 	struct module *elevator_owner;
 #ifdef CONFIG_BLK_DEBUG_FS
 	const struct blk_mq_debugfs_attr *queue_debugfs_attrs;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a39ebf075d26..cb90a59d35cb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -448,8 +448,6 @@ struct request_queue {
 
 	atomic_t		nr_active_requests_shared_tags;
 
-	unsigned int		required_elevator_features;
-
 	struct blk_mq_tags	*sched_shared_tags;
 
 	struct list_head	icq_list;
@@ -925,14 +923,6 @@ disk_alloc_independent_access_ranges(struct gendisk *disk, int nr_ia_ranges);
 void disk_set_independent_access_ranges(struct gendisk *disk,
 				struct blk_independent_access_ranges *iars);
 
-/*
- * Elevator features for blk_queue_required_elevator_features:
- */
-/* Supports zoned block devices sequential write constraint */
-#define ELEVATOR_F_ZBD_SEQ_WRITE	(1U << 0)
-
-extern void blk_queue_required_elevator_features(struct request_queue *q,
-						 unsigned int features);
 extern bool blk_queue_can_use_dma_map_merging(struct request_queue *q,
 					      struct device *dev);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 21/26] block: Do not check zone type in blk_check_zone_append()
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (19 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 20/26] block: Remove elevator required features Damien Le Moal
@ 2024-02-02  7:30 ` Damien Le Moal
  2024-02-04 12:37   ` Hannes Reinecke
  2024-02-02  7:31 ` [PATCH 22/26] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:30 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Zone append operations are only allowed to target sequential write
required zones. blk_check_zone_append() uses bio_zone_is_seq() to check
this. However, this check is not necessary because:
1) For NVMe ZNS namespace devices, only sequential write required zones
   exist, making the zone type check useless.
2) For null_blk, the driver will fail the request anyway, thus notifying
   the user that a conventional zone was targeted.
3) For all other zoned devices, zone append is now emulated using zone
   write plugging, which checks that a zone append operation does not
   target a conventional zone.

In preparation for the removal of zone write locking and its
conventional zone bitmap (used by bio_zone_is_seq()), remove the
bio_zone_is_seq() call from blk_check_zone_append().
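
For reference, a minimal sketch of the equivalent check now performed
at plug time by the zone write plugging code (the helper name is
hypothetical; BLK_ZONE_WPLUG_CONV is the flag introduced by this
series):

  static bool blk_zone_wplug_handle_zone_append(struct blk_zone_wplug *zwplug,
                                                struct bio *bio)
  {
          /* A zone append to a conventional zone fails at plug time. */
          if (zwplug->flags & BLK_ZONE_WPLUG_CONV) {
                  bio_io_error(bio);
                  return true;    /* BIO handled (failed) */
          }
          return false;           /* continue normal plugging */
  }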

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3945cfcc4d9b..bb4af8ddd8e7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -577,8 +577,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector) ||
-	    !bio_zone_is_seq(bio))
+	if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector))
 		return BLK_STS_IOERR;
 
 	/*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 22/26] block: Move zone related debugfs attribute to blk-zoned.c
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (20 preceding siblings ...)
  2024-02-02  7:30 ` [PATCH 21/26] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
@ 2024-02-02  7:31 ` Damien Le Moal
  2024-02-04 12:38   ` Hannes Reinecke
  2024-02-02  7:31 ` [PATCH 23/26] block: Remove zone write locking Damien Le Moal
                   ` (6 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:31 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

block/blk-mq-debugfs-zoned.c contains a single debugfs attribute
function. Defining this outside of block/blk-zoned.c does not really
help in any way, so move this zone-related debugfs attribute to
block/blk-zoned.c and delete block/blk-mq-debugfs-zoned.c.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/Kconfig                |  4 ----
 block/Makefile               |  1 -
 block/blk-mq-debugfs-zoned.c | 22 ----------------------
 block/blk-mq-debugfs.h       |  2 +-
 block/blk-zoned.c            | 20 ++++++++++++++++++++
 5 files changed, 21 insertions(+), 28 deletions(-)
 delete mode 100644 block/blk-mq-debugfs-zoned.c

diff --git a/block/Kconfig b/block/Kconfig
index 1de4682d48cc..9f647149fbee 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -198,10 +198,6 @@ config BLK_DEBUG_FS
 	Unless you are building a kernel for a tiny system, you should
 	say Y here.
 
-config BLK_DEBUG_FS_ZONED
-       bool
-       default BLK_DEBUG_FS && BLK_DEV_ZONED
-
 config BLK_SED_OPAL
 	bool "Logic for interfacing with Opal enabled SEDs"
 	depends on KEYS
diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..168150b9c510 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -33,7 +33,6 @@ obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
-obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
 obj-$(CONFIG_BLK_PM)		+= blk-pm.o
 obj-$(CONFIG_BLK_INLINE_ENCRYPTION)	+= blk-crypto.o blk-crypto-profile.o \
diff --git a/block/blk-mq-debugfs-zoned.c b/block/blk-mq-debugfs-zoned.c
deleted file mode 100644
index a77b099c34b7..000000000000
--- a/block/blk-mq-debugfs-zoned.c
+++ /dev/null
@@ -1,22 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (C) 2017 Western Digital Corporation or its affiliates.
- */
-
-#include <linux/blkdev.h>
-#include "blk-mq-debugfs.h"
-
-int queue_zone_wlock_show(void *data, struct seq_file *m)
-{
-	struct request_queue *q = data;
-	unsigned int i;
-
-	if (!q->disk->seq_zones_wlock)
-		return 0;
-
-	for (i = 0; i < q->disk->nr_zones; i++)
-		if (test_bit(i, q->disk->seq_zones_wlock))
-			seq_printf(m, "%u\n", i);
-
-	return 0;
-}
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 9c7d4b6117d4..3ebe2c29b624 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -83,7 +83,7 @@ static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 }
 #endif
 
-#ifdef CONFIG_BLK_DEBUG_FS_ZONED
+#if defined(CONFIG_BLK_DEV_ZONED) && defined(CONFIG_BLK_DEBUG_FS)
 int queue_zone_wlock_show(void *data, struct seq_file *m);
 #else
 static inline int queue_zone_wlock_show(void *data, struct seq_file *m)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 3dadf37ad787..bac642e26a3e 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -20,6 +20,7 @@
 
 #include "blk.h"
 #include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
 
 #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
 static const char *const zone_cond_name[] = {
@@ -1418,3 +1419,22 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
+
+#ifdef CONFIG_BLK_DEBUG_FS
+
+int queue_zone_wlock_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	unsigned int i;
+
+	if (!q->disk->seq_zones_wlock)
+		return 0;
+
+	for (i = 0; i < q->disk->nr_zones; i++)
+		if (test_bit(i, q->disk->seq_zones_wlock))
+			seq_printf(m, "%u\n", i);
+
+	return 0;
+}
+
+#endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 23/26] block: Remove zone write locking
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (21 preceding siblings ...)
  2024-02-02  7:31 ` [PATCH 22/26] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
@ 2024-02-02  7:31 ` Damien Le Moal
  2024-02-04 12:38   ` Hannes Reinecke
  2024-02-02  7:31 ` [PATCH 24/26] block: Do not special-case plugging of zone write operations Damien Le Moal
                   ` (5 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:31 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Zone write locking is now unused and replaced with zone write plugging.
Remove all code that was implementing zone write locking, that is, the
various helper functions controlling request zone write locking and
the gendisk-attached zone bitmaps.

The "zone_wlock" mq-debugfs entry that was listing zones that are
write-locked is replaced with the zone_plugged_wplugs entry which lists
the number of zones that have a zone write plug throttling write
operations.
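
For context, a minimal completion-side sketch of what replaces the
per-zone write lock (a hypothetical function with locking elided; the
fields and flags are those introduced earlier in this series):

  static void blk_zone_write_plug_complete(struct request *rq)
  {
          struct gendisk *disk = rq->q->disk;
          struct blk_zone_wplug *zwplug =
                  &disk->zone_wplugs[disk_zone_no(disk, blk_rq_pos(rq))];

          if (!(rq->rq_flags & RQF_ZONE_WRITE_PLUGGING))
                  return;
          rq->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;

          if (!bio_list_empty(&zwplug->bio_list))
                  /* Unplug: submit the next BIO waiting on this zone. */
                  kblockd_schedule_work(&zwplug->bio_work);
          else
                  zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
  }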

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq-debugfs.c    |  3 +-
 block/blk-mq-debugfs.h    |  4 +-
 block/blk-zoned.c         | 98 ++++++---------------------------------
 include/linux/blk-mq.h    | 83 ---------------------------------
 include/linux/blk_types.h |  1 -
 include/linux/blkdev.h    | 36 ++------------
 6 files changed, 21 insertions(+), 204 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 94668e72ab09..b803f5b370e9 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -160,7 +160,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 	{ "requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops },
 	{ "pm_only", 0600, queue_pm_only_show, NULL },
 	{ "state", 0600, queue_state_show, queue_state_write },
-	{ "zone_wlock", 0400, queue_zone_wlock_show, NULL },
+	{ "zone_plugged_wplugs", 0400, queue_zone_plugged_wplugs_show, NULL },
 	{ },
 };
 
@@ -256,7 +256,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(HASHED),
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
-	RQF_NAME(ZONE_WRITE_LOCKED),
 	RQF_NAME(TIMED_OUT),
 	RQF_NAME(RESV),
 };
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 3ebe2c29b624..6d3ac4b77d59 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -84,9 +84,9 @@ static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 #endif
 
 #if defined(CONFIG_BLK_DEV_ZONED) && defined(CONFIG_BLK_DEBUG_FS)
-int queue_zone_wlock_show(void *data, struct seq_file *m);
+int queue_zone_plugged_wplugs_show(void *data, struct seq_file *m);
 #else
-static inline int queue_zone_wlock_show(void *data, struct seq_file *m)
+static inline int queue_zone_plugged_wplugs_show(void *data, struct seq_file *m)
 {
 	return 0;
 }
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index bac642e26a3e..4da634e9f5a0 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -80,52 +80,6 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
 }
 EXPORT_SYMBOL_GPL(blk_zone_cond_str);
 
-/*
- * Return true if a request is a write requests that needs zone write locking.
- */
-bool blk_req_needs_zone_write_lock(struct request *rq)
-{
-	if (!rq->q->disk->seq_zones_wlock)
-		return false;
-
-	return blk_rq_is_seq_zoned_write(rq);
-}
-EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
-
-bool blk_req_zone_write_trylock(struct request *rq)
-{
-	unsigned int zno = blk_rq_zone_no(rq);
-
-	if (test_and_set_bit(zno, rq->q->disk->seq_zones_wlock))
-		return false;
-
-	WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
-	rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
-
-	return true;
-}
-EXPORT_SYMBOL_GPL(blk_req_zone_write_trylock);
-
-void __blk_req_zone_write_lock(struct request *rq)
-{
-	if (WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq),
-					  rq->q->disk->seq_zones_wlock)))
-		return;
-
-	WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
-	rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
-}
-EXPORT_SYMBOL_GPL(__blk_req_zone_write_lock);
-
-void __blk_req_zone_write_unlock(struct request *rq)
-{
-	rq->rq_flags &= ~RQF_ZONE_WRITE_LOCKED;
-	if (rq->q->disk->seq_zones_wlock)
-		WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq),
-						 rq->q->disk->seq_zones_wlock));
-}
-EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
-
 /**
  * bdev_nr_zones - Get number of zones
  * @bdev:	Target device
@@ -1213,11 +1167,6 @@ void disk_free_zone_resources(struct gendisk *disk)
 	if (disk->zone_wplugs)
 		cancel_delayed_work_sync(&disk->zone_wplugs_work);
 
-	kfree(disk->conv_zones_bitmap);
-	disk->conv_zones_bitmap = NULL;
-	kfree(disk->seq_zones_wlock);
-	disk->seq_zones_wlock = NULL;
-
 	blk_zone_free_write_plugs(disk, disk->zone_wplugs, disk->nr_zones);
 	disk->zone_wplugs = NULL;
 
@@ -1226,9 +1175,6 @@ void disk_free_zone_resources(struct gendisk *disk)
 
 struct blk_revalidate_zone_args {
 	struct gendisk	*disk;
-	unsigned long	*conv_zones_bitmap;
-	unsigned long	*seq_zones_wlock;
-	unsigned int	nr_zones;
 	struct blk_zone_wplug *zone_wplugs;
 	sector_t	sector;
 };
@@ -1277,22 +1223,9 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	/* Check zone type */
 	switch (zone->type) {
 	case BLK_ZONE_TYPE_CONVENTIONAL:
-		if (!args->conv_zones_bitmap) {
-			args->conv_zones_bitmap =
-				blk_alloc_zone_bitmap(q->node, args->nr_zones);
-			if (!args->conv_zones_bitmap)
-				return -ENOMEM;
-		}
-		set_bit(idx, args->conv_zones_bitmap);
 		args->zone_wplugs[idx].flags |= BLK_ZONE_WPLUG_CONV;
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
-		if (!args->seq_zones_wlock) {
-			args->seq_zones_wlock =
-				blk_alloc_zone_bitmap(q->node, args->nr_zones);
-			if (!args->seq_zones_wlock)
-				return -ENOMEM;
-		}
 		args->zone_wplugs[idx].capacity = zone->capacity;
 		args->zone_wplugs[idx].wp_offset = blk_zone_wp_offset(zone);
 		break;
@@ -1308,7 +1241,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 }
 
 /**
- * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
+ * blk_revalidate_disk_zones - (re)allocate and initialize zone write plugs
  * @disk:	Target disk
  *
  * Helper function for low-level device drivers to check, (re) allocate and
@@ -1326,7 +1259,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	sector_t zone_sectors = q->limits.chunk_sectors;
 	sector_t capacity = get_capacity(disk);
 	struct blk_revalidate_zone_args args = { };
-	unsigned int noio_flag;
+	unsigned int nr_zones, noio_flag;
 	int ret = -ENOMEM;
 
 	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
@@ -1351,6 +1284,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 		return -ENODEV;
 	}
 
+	nr_zones = (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
+
 	/*
 	 * Ensure that all memory allocations in this context are done as if
 	 * GFP_NOIO was specified.
@@ -1358,8 +1293,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	noio_flag = memalloc_noio_save();
 
 	args.disk = disk;
-	args.nr_zones = (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
-	args.zone_wplugs = blk_zone_alloc_write_plugs(args.nr_zones);
+	args.zone_wplugs = blk_zone_alloc_write_plugs(nr_zones);
 	if (!args.zone_wplugs)
 		goto out_restore_noio;
 
@@ -1389,16 +1323,13 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	}
 
 	/*
-	 * Install the new bitmaps and update nr_zones only once the queue is
-	 * stopped and all I/Os are completed (i.e. a scheduler is not
-	 * referencing the bitmaps).
+	 * Install the new write plugs and update nr_zones only once the queue
+	 * is frozen and all I/Os are completed.
 	 */
 	blk_mq_freeze_queue(q);
 	if (ret > 0) {
 		mutex_lock(&disk->zone_wplugs_mutex);
-		disk->nr_zones = args.nr_zones;
-		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
-		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
+		disk->nr_zones = nr_zones;
 		swap(disk->zone_wplugs, args.zone_wplugs);
 		mutex_unlock(&disk->zone_wplugs_mutex);
 		ret = 0;
@@ -1408,9 +1339,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	}
 	blk_mq_unfreeze_queue(q);
 
-	kfree(args.seq_zones_wlock);
-	kfree(args.conv_zones_bitmap);
-	blk_zone_free_write_plugs(disk, args.zone_wplugs, args.nr_zones);
+	blk_zone_free_write_plugs(disk, args.zone_wplugs, nr_zones);
 
 	return ret;
 
@@ -1422,16 +1351,17 @@ EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
 
 #ifdef CONFIG_BLK_DEBUG_FS
 
-int queue_zone_wlock_show(void *data, struct seq_file *m)
+int queue_zone_plugged_wplugs_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;
+	struct gendisk *disk = q->disk;
 	unsigned int i;
 
-	if (!q->disk->seq_zones_wlock)
+	if (!disk->zone_wplugs)
 		return 0;
 
-	for (i = 0; i < q->disk->nr_zones; i++)
-		if (test_bit(i, q->disk->seq_zones_wlock))
+	for (i = 0; i < disk->nr_zones; i++)
+		if (disk->zone_wplugs[i].flags & BLK_ZONE_WPLUG_PLUGGED)
 			seq_printf(m, "%u\n", i);
 
 	return 0;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index bc74f904b5a1..1478cc4fdebe 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -53,8 +53,6 @@ typedef __u32 __bitwise req_flags_t;
 /* Look at ->special_vec for the actual data payload instead of the
    bio chain. */
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
-/* The per-zone write lock is held for this request */
-#define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* The request completion needs to be signaled to zone write pluging. */
 #define RQF_ZONE_WRITE_PLUGGING	((__force req_flags_t)(1 << 20))
 /* ->timeout has been called, don't expire again */
@@ -1148,85 +1146,4 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 void blk_dump_rq_flags(struct request *, char *);
 
-#ifdef CONFIG_BLK_DEV_ZONED
-static inline unsigned int blk_rq_zone_no(struct request *rq)
-{
-	return disk_zone_no(rq->q->disk, blk_rq_pos(rq));
-}
-
-static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
-{
-	return disk_zone_is_seq(rq->q->disk, blk_rq_pos(rq));
-}
-
-/**
- * blk_rq_is_seq_zoned_write() - Check if @rq requires write serialization.
- * @rq: Request to examine.
- *
- * Note: REQ_OP_ZONE_APPEND requests do not require serialization.
- */
-static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
-{
-	return op_needs_zoned_write_locking(req_op(rq)) &&
-		blk_rq_zone_is_seq(rq);
-}
-
-bool blk_req_needs_zone_write_lock(struct request *rq);
-bool blk_req_zone_write_trylock(struct request *rq);
-void __blk_req_zone_write_lock(struct request *rq);
-void __blk_req_zone_write_unlock(struct request *rq);
-
-static inline void blk_req_zone_write_lock(struct request *rq)
-{
-	if (blk_req_needs_zone_write_lock(rq))
-		__blk_req_zone_write_lock(rq);
-}
-
-static inline void blk_req_zone_write_unlock(struct request *rq)
-{
-	if (rq->rq_flags & RQF_ZONE_WRITE_LOCKED)
-		__blk_req_zone_write_unlock(rq);
-}
-
-static inline bool blk_req_zone_is_write_locked(struct request *rq)
-{
-	return rq->q->disk->seq_zones_wlock &&
-		test_bit(blk_rq_zone_no(rq), rq->q->disk->seq_zones_wlock);
-}
-
-static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
-{
-	if (!blk_req_needs_zone_write_lock(rq))
-		return true;
-	return !blk_req_zone_is_write_locked(rq);
-}
-#else /* CONFIG_BLK_DEV_ZONED */
-static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
-{
-	return false;
-}
-
-static inline bool blk_req_needs_zone_write_lock(struct request *rq)
-{
-	return false;
-}
-
-static inline void blk_req_zone_write_lock(struct request *rq)
-{
-}
-
-static inline void blk_req_zone_write_unlock(struct request *rq)
-{
-}
-static inline bool blk_req_zone_is_write_locked(struct request *rq)
-{
-	return false;
-}
-
-static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
-{
-	return true;
-}
-#endif /* CONFIG_BLK_DEV_ZONED */
-
 #endif /* BLK_MQ_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index fd0dc2d08924..31994887cbdb 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -295,7 +295,6 @@ enum {
 	BIO_QOS_THROTTLED,	/* bio went through rq_qos throttle path */
 	BIO_QOS_MERGED,		/* but went through rq_qos merge path */
 	BIO_REMAPPED,
-	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
 	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
 	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
 	BIO_FLAG_LAST
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cb90a59d35cb..6dfefb2de652 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -176,24 +176,14 @@ struct gendisk {
 
 #ifdef CONFIG_BLK_DEV_ZONED
 	/*
-	 * Zoned block device information for request dispatch control.
-	 * nr_zones is the total number of zones of the device. This is always
-	 * 0 for regular block devices. conv_zones_bitmap is a bitmap of nr_zones
-	 * bits which indicates if a zone is conventional (bit set) or
-	 * sequential (bit clear). seq_zones_wlock is a bitmap of nr_zones
-	 * bits which indicates if a zone is write locked, that is, if a write
-	 * request targeting the zone was dispatched.
-	 *
-	 * Reads of this information must be protected with blk_queue_enter() /
-	 * blk_queue_exit(). Modifying this information is only allowed while
-	 * no requests are being processed. See also blk_mq_freeze_queue() and
-	 * blk_mq_unfreeze_queue().
+	 * Zoned block device information. Reads of this information must be
+	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
+	 * information is only allowed while no requests are being processed.
+	 * See also blk_mq_freeze_queue() and blk_mq_unfreeze_queue().
 	 */
 	unsigned int		nr_zones;
 	unsigned int		max_open_zones;
 	unsigned int		max_active_zones;
-	unsigned long		*conv_zones_bitmap;
-	unsigned long		*seq_zones_wlock;
 	struct blk_zone_wplug	*zone_wplugs;
 	struct mutex		zone_wplugs_mutex;
 	atomic_t		zone_nr_wplugs_with_error;
@@ -629,15 +619,6 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 	return sector >> ilog2(disk->queue->limits.chunk_sectors);
 }
 
-static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
-{
-	if (!blk_queue_is_zoned(disk->queue))
-		return false;
-	if (!disk->conv_zones_bitmap)
-		return true;
-	return !test_bit(disk_zone_no(disk, sector), disk->conv_zones_bitmap);
-}
-
 static inline void disk_set_max_open_zones(struct gendisk *disk,
 		unsigned int max_open_zones)
 {
@@ -671,10 +652,6 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
 {
 	return 0;
 }
-static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
-{
-	return false;
-}
 static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
 	return 0;
@@ -859,11 +836,6 @@ static inline bool bio_straddle_zones(struct bio *bio)
 		disk_zone_no(bio->bi_bdev->bd_disk, bio_end_sector(bio) - 1);
 }
 
-static inline unsigned int bio_zone_is_seq(struct bio *bio)
-{
-	return disk_zone_is_seq(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector);
-}
-
 /*
  * Return how much of the chunk is left to be used for I/O at a given offset.
  */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 24/26] block: Do not special-case plugging of zone write operations
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (22 preceding siblings ...)
  2024-02-02  7:31 ` [PATCH 23/26] block: Remove zone write locking Damien Le Moal
@ 2024-02-02  7:31 ` Damien Le Moal
  2024-02-04 12:39   ` Hannes Reinecke
  2024-02-02  7:31 ` [PATCH 25/26] block: Reduce zone write plugging memory usage Damien Le Moal
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:31 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

With the block layer zone write plugging being automatically done for
any write operation to a zone of a zoned block device, regular request
plugging through current->plug can see at most a single write request
per zone. In that case, any potential reordering of the plugged
requests is harmless. We can thus remove the special casing for write
operations to zones and have these requests plugged as well. This
allows removing the function blk_mq_plug() and directly using
current->plug where needed.
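
As an illustration (the BIOs here are hypothetical, not from the
patch), zoned writes now batch through the caller's plug like any
other I/O:

  struct blk_plug plug;

  blk_start_plug(&plug);
  submit_bio(zoned_write_bio);    /* plugged, no more special casing */
  submit_bio(read_bio);           /* plugged as before */
  blk_finish_plug(&plug);         /* flushed, at most one write per zone */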

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-core.c       |  6 ------
 block/blk-merge.c      |  3 +--
 block/blk-mq.c         |  7 +------
 block/blk-mq.h         | 31 -------------------------------
 include/linux/blkdev.h | 12 ------------
 5 files changed, 2 insertions(+), 57 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bb4af8ddd8e7..5cef05572f68 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -886,12 +886,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags)
 	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		return 0;
 
-	/*
-	 * As the requests that require a zone lock are not plugged in the
-	 * first place, directly accessing the plug instead of using
-	 * blk_mq_plug() should not have any consequences during flushing for
-	 * zoned devices.
-	 */
 	blk_flush_plug(current->plug, false);
 
 	/*
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2b5489cd9c65..bb18c7dc1227 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -1104,10 +1104,9 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs)
 {
-	struct blk_plug *plug;
+	struct blk_plug *plug = current->plug;
 	struct request *rq;
 
-	plug = blk_mq_plug(bio);
 	if (!plug || rq_list_empty(plug->mq_list))
 		return false;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8576940f8674..72bd359a225c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1337,11 +1337,6 @@ void blk_execute_rq_nowait(struct request *rq, bool at_head)
 
 	blk_account_io_start(rq);
 
-	/*
-	 * As plugging can be enabled for passthrough requests on a zoned
-	 * device, directly accessing the plug instead of using blk_mq_plug()
-	 * should not have any consequences.
-	 */
 	if (current->plug && !at_head) {
 		blk_add_rq_to_plug(current->plug, rq);
 		return;
@@ -2948,7 +2943,7 @@ static void bio_set_ioprio(struct bio *bio)
 void blk_mq_submit_bio(struct bio *bio)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	struct blk_plug *plug = blk_mq_plug(bio);
+	struct blk_plug *plug = current->plug;
 	const int is_sync = op_is_sync(bio->bi_opf);
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int nr_segs = 1;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..260beea8e332 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -365,37 +365,6 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 		qmap->mq_map[cpu] = 0;
 }
 
-/*
- * blk_mq_plug() - Get caller context plug
- * @bio : the bio being submitted by the caller context
- *
- * Plugging, by design, may delay the insertion of BIOs into the elevator in
- * order to increase BIO merging opportunities. This however can cause BIO
- * insertion order to change from the order in which submit_bio() is being
- * executed in the case of multiple contexts concurrently issuing BIOs to a
- * device, even if these context are synchronized to tightly control BIO issuing
- * order. While this is not a problem with regular block devices, this ordering
- * change can cause write BIO failures with zoned block devices as these
- * require sequential write patterns to zones. Prevent this from happening by
- * ignoring the plug state of a BIO issuing context if it is for a zoned block
- * device and the BIO to plug is a write operation.
- *
- * Return current->plug if the bio can be plugged and NULL otherwise
- */
-static inline struct blk_plug *blk_mq_plug( struct bio *bio)
-{
-	/* Zoned block device write operation case: do not plug the BIO */
-	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-	    bdev_op_is_zoned_write(bio->bi_bdev, bio_op(bio)))
-		return NULL;
-
-	/*
-	 * For regular block devices or read operations, use the context plug
-	 * which may be NULL if blk_start_plug() was not executed.
-	 */
-	return current->plug;
-}
-
 /* Free all requests on the list */
 static inline void blk_mq_free_requests(struct list_head *list)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6dfefb2de652..e96baa552e12 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1262,18 +1262,6 @@ static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
 	return disk_zone_no(bdev->bd_disk, sec);
 }
 
-/* Whether write serialization is required for @op on zoned devices. */
-static inline bool op_needs_zoned_write_locking(enum req_op op)
-{
-	return op == REQ_OP_WRITE || op == REQ_OP_WRITE_ZEROES;
-}
-
-static inline bool bdev_op_is_zoned_write(struct block_device *bdev,
-					  enum req_op op)
-{
-	return bdev_is_zoned(bdev) && op_needs_zoned_write_locking(op);
-}
-
 static inline sector_t bdev_zone_sectors(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (23 preceding siblings ...)
  2024-02-02  7:31 ` [PATCH 24/26] block: Do not special-case plugging of zone write operations Damien Le Moal
@ 2024-02-02  7:31 ` Damien Le Moal
  2024-02-04 12:42   ` Hannes Reinecke
  2024-02-02  7:31 ` [PATCH 26/26] block: Add zone_active_wplugs debugfs entry Damien Le Moal
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:31 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

With zone write plugging, each zone of a zoned block device has a
64B struct blk_zone_wplug. While this is not a problem for
small-capacity drives with few zones, this structure size results in
large memory usage per device for large-capacity block devices.
E.g., for a 28 TB SMR disk with over 104,000 zones of 256 MB, the zone
write plug array of the gendisk uses 6.6 MB of memory.

However, except for the zone write plug spinlock, flags, zone capacity
and zone write pointer offset, which all need to always be available
(the latter two to avoid issuing too many report zones commands), the
remaining fields of struct blk_zone_wplug are needed only when a zone
is being written to.

This commit introduces struct blk_zone_active_wplug to reduce the size
of struct blk_zone_wplug from 64B down to 16B. This is done using a
union of a pointer to a struct blk_zone_active_wplug with the zone
write pointer offset and zone capacity fields, with the zone write plug
spinlock and flags left as the first fields of struct blk_zone_wplug.

The flag BLK_ZONE_WPLUG_ACTIVE is introduced to indicate whether the
pointer to struct blk_zone_active_wplug of a zone write plug is valid.
When it is, the write pointer offset and zone capacity fields must be
accessed through struct blk_zone_active_wplug. Otherwise, they are
accessed directly from struct blk_zone_wplug.
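
A minimal sketch of the resulting accessor pattern (the helper is
hypothetical; the field names are those of this patch):

  static unsigned int blk_zone_wplug_wp_offset(struct blk_zone_wplug *zwplug)
  {
          /*
           * The write pointer offset lives in the union member
           * matching the zone state.
           */
          if (zwplug->flags & BLK_ZONE_WPLUG_ACTIVE)
                  return zwplug->zawplug->wp_offset;
          return zwplug->info.wp_offset;
  }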

This data structure organization allows tracking the write pointer
offset of all zones regardless of their write state (active or not).
The handling of zone reset, reset all and finish operations is modified
to update a zone write pointer offset according to this state.

A zone is activated in blk_zone_wplug_handle_write() with a call to
blk_zone_activate_wplug(). An allocated active zone write plug is
reclaimed (freed) once the zone becomes full, or is reset and becomes
empty: this happens either directly when a plugged BIO completes and
the zone is full, or when resetting or finishing zones. Freeing an
active zone write plug is done using blk_zone_free_active_wplug().

For allocating struct blk_zone_active_wplug, a mempool is created and
sized according to the disk zone resources (maximum number of open zones
and maximum number of active zones). For devices with no zone resource
limits, the default BLK_ZONE_DEFAULT_ACTIVE_WPLUG_NR (128) is used.
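
A minimal sketch of this sizing logic (the helper name is assumed, not
taken from the patch):

  static unsigned int blk_zone_active_wplugs_pool_size(struct gendisk *disk)
  {
          unsigned int size = max(disk->max_open_zones,
                                  disk->max_active_zones);

          /* Devices without zone resource limits use a fixed default. */
          return size ? size : BLK_ZONE_DEFAULT_ACTIVE_WPLUG_NR;
  }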

With this mechanism, the amount of memory used per block device for zone
write plugs is roughly reduced by a factor of 4. E.g., for a 28 TB SMR
hard disk, memory usage is reduced to about 1.6 MB.

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-core.c       |   2 +
 block/blk-zoned.c      | 293 ++++++++++++++++++++++++++++++++---------
 block/blk.h            |   4 +
 include/linux/blkdev.h |   4 +
 4 files changed, 240 insertions(+), 63 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5cef05572f68..e926f17d04d8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1220,5 +1220,7 @@ int __init blk_dev_init(void)
 
 	blk_debugfs_root = debugfs_create_dir("block", NULL);
 
+	blk_zone_dev_init();
+
 	return 0;
 }
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 4da634e9f5a0..865fc372f25e 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -36,17 +36,33 @@ static const char *const zone_cond_name[] = {
 #undef ZONE_COND_NAME
 
 /*
- * Per-zone write plug.
+ * Active zone write plug.
  */
-struct blk_zone_wplug {
-	spinlock_t		lock;
-	unsigned int		flags;
+struct blk_zone_active_wplug {
+	struct blk_zone_wplug	*zwplug;
 	struct bio_list		bio_list;
 	struct work_struct	bio_work;
 	unsigned int		wp_offset;
 	unsigned int		capacity;
 };
 
+static struct kmem_cache *blk_zone_active_wplugs_cachep;
+
+/*
+ * Per-zone write plug.
+ */
+struct blk_zone_wplug {
+	spinlock_t		lock;
+	unsigned int		flags;
+	union {
+		struct {
+			unsigned int	wp_offset;
+			unsigned int	capacity;
+		} info;
+		struct blk_zone_active_wplug *zawplug;
+	};
+};
+
 /*
  * Zone write plug flags bits:
  *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
@@ -56,10 +72,13 @@ struct blk_zone_wplug {
  *    being executed or the zone write plug bio list is not empty.
  *  - BLK_ZONE_WPLUG_ERROR: Indicate that a write error happened which will be
  *    recovered with a report zone to update the zone write pointer offset.
+ *  - BLK_ZONE_WPLUG_ACTIVE: Indicate that the zone is active, meaning that
+ *    a struct blk_zone_active_wplug was allocated for the zone.
  */
 #define BLK_ZONE_WPLUG_CONV	(1U << 0)
 #define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
 #define BLK_ZONE_WPLUG_ERROR	(1U << 2)
+#define BLK_ZONE_WPLUG_ACTIVE	(1U << 3)
 
 /**
  * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
@@ -426,13 +445,13 @@ static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
 	blk_queue_exit(q);
 }
 
-static int blk_zone_wplug_abort(struct gendisk *disk,
-				struct blk_zone_wplug *zwplug)
+static int blk_zone_active_wplug_abort(struct gendisk *disk,
+				struct blk_zone_active_wplug *zawplug)
 {
 	struct bio *bio;
 	int nr_aborted = 0;
 
-	while ((bio = bio_list_pop(&zwplug->bio_list))) {
+	while ((bio = bio_list_pop(&zawplug->bio_list))) {
 		blk_zone_wplug_bio_io_error(bio);
 		nr_aborted++;
 	}
@@ -440,15 +459,15 @@ static int blk_zone_wplug_abort(struct gendisk *disk,
 	return nr_aborted;
 }
 
-static void blk_zone_wplug_abort_unaligned(struct gendisk *disk,
-					   struct blk_zone_wplug *zwplug)
+static void blk_zone_active_wplug_abort_unaligned(struct gendisk *disk,
+				struct blk_zone_active_wplug *zawplug)
 {
-	unsigned int wp_offset = zwplug->wp_offset;
+	unsigned int wp_offset = zawplug->wp_offset;
 	struct bio_list bl = BIO_EMPTY_LIST;
 	struct bio *bio;
 
-	while ((bio = bio_list_pop(&zwplug->bio_list))) {
-		if (wp_offset >= zwplug->capacity ||
+	while ((bio = bio_list_pop(&zawplug->bio_list))) {
+		if (wp_offset >= zawplug->capacity ||
 		    (bio_op(bio) != REQ_OP_ZONE_APPEND &&
 		     bio_offset_from_zone_start(bio) != wp_offset)) {
 			blk_zone_wplug_bio_io_error(bio);
@@ -459,7 +478,57 @@ static void blk_zone_wplug_abort_unaligned(struct gendisk *disk,
 		bio_list_add(&bl, bio);
 	}
 
-	bio_list_merge(&zwplug->bio_list, &bl);
+	bio_list_merge(&zawplug->bio_list, &bl);
+}
+
+static void blk_zone_wplug_bio_work(struct work_struct *work);
+
+/*
+ * Activate an inactive zone by allocating its active write plug.
+ */
+static bool blk_zone_activate_wplug(struct gendisk *disk,
+				    struct blk_zone_wplug *zwplug)
+{
+	struct blk_zone_active_wplug *zawplug;
+
+	/* If we have an active write plug already, keep using it. */
+	if (zwplug->flags & BLK_ZONE_WPLUG_ACTIVE)
+		return true;
+
+	/*
+	 * Allocate an active write plug. This may fail if the mempool is fully
+	 * used if the user partially writes too many zones, which is possible
+	 * if the device has no active zone limit, if the user is not respecting
+	 * the open zone limit or if the device has no limits at all.
+	 */
+	zawplug = mempool_alloc(disk->zone_awplugs_pool, GFP_NOWAIT);
+	if (!zawplug)
+		return false;
+
+	zawplug->zwplug = zwplug;
+	bio_list_init(&zawplug->bio_list);
+	INIT_WORK(&zawplug->bio_work, blk_zone_wplug_bio_work);
+	zawplug->capacity = zwplug->info.capacity;
+	zawplug->wp_offset = zwplug->info.wp_offset;
+
+	zwplug->zawplug = zawplug;
+	zwplug->flags |= BLK_ZONE_WPLUG_ACTIVE;
+
+	return true;
+}
+
+static void blk_zone_free_active_wplug(struct gendisk *disk,
+				       struct blk_zone_active_wplug *zawplug)
+{
+	struct blk_zone_wplug *zwplug = zawplug->zwplug;
+
+	WARN_ON_ONCE(!bio_list_empty(&zawplug->bio_list));
+
+	zwplug->flags &= ~(BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ACTIVE);
+	zwplug->info.capacity = zawplug->capacity;
+	zwplug->info.wp_offset = zawplug->wp_offset;
+
+	mempool_free(zawplug, disk->zone_awplugs_pool);
 }
 
 /*
@@ -499,6 +568,8 @@ static void blk_zone_wplug_set_wp_offset(struct gendisk *disk,
 					 struct blk_zone_wplug *zwplug,
 					 unsigned int wp_offset)
 {
+	struct blk_zone_active_wplug *zawplug;
+
 	/*
 	 * Updating the write pointer offset puts back the zone
 	 * in a good state. So clear the error flag and decrement the
@@ -509,9 +580,24 @@ static void blk_zone_wplug_set_wp_offset(struct gendisk *disk,
 		atomic_dec(&disk->zone_nr_wplugs_with_error);
 	}
 
+	/* Inactive zones only need the write pointer updated. */
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_ACTIVE)) {
+		zwplug->info.wp_offset = wp_offset;
+		return;
+	}
+
 	/* Update the zone write pointer and abort all plugged BIOs. */
-	zwplug->wp_offset = wp_offset;
-	blk_zone_wplug_abort(disk, zwplug);
+	zawplug = zwplug->zawplug;
+	zawplug->wp_offset = wp_offset;
+	blk_zone_active_wplug_abort(disk, zawplug);
+
+	/*
+	 * We have no remaining plugged BIOs. So if there is no BIO being
+	 * executed (i.e. the zone is not plugged), then free the active write
+	 * plug as it is now either full or empty.
+	 */
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED))
+		blk_zone_free_active_wplug(disk, zawplug);
 }
 
 static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
@@ -526,9 +612,6 @@ static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
 		return true;
 	}
 
-	if (!bdev_emulates_zone_append(bio->bi_bdev))
-		return false;
-
 	/*
 	 * Set the zone write pointer offset to 0 (reset case) or to the
 	 * zone size (finish case). This will abort all BIOs plugged for the
@@ -549,9 +632,6 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
 	unsigned long flags;
 	unsigned int i;
 
-	if (!bdev_emulates_zone_append(bio->bi_bdev))
-		return false;
-
 	/*
 	 * Set the write pointer offset of all zones to 0. This will abort all
 	 * plugged BIOs. It is fine as resetting zones while writes are still
@@ -598,7 +678,7 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
 	 * user must also issue write sequentially. So simply add the new BIO
 	 * at the tail of the list to preserve the sequential write order.
 	 */
-	bio_list_add(&zwplug->bio_list, bio);
+	bio_list_add(&zwplug->zawplug->bio_list, bio);
 }
 
 /*
@@ -623,7 +703,8 @@ void blk_zone_write_plug_bio_merged(struct bio *bio)
 	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
 
 	/* Advance the zone write pointer offset. */
-	zwplug->wp_offset += bio_sectors(bio);
+	WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_ACTIVE));
+	zwplug->zawplug->wp_offset += bio_sectors(bio);
 
 	blk_zone_wplug_unlock(zwplug, flags);
 }
@@ -636,6 +717,7 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 {
 	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(req->bio);
 	sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
+	struct blk_zone_active_wplug *zawplug = zwplug->zawplug;
 	struct request_queue *q = req->q;
 	unsigned long flags;
 	struct bio *bio;
@@ -654,8 +736,8 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 	 * into the back of the request.
 	 */
 	blk_zone_wplug_lock(zwplug, flags);
-	while (zwplug->wp_offset < zwplug->capacity &&
-	       (bio = bio_list_peek(&zwplug->bio_list))) {
+	while (zawplug->wp_offset < zawplug->capacity &&
+	       (bio = bio_list_peek(&zawplug->bio_list))) {
 		if (bio->bi_iter.bi_sector != req_back_sector ||
 		    !blk_rq_merge_ok(req, bio))
 			break;
@@ -663,10 +745,10 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 		WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE_ZEROES &&
 			     !bio->__bi_nr_segments);
 
-		bio_list_pop(&zwplug->bio_list);
+		bio_list_pop(&zawplug->bio_list);
 		if (bio_attempt_back_merge(req, bio, bio->__bi_nr_segments) !=
 		    BIO_MERGE_OK) {
-			bio_list_add_head(&zwplug->bio_list, bio);
+			bio_list_add_head(&zawplug->bio_list, bio);
 			break;
 		}
 
@@ -675,7 +757,7 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 		 * plugging the BIO and advance the write pointer offset.
 		 */
 		blk_queue_exit(q);
-		zwplug->wp_offset += bio_sectors(bio);
+		zawplug->wp_offset += bio_sectors(bio);
 
 		req_back_sector += bio_sectors(bio);
 	}
@@ -698,12 +780,7 @@ static inline void blk_zone_wplug_set_error(struct gendisk *disk,
 static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 				       struct bio *bio)
 {
-	/*
-	 * If we do not need to emulate zone append, zone write pointer offset
-	 * tracking is not necessary and we have nothing to do.
-	 */
-	if (!bdev_emulates_zone_append(bio->bi_bdev))
-		return true;
+	struct blk_zone_active_wplug *zawplug = zwplug->zawplug;
 
 	/*
 	 * Check that the user is not attempting to write to a full zone.
@@ -711,7 +788,7 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 	 * write pointer offset, causing zone append BIOs for one zone to be
 	 * directed at the following zone.
 	 */
-	if (zwplug->wp_offset >= zwplug->capacity)
+	if (zawplug->wp_offset >= zawplug->capacity)
 		goto err;
 
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
@@ -722,7 +799,7 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 		 */
 		bio->bi_opf &= ~REQ_OP_MASK;
 		bio->bi_opf |= REQ_OP_WRITE | REQ_NOMERGE;
-		bio->bi_iter.bi_sector += zwplug->wp_offset;
+		bio->bi_iter.bi_sector += zawplug->wp_offset;
 
 		/*
 		 * Remember that this BIO is in fact a zone append operation
@@ -735,12 +812,12 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 		 * whole lot of error handling trouble if we don't send it off
 		 * to the driver.
 		 */
-		if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
+		if (bio_offset_from_zone_start(bio) != zawplug->wp_offset)
 			goto err;
 	}
 
 	/* Advance the zone write pointer offset. */
-	zwplug->wp_offset += bio_sectors(bio);
+	zawplug->wp_offset += bio_sectors(bio);
 
 	return true;
 
@@ -785,6 +862,12 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 
 	blk_zone_wplug_lock(zwplug, flags);
 
+	if (!blk_zone_activate_wplug(bio->bi_bdev->bd_disk, zwplug)) {
+		blk_zone_wplug_unlock(zwplug, flags);
+		bio_io_error(bio);
+		return true;
+	}
+
 	/* Indicate that this BIO is being handled using zone write plugging. */
 	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
 
@@ -867,6 +950,15 @@ bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
 	 * not need serialization with write and append operations. It is the
 	 * responsibility of the user to not issue reset and finish commands
 	 * when write operations are in flight.
+	 *
+	 * Note: for native zone append operations, we do not do any tracking of
+	 * the zone write pointer offset. This means that zones written only
+	 * using zone append operations will never be activated, thus avoiding
+	 * any overhead. If the user mixes regular writes and native zone append
+	 * operations for the same zone, the zone write plug will be activated
+	 * and have an incorrect write pointer offset. That is fine as mixing
+	 * these operations will very likely fail anyway, in which case the
+	 * zone error handling will recover a correct write pointer offset.
 	 */
 	switch (bio_op(bio)) {
 	case REQ_OP_ZONE_APPEND:
@@ -894,6 +986,7 @@ EXPORT_SYMBOL_GPL(blk_zone_write_plug_bio);
 static void blk_zone_write_plug_unplug_bio(struct gendisk *disk,
 					   struct blk_zone_wplug *zwplug)
 {
+	struct blk_zone_active_wplug *zawplug;
 	unsigned long flags;
 
 	blk_zone_wplug_lock(zwplug, flags);
@@ -910,10 +1003,22 @@ static void blk_zone_write_plug_unplug_bio(struct gendisk *disk,
 	}
 
 	/* Schedule submission of the next plugged BIO if we have one. */
-	if (!bio_list_empty(&zwplug->bio_list))
-		kblockd_schedule_work(&zwplug->bio_work);
-	else
-		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+	WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_ACTIVE));
+	zawplug = zwplug->zawplug;
+	if (!bio_list_empty(&zawplug->bio_list)) {
+		kblockd_schedule_work(&zawplug->bio_work);
+		blk_zone_wplug_unlock(zwplug, flags);
+		return;
+	}
+
+	zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+
+	/*
+	 * If the zone is now empty (a zone reset was executed) or fully
+	 * written (by writes or a zone finish), free its active write plug.
+	 */
+	if (!zawplug->wp_offset || zawplug->wp_offset >= zawplug->capacity)
+		blk_zone_free_active_wplug(disk, zawplug);
 
 	blk_zone_wplug_unlock(zwplug, flags);
 }
@@ -964,8 +1069,9 @@ void blk_zone_write_plug_complete_request(struct request *req)
 
 static void blk_zone_wplug_bio_work(struct work_struct *work)
 {
-	struct blk_zone_wplug *zwplug =
-		container_of(work, struct blk_zone_wplug, bio_work);
+	struct blk_zone_active_wplug *zawplug =
+		container_of(work, struct blk_zone_active_wplug, bio_work);
+	struct blk_zone_wplug *zwplug = zawplug->zwplug;
 	unsigned long flags;
 	struct bio *bio;
 
@@ -975,7 +1081,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 	 */
 	blk_zone_wplug_lock(zwplug, flags);
 
-	bio = bio_list_pop(&zwplug->bio_list);
+	bio = bio_list_pop(&zawplug->bio_list);
 	if (!bio) {
 		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
 		blk_zone_wplug_unlock(zwplug, flags);
@@ -984,7 +1090,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 
 	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
 		/* Error recovery will decide what to do with the BIO. */
-		bio_list_add_head(&zwplug->bio_list, bio);
+		bio_list_add_head(&zawplug->bio_list, bio);
 		blk_zone_wplug_unlock(zwplug, flags);
 		return;
 	}
@@ -1039,7 +1145,8 @@ static void blk_zone_wplug_handle_error(struct gendisk *disk,
 {
 	unsigned int zno = zwplug - disk->zone_wplugs;
 	sector_t zone_start_sector = bdev_zone_sectors(disk->part0) * zno;
-	unsigned int noio_flag;
+	struct blk_zone_active_wplug *zawplug;
+	unsigned int noio_flag, wp_offset;
 	struct blk_zone zone;
 	unsigned long flags;
 	int ret;
@@ -1068,25 +1175,40 @@ static void blk_zone_wplug_handle_error(struct gendisk *disk,
 		 * forever on plugged BIOs to complete if there is a revalidate
 		 * or queue freeze on-going.
 		 */
-		blk_zone_wplug_abort(disk, zwplug);
-		goto unplug;
+		if (zwplug->flags & BLK_ZONE_WPLUG_ACTIVE) {
+			zawplug = zwplug->zawplug;
+			zawplug->wp_offset = UINT_MAX;
+			blk_zone_active_wplug_abort(disk, zawplug);
+			goto unplug;
+		}
+		goto unlock;
 	}
 
 	/* Update the zone capacity and write pointer offset. */
-	zwplug->wp_offset = blk_zone_wp_offset(&zone);
-	zwplug->capacity = zone.capacity;
+	wp_offset = blk_zone_wp_offset(&zone);
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_ACTIVE)) {
+		zwplug->info.wp_offset = wp_offset;
+		zwplug->info.capacity = zone.capacity;
+		goto unlock;
+	}
 
-	blk_zone_wplug_abort_unaligned(disk, zwplug);
+	zawplug = zwplug->zawplug;
+	zawplug->wp_offset = wp_offset;
+	zawplug->capacity = zone.capacity;
+
+	blk_zone_active_wplug_abort_unaligned(disk, zawplug);
 
 	/* Restart BIO submission if we still have any BIO left. */
-	if (!bio_list_empty(&zwplug->bio_list)) {
+	if (!bio_list_empty(&zawplug->bio_list)) {
 		WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
-		kblockd_schedule_work(&zwplug->bio_work);
+		kblockd_schedule_work(&zawplug->bio_work);
 		goto unlock;
 	}
 
 unplug:
 	zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+	if (!zawplug->wp_offset || zawplug->wp_offset >= zawplug->capacity)
+		blk_zone_free_active_wplug(disk, zawplug);
 
 unlock:
 	blk_zone_wplug_unlock(zwplug, flags);
@@ -1125,11 +1247,8 @@ static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
 	if (!zwplugs)
 		return NULL;
 
-	for (i = 0; i < nr_zones; i++) {
+	for (i = 0; i < nr_zones; i++)
 		spin_lock_init(&zwplugs[i].lock);
-		bio_list_init(&zwplugs[i].bio_list);
-		INIT_WORK(&zwplugs[i].bio_work, blk_zone_wplug_bio_work);
-	}
 
 	return zwplugs;
 }
@@ -1152,7 +1271,12 @@ static void blk_zone_free_write_plugs(struct gendisk *disk,
 			atomic_dec(&disk->zone_nr_wplugs_with_error);
 			zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
 		}
-		n = blk_zone_wplug_abort(disk, zwplug);
+		if (!(zwplug->flags & BLK_ZONE_WPLUG_ACTIVE)) {
+			blk_zone_wplug_unlock(zwplug, flags);
+			continue;
+		}
+		n = blk_zone_active_wplug_abort(disk, zwplug->zawplug);
+		blk_zone_free_active_wplug(disk, zwplug->zawplug);
 		blk_zone_wplug_unlock(zwplug, flags);
 		if (n)
 			pr_warn_ratelimited("%s: zone %u, %u plugged BIOs aborted\n",
@@ -1171,6 +1295,8 @@ void disk_free_zone_resources(struct gendisk *disk)
 	disk->zone_wplugs = NULL;
 
 	mutex_destroy(&disk->zone_wplugs_mutex);
+	mempool_destroy(disk->zone_awplugs_pool);
+	disk->zone_awplugs_pool = NULL;
 }
 
 struct blk_revalidate_zone_args {
@@ -1226,8 +1352,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		args->zone_wplugs[idx].flags |= BLK_ZONE_WPLUG_CONV;
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
-		args->zone_wplugs[idx].capacity = zone->capacity;
-		args->zone_wplugs[idx].wp_offset = blk_zone_wp_offset(zone);
+		args->zone_wplugs[idx].info.capacity = zone->capacity;
+		args->zone_wplugs[idx].info.wp_offset = blk_zone_wp_offset(zone);
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
 	default:
@@ -1240,6 +1366,25 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	return 0;
 }
 
+#define BLK_ZONE_DEFAULT_ACTIVE_WPLUG_NR	128
+
+static int blk_zone_active_wplugs_pool_size(struct gendisk *disk,
+					    unsigned int nr_zones)
+{
+	unsigned int pool_size;
+
+	/*
+	 * Size the disk mempool of active zone write plugs with enough elements
+	 * given the device open and active zone limits. There may be no device
+	 * limits, in which case we use BLK_ZONE_DEFAULT_ACTIVE_WPLUG_NR.
+	 */
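+	/*
+	 * For example, a device with max_open_zones == 128 gets a pool of
+	 * 128 active plugs regardless of its total number of zones, which
+	 * bounds the memory used for write pointer tracking.
+	 */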
+	pool_size = max(disk->max_active_zones, disk->max_open_zones);
+	if (!pool_size)
+		pool_size = BLK_ZONE_DEFAULT_ACTIVE_WPLUG_NR;
+
+	return min(pool_size, nr_zones);
+}
+
 /**
  * blk_revalidate_disk_zones - (re)allocate and initialize zone write plugs
  * @disk:	Target disk
@@ -1260,6 +1405,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	sector_t capacity = get_capacity(disk);
 	struct blk_revalidate_zone_args args = { };
 	unsigned int nr_zones, noio_flag;
+	unsigned int pool_size;
 	int ret = -ENOMEM;
 
 	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
@@ -1297,11 +1443,23 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	if (!args.zone_wplugs)
 		goto out_restore_noio;
 
+	pool_size = blk_zone_active_wplugs_pool_size(disk, nr_zones);
 	if (!disk->zone_wplugs) {
 		mutex_init(&disk->zone_wplugs_mutex);
 		atomic_set(&disk->zone_nr_wplugs_with_error, 0);
 		INIT_DELAYED_WORK(&disk->zone_wplugs_work,
 				  disk_zone_wplugs_work);
+		disk->zone_awplugs_pool_size = pool_size;
+		disk->zone_awplugs_pool =
+			mempool_create_slab_pool(disk->zone_awplugs_pool_size,
+						 blk_zone_active_wplugs_cachep);
+		if (!disk->zone_awplugs_pool)
+			goto out_restore_noio;
+	} else if (disk->zone_awplugs_pool_size != pool_size) {
+		ret = mempool_resize(disk->zone_awplugs_pool, pool_size);
+		if (ret)
+			goto out_restore_noio;
+		disk->zone_awplugs_pool_size = pool_size;
 	}
 
 	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
@@ -1332,19 +1490,20 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 		disk->nr_zones = nr_zones;
 		swap(disk->zone_wplugs, args.zone_wplugs);
 		mutex_unlock(&disk->zone_wplugs_mutex);
+		blk_zone_free_write_plugs(disk, args.zone_wplugs, nr_zones);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
+		blk_zone_free_write_plugs(disk, args.zone_wplugs, nr_zones);
 		disk_free_zone_resources(disk);
 	}
 	blk_mq_unfreeze_queue(q);
 
-	blk_zone_free_write_plugs(disk, args.zone_wplugs, nr_zones);
-
 	return ret;
 
 out_restore_noio:
 	memalloc_noio_restore(noio_flag);
+	blk_zone_free_write_plugs(disk, args.zone_wplugs, nr_zones);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
@@ -1368,3 +1527,11 @@ int queue_zone_plugged_wplugs_show(void *data, struct seq_file *m)
 }
 
 #endif
+
+void blk_zone_dev_init(void)
+{
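+	/*
+	 * This runs once during block layer initialization: use SLAB_PANIC
+	 * as there is no way to continue if this small cache cannot be
+	 * created.
+	 */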
+	blk_zone_active_wplugs_cachep =
+		kmem_cache_create("blk_zone_active_wplug",
+				  sizeof(struct blk_zone_active_wplug), 0,
+				  SLAB_PANIC, NULL);
+}
diff --git a/block/blk.h b/block/blk.h
index 7fbef6bb1aee..7e08bdddb725 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -403,6 +403,7 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
 }
 
 #ifdef CONFIG_BLK_DEV_ZONED
+void blk_zone_dev_init(void);
 void disk_free_zone_resources(struct gendisk *disk);
 static inline bool bio_zone_write_plugging(struct bio *bio)
 {
@@ -436,6 +437,9 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		unsigned int cmd, unsigned long arg);
 #else /* CONFIG_BLK_DEV_ZONED */
+static inline void blk_zone_dev_init(void)
+{
+}
 static inline void disk_free_zone_resources(struct gendisk *disk)
 {
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e96baa552e12..e444dab0bef8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -24,6 +24,8 @@
 #include <linux/sbitmap.h>
 #include <linux/uuid.h>
 #include <linux/xarray.h>
+#include <linux/timekeeping.h>
+#include <linux/mempool.h>
 
 struct module;
 struct request_queue;
@@ -188,6 +190,8 @@ struct gendisk {
 	struct mutex		zone_wplugs_mutex;
 	atomic_t		zone_nr_wplugs_with_error;
 	struct delayed_work	zone_wplugs_work;
+	unsigned int		zone_awplugs_pool_size;
+	mempool_t		*zone_awplugs_pool;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 #if IS_ENABLED(CONFIG_CDROM)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 26/26] block: Add zone_active_wplugs debugfs entry
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (24 preceding siblings ...)
  2024-02-02  7:31 ` [PATCH 25/26] block: Reduce zone write plugging memory usage Damien Le Moal
@ 2024-02-02  7:31 ` Damien Le Moal
  2024-02-04 12:43   ` Hannes Reinecke
  2024-02-02  7:37 ` [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:31 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

Add the zone_active_wplugs debugfs entry to list the zone number and
write pointer offset of zones that have an active zone write plug.

This helps verify that struct blk_zone_active_wplug instances are
reclaimed as zones become empty or full, and allows observing which
zones are being written by the block device user.
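
For example, with two zones being written on a hypothetical device, the
file contents could look as follows (illustrative values; each line
shows a zone number followed by the zone write pointer offset in
sectors):

  12 1024
  57 65536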

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-debugfs.h |  5 +++++
 block/blk-zoned.c      | 27 +++++++++++++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b803f5b370e9..5390526f2ab0 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -161,6 +161,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 	{ "pm_only", 0600, queue_pm_only_show, NULL },
 	{ "state", 0600, queue_state_show, queue_state_write },
 	{ "zone_plugged_wplugs", 0400, queue_zone_plugged_wplugs_show, NULL },
+	{ "zone_active_wplugs", 0400, queue_zone_active_wplugs_show, NULL },
 	{ },
 };
 
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 6d3ac4b77d59..ee0e34345ee7 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -85,11 +85,16 @@ static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 
 #if defined(CONFIG_BLK_DEV_ZONED) && defined(CONFIG_BLK_DEBUG_FS)
 int queue_zone_plugged_wplugs_show(void *data, struct seq_file *m);
+int queue_zone_active_wplugs_show(void *data, struct seq_file *m);
 #else
 static inline int queue_zone_plugged_wplugs_show(void *data, struct seq_file *m)
 {
 	return 0;
 }
+static inline int queue_zone_active_wplugs_show(void *data, struct seq_file *m)
+{
+	return 0;
+}
 #endif
 
 #endif
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 865fc372f25e..c9b31b28b5a2 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1526,6 +1526,33 @@ int queue_zone_plugged_wplugs_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+int queue_zone_active_wplugs_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct gendisk *disk = q->disk;
+	struct blk_zone_wplug *zwplug;
+	unsigned int i, wp_offset;
+	unsigned long flags;
+	bool active;
+
+	if (!disk->zone_wplugs)
+		return 0;
+
+	for (i = 0; i < disk->nr_zones; i++) {
+		zwplug = &disk->zone_wplugs[i];
+		blk_zone_wplug_lock(zwplug, flags);
+		active = zwplug->flags & BLK_ZONE_WPLUG_ACTIVE;
+		if (active)
+			wp_offset = zwplug->zawplug->wp_offset;
+		blk_zone_wplug_unlock(zwplug, flags);
+
+		if (active)
+			seq_printf(m, "%u %u\n", i, wp_offset);
+	}
+
+	return 0;
+}
+
 #endif
 
 void blk_zone_dev_init(void)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (25 preceding siblings ...)
  2024-02-02  7:31 ` [PATCH 26/26] block: Add zone_active_wplugs debugfs entry Damien Le Moal
@ 2024-02-02  7:37 ` Damien Le Moal
  2024-02-03 12:11   ` Jens Axboe
  2024-02-05 17:21 ` Bart Van Assche
  2024-02-05 18:18 ` Bart Van Assche
  28 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-02  7:37 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 16:30, Damien Le Moal wrote:
> The patch series introduces zone write plugging (ZWP) as the new
> mechanism to control the ordering of writes to zoned block devices.
> ZWP replaces zone write locking (ZWL) which is implemented only by
> mq-deadline today. ZWP also allows emulating zone append operations
> using regular writes for zoned devices that do not natively support this
> operation (e.g. SMR HDDs). This patch series removes the scsi disk
> driver and device mapper zone append emulation to use ZWP emulation.
> 
> Unlike ZWL which operates on requests, ZWP operates on BIOs. A zone
> write plug is simply a BIO list that is atomically manipulated using a
> spinlock and a kblockd submission work. A write BIO to a zone is
> "plugged" to delay its execution if a write BIO for the same zone was
> already issued, that is, if a write request for the same zone is being
> executed. The next plugged BIO is unplugged and issued once the write
> request completes.
> 
> This mechanism allows to:
>  - Untangle zone write ordering from the block IO schedulers. This
>    allows removing the restriction on using only mq-deadline for zoned
>    block devices. Any block IO scheduler, including "none" can be used.
>  - Zone write plugging operates on BIOs instead of requests. Plugged
>    BIOs waiting for execution thus do not hold scheduling tags and thus
>    do not prevent other BIOs from being submitted to the device (reads
>    or writes to other zones). Depending on the workload, this can
>    significantly improve the device use and the performance.
>  - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
>    device mapper) can use ZWP. It is mandatory for the
>    former but optional for the latter: BIO-based driver can use zone
>    write plugging to implement write ordering guarantees, or the drivers
>    can implement their own if needed.
>  - The code is less invasive in the block layer and in device drivers.
>    ZWP implementation is mostly limited to blk-zoned.c, with some small
>    changes in blk-mq.c, blk-merge.c and bio.c.
> 
> Performance evaluation results are shown below.
> 
> The series is organized as follows:

I forgot to mention that the patches are against Jens block/for-next branch with
the addition of Christoph's "clean up blk_mq_submit_bio" patches [1] and my
patch "null_blk: Always split BIOs to respect queue limits" [2].

[1] https://lore.kernel.org/linux-block/20240124092658.2258309-1-hch@lst.de/
[2] https://lore.kernel.org/linux-block/20240126005032.1985245-1-dlemoal@kernel.org/


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 03/26] block: Introduce bio_straddle_zones() and bio_offset_from_zone_start()
  2024-02-02  7:30 ` [PATCH 03/26] block: Introduce bio_straddle_zones() and bio_offset_from_zone_start() Damien Le Moal
@ 2024-02-03  4:09   ` Bart Van Assche
  2024-02-04 11:58   ` Hannes Reinecke
  1 sibling, 0 replies; 107+ messages in thread
From: Bart Van Assche @ 2024-02-03  4:09 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/1/24 23:30, Damien Le Moal wrote:
> Implement the inline helper functions bio_straddle_zones() and
> bio_offset_from_zone_start() to respectively test if a BIO crosses a
> zone boundary (the start sector and last sector belong to different
> zones) and to obtain the oofset from a zone starting sector of a BIO.

oofset -> offset
> +static inline bool bio_straddle_zones(struct bio *bio)
> +{
> +	return bio_zone_no(bio) !=
> +		disk_zone_no(bio->bi_bdev->bd_disk, bio_end_sector(bio) - 1);
> +}

It seems to me that the above code is not intended to handle the case
where bi_size == 0, as is the case for an empty flush request. Should a
comment be added above this function or do we perhaps need to add a
WARN_ON_ONCE() statement?
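
A minimal sketch of such a guard (an illustration only, not code from
the posted series) could be:

static inline bool bio_straddle_zones(struct bio *bio)
{
	/* Catch data-less BIOs, e.g. an empty flush. */
	WARN_ON_ONCE(!bio_sectors(bio));
	return bio_zone_no(bio) !=
		disk_zone_no(bio->bi_bdev->bd_disk, bio_end_sector(bio) - 1);
}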

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 05/26] block: Allow using bio_attempt_back_merge() internally
  2024-02-02  7:30 ` [PATCH 05/26] block: Allow using bio_attempt_back_merge() internally Damien Le Moal
@ 2024-02-03  4:11   ` Bart Van Assche
  2024-02-04 12:00   ` Hannes Reinecke
  1 sibling, 0 replies; 107+ messages in thread
From: Bart Van Assche @ 2024-02-03  4:11 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/1/24 23:30, Damien Le Moal wrote:
> Remove the static definition of bio_attempt_back_merge() to allow using
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To me this suggests that the function definition is removed entirely but
that is not what this patch does ...

Otherwise this patch looks good to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-02  7:37 ` [PATCH 00/26] Zone write plugging Damien Le Moal
@ 2024-02-03 12:11   ` Jens Axboe
  2024-02-09  5:28     ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Jens Axboe @ 2024-02-03 12:11 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 12:37 AM, Damien Le Moal wrote:
> On 2/2/24 16:30, Damien Le Moal wrote:
>> The patch series introduces zone write plugging (ZWP) as the new
>> mechanism to control the ordering of writes to zoned block devices.
>> ZWP replaces zone write locking (ZWL) which is implemented only by
>> mq-deadline today. ZWP also allows emulating zone append operations
>> using regular writes for zoned devices that do not natively support this
>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>> driver and device mapper zone append emulation to use ZWP emulation.
>>
>> Unlike ZWL which operates on requests, ZWP operates on BIOs. A zone
>> write plug is simply a BIO list that is atomically manipulated using a
>> spinlock and a kblockd submission work. A write BIO to a zone is
>> "plugged" to delay its execution if a write BIO for the same zone was
>> already issued, that is, if a write request for the same zone is being
>> executed. The next plugged BIO is unplugged and issued once the write
>> request completes.
>>
>> This mechanism allows to:
>>  - Untangle zone write ordering from the block IO schedulers. This
>>    allows removing the restriction on using only mq-deadline for zoned
>>    block devices. Any block IO scheduler, including "none" can be used.
>>  - Zone write plugging operates on BIOs instead of requests. Plugged
>>    BIOs waiting for execution thus do not hold scheduling tags and thus
>>    do not prevent other BIOs from being submitted to the device (reads
>>    or writes to other zones). Depending on the workload, this can
>>    significantly improve the device use and the performance.
>>  - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
>>    device mapper) can use ZWP. It is mandatory for the
>>    former but optional for the latter: BIO-based driver can use zone
>>    write plugging to implement write ordering guarantees, or the drivers
>>    can implement their own if needed.
>>  - The code is less invasive in the block layer and in device drivers.
>>    ZWP implementation is mostly limited to blk-zoned.c, with some small
>>    changes in blk-mq.c, blk-merge.c and bio.c.
>>
>> Performance evaluation results are shown below.
>>
>> The series is organized as follows:
> 
> I forgot to mention that the patches are against Jens block/for-next
> branch with the addition of Christoph's "clean up blk_mq_submit_bio"
> patches [1] and my patch "null_blk: Always split BIOs to respect queue
> limits" [2].

I figured that was the case. I'll get both of these properly set up in a
for-6.9/block branch, just wanted -rc3 to get cut first. JFYI that they
are coming tomorrow.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/26] dm: Use the block layer zone append emulation
  2024-02-02  7:30 ` [PATCH 10/26] dm: Use the block layer zone append emulation Damien Le Moal
@ 2024-02-03 17:58   ` Mike Snitzer
  2024-02-05  5:38     ` Damien Le Moal
  2024-02-04 12:30   ` Hannes Reinecke
  1 sibling, 1 reply; 107+ messages in thread
From: Mike Snitzer @ 2024-02-03 17:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Christoph Hellwig

On Fri, Feb 02 2024 at  2:30P -0500,
Damien Le Moal <dlemoal@kernel.org> wrote:

> For targets requiring zone append operation emulation with regular
> writes (e.g. dm-crypt), we can use the block layer emulation provided by
> zone write plugging. Remove the DM-implemented zone append emulation and
> enable the block layer one.
> 
> This is done by setting the max_zone_append_sectors limit of the
> mapped device queue to 0 for mapped devices that have a target table
> that cannot support native zone append operations. These includes
> mixed zoned and non-zoned targets, or targets that explicitly requested
> emulation of zone append (e.g. dm-crypt). For these mapped devices, the
> new field emulate_zone_append is set to true. dm_split_and_process_bio()
> is modified to call blk_zone_write_plug_bio() for such device to let the
> block layer transform zone append operations into regular writes. This
> is done after ensuring that the submitted BIO is split if it straddles
> zone boundaries.
> 
> dm_revalidate_zones() is also modified to use the block layer provided
> function blk_revalidate_disk_zones() so that all zone resources needed
> for zone append emulation are allocated and initialized by the block
> layer without DM core needing to do anything. Since the device table is
> not yet live when dm_revalidate_zones() is executed, enabling the use of
> blk_revalidate_disk_zones() requires adding a pointer to the device
> table in struct mapped_device. This avoids errors in
> dm_blk_report_zones() trying to get the table with dm_get_live_table().
> The mapped device table pointer is set to the table passed as argument
> to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and
> reset to NULL after this function returns to restore the live table
> handling for user calls of report zones.
> 
> All the code related to zone append emulation is removed from
> dm-zone.c. This leads to simplifications of the functions __map_bio()
> and dm_zone_endio(). This later function now only needs to deal with
> completions of real zone append operations for targets that support it.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

Love the overall improvement to the DM core code and the broader block
layer by switching to this bio-based ZWP approach.

Reviewed-by: Mike Snitzer <snitzer@kernel.org>

But one incremental suggestion inlined below.

> ---
>  drivers/md/dm-core.h |  11 +-
>  drivers/md/dm-zone.c | 470 ++++---------------------------------------
>  drivers/md/dm.c      |  44 ++--
>  drivers/md/dm.h      |   7 -
>  4 files changed, 68 insertions(+), 464 deletions(-)
> 

<snip>

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 8dcabf84d866..92ce3b2eb4ae 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1419,25 +1419,12 @@ static void __map_bio(struct bio *clone)
>  		down(&md->swap_bios_semaphore);
>  	}
>  
> -	if (static_branch_unlikely(&zoned_enabled)) {
> -		/*
> -		 * Check if the IO needs a special mapping due to zone append
> -		 * emulation on zoned target. In this case, dm_zone_map_bio()
> -		 * calls the target map operation.
> -		 */
> -		if (unlikely(dm_emulate_zone_append(md)))
> -			r = dm_zone_map_bio(tio);
> -		else
> -			goto do_map;
> -	} else {
> -do_map:
> -		if (likely(ti->type->map == linear_map))
> -			r = linear_map(ti, clone);
> -		else if (ti->type->map == stripe_map)
> -			r = stripe_map(ti, clone);
> -		else
> -			r = ti->type->map(ti, clone);
> -	}
> +	if (likely(ti->type->map == linear_map))
> +		r = linear_map(ti, clone);
> +	else if (ti->type->map == stripe_map)
> +		r = stripe_map(ti, clone);
> +	else
> +		r = ti->type->map(ti, clone);
>  
>  	switch (r) {
>  	case DM_MAPIO_SUBMITTED:
> @@ -1774,19 +1761,33 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>  	struct clone_info ci;
>  	struct dm_io *io;
>  	blk_status_t error = BLK_STS_OK;
> -	bool is_abnormal;
> +	bool is_abnormal, need_split;
>  
>  	is_abnormal = is_abnormal_io(bio);
> -	if (unlikely(is_abnormal)) {
> +	if (likely(!md->emulate_zone_append))
> +		need_split = is_abnormal;
> +	else
> +		need_split = is_abnormal || bio_straddle_zones(bio);
> +	if (unlikely(need_split)) {
>  		/*
>  		 * Use bio_split_to_limits() for abnormal IO (e.g. discard, etc)
>  		 * otherwise associated queue_limits won't be imposed.
> +		 * Also split the BIO for mapped devices needing zone append
> +		 * emulation to ensure that the BIO does not cross zone
> +		 * boundaries.
>  		 */
>  		bio = bio_split_to_limits(bio);
>  		if (!bio)
>  			return;
>  	}
>  
> +	/*
> +	 * Use the block layer zone write plugging for mapped devices that
> +	 * need zone append emulation (e.g. dm-crypt).
> +	 */
> +	if (md->emulate_zone_append && blk_zone_write_plug_bio(bio, 0))
> +		return;
> +
>  	/* Only support nowait for normal IO */
>  	if (unlikely(bio->bi_opf & REQ_NOWAIT) && !is_abnormal) {
>  		io = alloc_io(md, bio, GFP_NOWAIT);

Would prefer to see this incremental change included from the start:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 92ce3b2eb4ae..1fd9bbf35db3 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1763,11 +1763,10 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	blk_status_t error = BLK_STS_OK;
 	bool is_abnormal, need_split;
 
-	is_abnormal = is_abnormal_io(bio);
-	if (likely(!md->emulate_zone_append))
-		need_split = is_abnormal;
-	else
+	need_split = is_abnormal = is_abnormal_io(bio);
+	if (static_branch_unlikely(&zoned_enabled) && unlikely(md->emulate_zone_append))
 		need_split = is_abnormal || bio_straddle_zones(bio);
+
 	if (unlikely(need_split)) {
 		/*
 		 * Use bio_split_to_limits() for abnormal IO (e.g. discard, etc)
@@ -1781,12 +1780,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 			return;
 	}
 
-	/*
-	 * Use the block layer zone write plugging for mapped devices that
-	 * need zone append emulation (e.g. dm-crypt).
-	 */
-	if (md->emulate_zone_append && blk_zone_write_plug_bio(bio, 0))
-		return;
+	if (static_branch_unlikely(&zoned_enabled)) {
+		/*
+		 * Use the block layer zone write plugging for mapped devices that
+		 * need zone append emulation (e.g. dm-crypt).
+		 */
+		if (unlikely(md->emulate_zone_append) && blk_zone_write_plug_bio(bio, 0))
+			return;
+	}
 
 	/* Only support nowait for normal IO */
 	if (unlikely(bio->bi_opf & REQ_NOWAIT) && !is_abnormal) {
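
(This keeps the md->emulate_zone_append tests behind the zoned_enabled
static key, so non-zoned configurations only ever see a patched-out
branch.)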

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-02  7:30 ` [PATCH 06/26] block: Introduce zone write plugging Damien Le Moal
@ 2024-02-04  3:56   ` Ming Lei
  2024-02-04 23:57     ` Damien Le Moal
  2024-02-04 12:14   ` Hannes Reinecke
  2024-02-05 17:48   ` Bart Van Assche
  2 siblings, 1 reply; 107+ messages in thread
From: Ming Lei @ 2024-02-04  3:56 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig, ming.lei

On Fri, Feb 02, 2024 at 04:30:44PM +0900, Damien Le Moal wrote:
> Zone write plugging implements a per-zone "plug" for write operations to
> tightly control the submission and execution order of writes to
> sequential write required zones of a zoned block device. Per-zone
> plugging of writes guarantees that at any time at most one write request
> per zone is in flight. This mechanism is intended to replace zone write
> locking which is controlled at the scheduler level and implemented only
> by mq-deadline.
> 
> Unlike zone write locking which operates on requests, zone write
> plugging operates on BIOs. A zone write plug is simply a BIO list that
> is atomically manipulated using a spinlock and a kblockd submission
> work. A write BIO to a zone is "plugged" to delay its execution if a
> write BIO for the same zone was already issued, that is, if a write
> request for the same zone is being executed. The next plugged BIO is
> unplugged and issued once the write request completes.
> 
> This mechanism allows to:
>  - Untangles zone write ordering from block IO schedulers. This allows
>    removing the restriction on using only mq-deadline for zoned block
>    devices. Any block IO scheduler, including "none" can be used.
>  - Zone write plugging operates on BIOs instead of requests. Plugged
>    BIOs waiting for execution thus do not hold scheduling tags and thus
>    do not prevent other BIOs from proceeding (reads or writes to other
>    zones). Depending on the workload, this can significantly improve
>    the device use and performance.
>  - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
>    device mapper) can use zone write plugging. It is mandatory for the
>    former but optional for the latter: BIO-based driver can use zone
>    write plugging to implement write ordering guarantees, or the drivers
>    can implement their own if needed.
>  - The code is less invasive in the block layer and is mostly limited to
>    blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
>    bio.c.
> 
> Zone write plugging is implemented using struct blk_zone_wplug. This
> structure includes a spinlock, a BIO list and a work structure to
> handle the submission of plugged BIOs.
> 
> Plugging of zone write BIOs is done using the function
> blk_zone_write_plug_bio() which returns false if a BIO execution does
> not need to be delayed and true otherwise. This function is called
> from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
> spanning multiple zones which would cause mishandling of zone write
> plugging. This enables zone write plugging by default for any mq
> request-based block device. BIO-based device drivers can also use zone
> write plugging by explicitly calling blk_zone_write_plug_bio() in their
> ->submit_bio method. For such devices, the driver must ensure that a
> BIO passed to blk_zone_write_plug_bio() is already split and not
> straddling zone boundaries.
> 
> Only write and write zeroes BIOs are plugged. Zone write plugging does
> not introduce any significant overhead for other operations. A BIO that
> is being handled through zone write plugging is flagged using the new
> BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
> this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
> The completion processing of BIOs and requests flagged trigger
> respectively calls to the functions blk_zone_write_plug_bio_endio() and
> blk_zone_write_plug_complete_request(). The latter function is used to
> trigger submission of the next plugged BIO using the zone plug work.
> blk_zone_write_plug_bio_endio() does the same for BIO-based devices.
> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) is being executed for any zone. The
> handling of zone write plugs using a per-zone plug spinlock maximizes
> parallelism and device usage by allowing multiple zones to be written
> simultaneously without lock contention.
> 
> Zone write plugging ignores flush BIOs without data. However, any flush
> BIO that has data is always plugged so that the write part of the flush
> sequence is serialized with other regular writes.
> 
> Given that any BIO handled through zone write plugging will be the only
> BIO in flight for the target zone when it is executed, the unplugging
> and submission of a BIO will have no chance of successfully merging with
> plugged requests or requests in the scheduler. To overcome this
> potential performance loss, blk_mq_submit_bio() calls the function
> blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs
> with the one just unplugged. Successful merging is signaled using
> blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge().
> Furthermore, to avoid recalculating the number of segments of plugged
> BIOs to attempt merging, the number of segments of a plugged BIO is
> saved using the new struct bio field __bi_nr_segments. To avoid growing
> the size of struct bio, this field is added as a union with the
> bio_cookie field. This is safe to do as polling is always disabled for
> plugged BIOs.
> 
> When BIOs are plugged in a zone write plug, the device request queue
> usage counter is always incremented. This reference is kept and reused
> when the plugged BIO is unplugged and submitted again using
> submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already
> flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds
> directly to allocating a new request for the BIO, re-using the usage
> reference count taken when the BIO was plugged. This extra reference
> count is dropped in blk_zone_write_plug_attempt_merge() for any plugged
> BIO that is successfully merged. Given that BIO-based devices will not
> take this path, the extra reference is dropped when a plugged BIO is
> unplugged and submitted.
> 
> To match the new data structures used for zoned disks, the function
> disk_free_zone_bitmaps() is renamed to the more generic
> disk_free_zone_resources().
> 
> This commit contains contributions from Christoph Hellwig <hch@lst.de>.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>  block/bio.c               |   7 +
>  block/blk-merge.c         |  11 +
>  block/blk-mq.c            |  28 +++
>  block/blk-zoned.c         | 408 +++++++++++++++++++++++++++++++++++++-
>  block/blk.h               |  32 ++-
>  block/genhd.c             |   2 +-
>  include/linux/blk-mq.h    |   2 +
>  include/linux/blk_types.h |   8 +-
>  include/linux/blkdev.h    |   8 +
>  9 files changed, 496 insertions(+), 10 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index b9642a41f286..c8b0f7e8c713 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1581,6 +1581,13 @@ void bio_endio(struct bio *bio)
>  	if (!bio_integrity_endio(bio))
>  		return;
>  
> +	/*
> +	 * For BIOs handled through a zone write plugs, signal the end of the
> +	 * BIO to the zone write plug to submit the next plugged BIO.
> +	 */
> +	if (bio_zone_write_plugging(bio))
> +		blk_zone_write_plug_bio_endio(bio);
> +
>  	rq_qos_done_bio(bio);
>  
>  	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index a1ef61b03e31..2b5489cd9c65 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -377,6 +377,7 @@ struct bio *__bio_split_to_limits(struct bio *bio,
>  		blkcg_bio_issue_init(split);
>  		bio_chain(split, bio);
>  		trace_block_split(split, bio->bi_iter.bi_sector);
> +		WARN_ON_ONCE(bio_zone_write_plugging(bio));
>  		submit_bio_noacct(bio);
>  		return split;
>  	}
> @@ -980,6 +981,9 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
>  
>  	blk_update_mixed_merge(req, bio, false);
>  
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		blk_zone_write_plug_bio_merged(bio);
> +
>  	req->biotail->bi_next = bio;
>  	req->biotail = bio;
>  	req->__data_len += bio->bi_iter.bi_size;
> @@ -995,6 +999,13 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
>  {
>  	const blk_opf_t ff = bio_failfast(bio);
>  
> +	/*
> +	 * A front merge for zone writes can happen only if the user submitted
> +	 * writes out of order. Do not attempt the merge, to let the write fail.
> +	 */
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		return BIO_MERGE_FAILED;
> +
>  	if (!ll_front_merge_fn(req, bio, nr_segs))
>  		return BIO_MERGE_FAILED;
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f02e486a02ae..aa49bebf1199 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -830,6 +830,9 @@ static void blk_complete_request(struct request *req)
>  		bio = next;
>  	} while (bio);
>  
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		blk_zone_write_plug_complete_request(req);
> +
>  	/*
>  	 * Reset counters so that the request stacking driver
>  	 * can find how many bytes remain in the request
> @@ -943,6 +946,9 @@ bool blk_update_request(struct request *req, blk_status_t error,
>  	 * completely done
>  	 */
>  	if (!req->bio) {
> +		if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +			blk_zone_write_plug_complete_request(req);
> +
>  		/*
>  		 * Reset counters so that the request stacking driver
>  		 * can find how many bytes remain in the request
> @@ -2975,6 +2981,17 @@ void blk_mq_submit_bio(struct bio *bio)
>  	struct request *rq;
>  	blk_status_t ret;
>  
> +	/*
> +	 * A BIO that was released form a zone write plug has already been
> +	 * through the preparation in this function, already holds a reference
> +	 * on the queue usage counter, and is the only write BIO in-flight for
> +	 * the target zone. Go straight to allocating a request for it.
> +	 */
> +	if (bio_zone_write_plugging(bio)) {
> +		nr_segs = bio->__bi_nr_segments;
> +		goto new_request;
> +	}
> +
>  	bio = blk_queue_bounce(bio, q);
>  	bio_set_ioprio(bio);
>  
> @@ -3001,7 +3018,11 @@ void blk_mq_submit_bio(struct bio *bio)
>  	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>  		goto queue_exit;
>  
> +	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
> +		goto queue_exit;
> +
>  	if (!rq) {
> +new_request:
>  		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
>  		if (unlikely(!rq))
>  			goto queue_exit;
> @@ -3017,8 +3038,12 @@ void blk_mq_submit_bio(struct bio *bio)
>  
>  	ret = blk_crypto_rq_get_keyslot(rq);
>  	if (ret != BLK_STS_OK) {
> +		bool zwplugging = bio_zone_write_plugging(bio);
> +
>  		bio->bi_status = ret;
>  		bio_endio(bio);
> +		if (zwplugging)
> +			blk_zone_write_plug_complete_request(rq);
>  		blk_mq_free_request(rq);
>  		return;
>  	}
> @@ -3026,6 +3051,9 @@ void blk_mq_submit_bio(struct bio *bio)
>  	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
>  		return;
>  
> +	if (bio_zone_write_plugging(bio))
> +		blk_zone_write_plug_attempt_merge(rq);
> +
>  	if (plug) {
>  		blk_add_rq_to_plug(plug, rq);
>  		return;
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index d343e5756a9c..f6d4f511b664 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -7,11 +7,11 @@
>   *
>   * Copyright (c) 2016, Damien Le Moal
>   * Copyright (c) 2016, Western Digital
> + * Copyright (c) 2024, Western Digital Corporation or its affiliates.
>   */
>  
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> -#include <linux/rbtree.h>
>  #include <linux/blkdev.h>
>  #include <linux/blk-mq.h>
>  #include <linux/mm.h>
> @@ -19,6 +19,7 @@
>  #include <linux/sched/mm.h>
>  
>  #include "blk.h"
> +#include "blk-mq-sched.h"
>  
>  #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
>  static const char *const zone_cond_name[] = {
> @@ -33,6 +34,27 @@ static const char *const zone_cond_name[] = {
>  };
>  #undef ZONE_COND_NAME
>  
> +/*
> + * Per-zone write plug.
> + */
> +struct blk_zone_wplug {
> +	spinlock_t		lock;
> +	unsigned int		flags;
> +	struct bio_list		bio_list;
> +	struct work_struct	bio_work;
> +};
> +
> +/*
> + * Zone write plug flags bits:
> + *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
> + *    to these zones are never plugged.
> + *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
> + *    that is, that write BIOs are being throttled due to a write BIO already
> + *    being executed or the zone write plug bio list is not empty.
> + */
> +#define BLK_ZONE_WPLUG_CONV	(1U << 0)
> +#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)

BLK_ZONE_WPLUG_PLUGGED == !bio_list_empty(&zwplug->bio_list), so it looks
like this flag isn't necessary.

> +
>  /**
>   * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
>   * @zone_cond: BLK_ZONE_COND_XXX.
> @@ -429,12 +451,374 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
>  	return ret;
>  }
>  
> -void disk_free_zone_bitmaps(struct gendisk *disk)
> +#define blk_zone_wplug_lock(zwplug, flags) \
> +	spin_lock_irqsave(&zwplug->lock, flags)
> +
> +#define blk_zone_wplug_unlock(zwplug, flags) \
> +	spin_unlock_irqrestore(&zwplug->lock, flags)
> +
> +static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +
> +	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +	bio_io_error(bio);
> +	blk_queue_exit(q);
> +}
> +
> +static int blk_zone_wplug_abort(struct gendisk *disk,
> +				struct blk_zone_wplug *zwplug)
> +{
> +	struct bio *bio;
> +	int nr_aborted = 0;
> +
> +	while ((bio = bio_list_pop(&zwplug->bio_list))) {
> +		blk_zone_wplug_bio_io_error(bio);
> +		nr_aborted++;
> +	}
> +
> +	return nr_aborted;
> +}
> +
> +/*
> + * Return the zone write plug for sector in sequential write required zone.
> + * Given that conventional zones have no write ordering constraints, NULL is
> + * returned for sectors in conventional zones, to indicate that zone write
> + * plugging is not needed.
> + */
> +static inline struct blk_zone_wplug *
> +disk_lookup_zone_wplug(struct gendisk *disk, sector_t sector)
> +{
> +	struct blk_zone_wplug *zwplug;
> +
> +	if (WARN_ON_ONCE(!disk->zone_wplugs))
> +		return NULL;
> +
> +	zwplug = &disk->zone_wplugs[disk_zone_no(disk, sector)];
> +	if (zwplug->flags & BLK_ZONE_WPLUG_CONV)
> +		return NULL;
> +	return zwplug;
> +}
> +
> +static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
> +{
> +	return disk_lookup_zone_wplug(bio->bi_bdev->bd_disk,
> +				      bio->bi_iter.bi_sector);
> +}
> +
> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> +					  struct bio *bio, unsigned int nr_segs)
> +{
> +	/*
> +	 * Keep a reference on the BIO request queue usage. This reference will
> +	 * be dropped either if the BIO is failed or after it is issued and
> +	 * completes.
> +	 */
> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);

It is fragile to take nested references on the usage counter, and the same
goes for grabbing/releasing it from different contexts or even different
functions. It could be much better to just let the block layer maintain it.

From patch 23's change:

+	 * Zoned block device information. Reads of this information must be
+	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this

Any time there is an in-flight bio, the block device is open, so both the
gendisk and the request_queue are live. So I am not sure this
.q_usage_counter protection is needed.

+	 * information is only allowed while no requests are being processed.
+	 * See also blk_mq_freeze_queue() and blk_mq_unfreeze_queue().
 	 */

> +
> +	/*
> +	 * The BIO is being plugged and thus will have to wait for the on-going
> +	 * write and for all other writes already plugged. So polling makes
> +	 * no sense.
> +	 */
> +	bio_clear_polled(bio);
> +
> +	/*
> +	 * Reuse the poll cookie field to store the number of segments when
> +	 * split to the hardware limits.
> +	 */
> +	bio->__bi_nr_segments = nr_segs;
> +
> +	/*
> +	 * We always receive BIOs after they are split and ready to be issued.
> +	 * The block layer passes the parts of a split BIO in order, and the
> +	 * user must also issue write sequentially. So simply add the new BIO
> +	 * at the tail of the list to preserve the sequential write order.
> +	 */
> +	bio_list_add(&zwplug->bio_list, bio);
> +}
> +
> +/*
> + * Called from bio_attempt_back_merge() when a BIO was merged with a request.
> + */
> +void blk_zone_write_plug_bio_merged(struct bio *bio)
> +{
> +	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +}
> +
> +/*
> + * Attempt to merge plugged BIOs with a newly formed request of a BIO that went
> + * through zone write plugging (either a new BIO or one that was unplugged).
> + */
> +void blk_zone_write_plug_attempt_merge(struct request *req)
> +{
> +	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(req->bio);
> +	sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
> +	struct request_queue *q = req->q;
> +	unsigned long flags;
> +	struct bio *bio;
> +
> +	/*
> +	 * Completion of this request needs to be handled with
> +	 * blk_zone_write_complete_request().
> +	 */
> +	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
> +
> +	if (blk_queue_nomerges(q))
> +		return;
> +
> +	/*
> +	 * Walk through the list of plugged BIOs to check if they can be merged
> +	 * into the back of the request.
> +	 */
> +	blk_zone_wplug_lock(zwplug, flags);
> +	while ((bio = bio_list_peek(&zwplug->bio_list))) {
> +		if (bio->bi_iter.bi_sector != req_back_sector ||
> +		    !blk_rq_merge_ok(req, bio))
> +			break;
> +
> +		WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE_ZEROES &&
> +			     !bio->__bi_nr_segments);
> +
> +		bio_list_pop(&zwplug->bio_list);
> +		if (bio_attempt_back_merge(req, bio, bio->__bi_nr_segments) !=
> +		    BIO_MERGE_OK) {
> +			bio_list_add_head(&zwplug->bio_list, bio);
> +			break;
> +		}
> +
> +		/*
> +		 * Drop the extra reference on the queue usage we got when
> +		 * plugging the BIO.
> +		 */
> +		blk_queue_exit(q);
> +
> +		req_back_sector += bio_sectors(bio);
> +	}
> +	blk_zone_wplug_unlock(zwplug, flags);
> +}
> +
> +static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
> +{
> +	struct blk_zone_wplug *zwplug;
> +	unsigned long flags;
> +
> +	/*
> +	 * BIOs must be fully contained within a zone so that we use the correct
> +	 * zone write plug for the entire BIO. For blk-mq devices, the block
> +	 * layer should already have done any splitting required to ensure this
> +	 * and this BIO should thus not be straddling zone boundaries. For
> +	 * BIO-based devices, it is the responsibility of the driver to split
> +	 * the bio before submitting it.
> +	 */
> +	if (WARN_ON_ONCE(bio_straddle_zones(bio))) {
> +		bio_io_error(bio);
> +		return true;
> +	}
> +
> +	zwplug = bio_lookup_zone_wplug(bio);
> +	if (!zwplug)
> +		return false;
> +
> +	blk_zone_wplug_lock(zwplug, flags);
> +
> +	/* Indicate that this BIO is being handled using zone write plugging. */
> +	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +
> +	/*
> +	 * If the zone is already plugged, add the BIO to the plug BIO list.
> +	 * Otherwise, plug and let the BIO execute.
> +	 */
> +	if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
> +		blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
> +		blk_zone_wplug_unlock(zwplug, flags);
> +		return true;
> +	}
> +
> +	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> +
> +	blk_zone_wplug_unlock(zwplug, flags);
> +
> +	return false;
> +}
> +
> +/**
> + * blk_zone_write_plug_bio - Handle a zone write BIO with zone write plugging
> + * @bio: The BIO being submitted
> + *
> + * Handle write and write zeroes operations using zone write plugging.
> + * Return true whenever @bio execution needs to be delayed through the zone
> + * write plug. Otherwise, return false to let the submission path process
> + * @bio normally.
> + */
> +bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
> +{
> +	if (!bio->bi_bdev->bd_disk->zone_wplugs)
> +		return false;
> +
> +	/*
> +	 * If the BIO already has the plugging flag set, then it was already
> +	 * handled through this path and this is a submission from the zone
> +	 * plug bio submit work.
> +	 */
> +	if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
> +		return false;
> +
> +	/*
> +	 * We do not need to do anything special for empty flush BIOs, e.g.
> +	 * BIOs such as issued by blkdev_issue_flush(). This is because it is
> +	 * the responsibility of the user to first wait for the completion of
> +	 * write operations for flush to have any effect on the persistence of
> +	 * the written data.
> +	 */
> +	if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
> +		return false;
> +
> +	/*
> +	 * Regular writes and write zeroes need to be handled through the target
> +	 * zone write plug. This includes writes with REQ_FUA | REQ_PREFLUSH
> +	 * which may need to go through the flush machinery depending on the
> +	 * target device capabilities. Plugging such writes is fine as the flush
> +	 * machinery operates at the request level, below the plug, and
> +	 * completion of the flush sequence will go through the regular BIO
> +	 * completion, which will handle zone write plugging.
> +	 */
> +	switch (bio_op(bio)) {
> +	case REQ_OP_WRITE:
> +	case REQ_OP_WRITE_ZEROES:
> +		return blk_zone_wplug_handle_write(bio, nr_segs);
> +	default:
> +		return false;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(blk_zone_write_plug_bio);
> +
> +static void blk_zone_write_plug_unplug_bio(struct blk_zone_wplug *zwplug)
> +{
> +	unsigned long flags;
> +
> +	blk_zone_wplug_lock(zwplug, flags);
> +
> +	/* Schedule submission of the next plugged BIO if we have one. */
> +	if (!bio_list_empty(&zwplug->bio_list))
> +		kblockd_schedule_work(&zwplug->bio_work);
> +	else
> +		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
> +
> +	blk_zone_wplug_unlock(zwplug, flags);
> +}
> +
> +void blk_zone_write_plug_bio_endio(struct bio *bio)
> +{
> +	/* Make sure we do not see this BIO again by clearing the plug flag. */
> +	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +
> +	/*
> +	 * For BIO-based devices, blk_zone_write_plug_complete_request()
> +	 * is not called. So we need to schedule execution of the next
> +	 * plugged BIO here.
> +	 */
> +	if (bio->bi_bdev->bd_has_submit_bio) {
> +		struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
> +
> +		blk_zone_write_plug_unplug_bio(zwplug);
> +	}
> +}
> +
> +void blk_zone_write_plug_complete_request(struct request *req)
> +{
> +	struct gendisk *disk = req->q->disk;
> +	struct blk_zone_wplug *zwplug =
> +		disk_lookup_zone_wplug(disk, req->__sector);
> +
> +	req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;
> +
> +	blk_zone_write_plug_unplug_bio(zwplug);
> +}
> +
> +static void blk_zone_wplug_bio_work(struct work_struct *work)
> +{
> +	struct blk_zone_wplug *zwplug =
> +		container_of(work, struct blk_zone_wplug, bio_work);
> +	unsigned long flags;
> +	struct bio *bio;
> +
> +	/*
> +	 * Unplug and submit the next plugged BIO. If we do not have any, clear
> +	 * the plugged flag.
> +	 */
> +	blk_zone_wplug_lock(zwplug, flags);
> +
> +	bio = bio_list_pop(&zwplug->bio_list);
> +	if (!bio) {
> +		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
> +		blk_zone_wplug_unlock(zwplug, flags);
> +		return;
> +	}
> +
> +	blk_zone_wplug_unlock(zwplug, flags);
> +
> +	/*
> +	 * blk-mq devices will reuse the reference on the request queue usage
> +	 * we took when the BIO was plugged, but the submission path for
> +	 * BIO-based devices will not do that. So drop this reference here.
> +	 */
> +	if (bio->bi_bdev->bd_has_submit_bio)
> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);

But I don't see where this reference is reused for blk-mq in this patch,
care to point it out?


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 01/26] block: Restore sector of flush requests
  2024-02-02  7:30 ` [PATCH 01/26] block: Restore sector of flush requests Damien Le Moal
@ 2024-02-04 11:55   ` Hannes Reinecke
  2024-02-05 17:22   ` Bart Van Assche
  1 sibling, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 11:55 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> On completion of a flush sequence, blk_flush_restore_request() restores
> the bio of a request to the original submitted BIO. However, the last
> use of the request in the flush sequence may have been for a POSTFLUSH
> which does not have a sector. So make sure to restore the request sector
> using the iter sector of the original BIO. This BIO has not changed yet
> since the intermediate steps of the flush sequence complete by
> requeueing the request until all steps are completed.
> 
> Restoring the request sector ensures that blk_mq_end_request() will see
> a valid sector as originally set when the flush BIO was submitted.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-flush.c | 1 +
>   1 file changed, 1 insertion(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 02/26] block: Remove req_bio_endio()
  2024-02-02  7:30 ` [PATCH 02/26] block: Remove req_bio_endio() Damien Le Moal
@ 2024-02-04 11:57   ` Hannes Reinecke
  2024-02-05 17:28   ` Bart Van Assche
  1 sibling, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 11:57 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Moving req_bio_endio() code into its only caller, blk_update_request(),
> allows reducing accesses to and tests of bio and request fields. Also,
> given that partial completions of zone append operations are not
> possible and that zone append operations cannot be merged, the update
> of the BIO sector using the request sector for these operations can be
> moved directly before the call to bio_endio().
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-mq.c | 66 ++++++++++++++++++++++++--------------------------
>   1 file changed, 31 insertions(+), 35 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 03/26] block: Introduce bio_straddle_zones() and bio_offset_from_zone_start()
  2024-02-02  7:30 ` [PATCH 03/26] block: Introduce bio_straddle_zones() and bio_offset_from_zone_start() Damien Le Moal
  2024-02-03  4:09   ` Bart Van Assche
@ 2024-02-04 11:58   ` Hannes Reinecke
  1 sibling, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 11:58 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Implement the inline helper functions bio_straddle_zones() and
> bio_offset_from_zone_start() to respectively test if a BIO crosses a
> zone boundary (the start sector and last sector belong to different
> zones) and to obtain the offset of a BIO from the start sector of its zone.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   include/linux/blkdev.h | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
> 
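For the record, my mental model of the two helpers, as a sketch (not the
actual hunk; assuming the existing disk_zone_no() and
bdev_zone_sectors() helpers, and the power-of-2 zone sizes that the
block layer requires):

	static inline bool bio_straddle_zones(struct bio *bio)
	{
		struct gendisk *disk = bio->bi_bdev->bd_disk;

		return disk_zone_no(disk, bio->bi_iter.bi_sector) !=
		       disk_zone_no(disk, bio_end_sector(bio) - 1);
	}

	static inline sector_t bio_offset_from_zone_start(struct bio *bio)
	{
		/* Zone sizes are a power of 2, so masking works. */
		return bio->bi_iter.bi_sector &
			(bdev_zone_sectors(bio->bi_bdev) - 1);
	}

If that matches the patch, all good.
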
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 04/26] block: Introduce blk_zone_complete_request_bio()
  2024-02-02  7:30 ` [PATCH 04/26] block: Introduce blk_zone_complete_request_bio() Damien Le Moal
@ 2024-02-04 11:59   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 11:59 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> On completion of a zone append request, the request sector indicates the
> location of the written data. This value must be returned to the user
> through the BIO iter sector. This is done in 2 places: in
> blk_complete_request() and in req_bio_endio(). Introduce the inline
> helper function blk_zone_complete_request_bio() to avoid duplicating
> this BIO update for zone append requests, and to compile out this
> helper call when CONFIG_BLK_DEV_ZONED is not enabled.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-mq.c | 11 +++++------
>   block/blk.h    | 19 ++++++++++++++++++-
>   2 files changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index bfebb8fcd248..f02e486a02ae 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -822,11 +822,11 @@ static void blk_complete_request(struct request *req)
>   		/* Completion has already been traced */
>   		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
>   
> -		if (req_op(req) == REQ_OP_ZONE_APPEND)
> -			bio->bi_iter.bi_sector = req->__sector;
> -
> -		if (!is_flush)
> +		if (!is_flush) {
> +			blk_zone_complete_request_bio(req, bio);
>   			bio_endio(bio);
> +		}
> +
>   		bio = next;
>   	} while (bio);
>   
> @@ -928,8 +928,7 @@ bool blk_update_request(struct request *req, blk_status_t error,
>   
>   		/* Don't actually finish bio if it's part of flush sequence */
>   		if (!bio->bi_iter.bi_size && !is_flush) {
> -			if (req_op(req) == REQ_OP_ZONE_APPEND)
> -				bio->bi_iter.bi_sector = req->__sector;
> +			blk_zone_complete_request_bio(req, bio);
>   			bio_endio(bio);
>   		}
>   
> diff --git a/block/blk.h b/block/blk.h
> index 913c93838a01..23f76b452e70 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -396,12 +396,29 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
>   
>   #ifdef CONFIG_BLK_DEV_ZONED
>   void disk_free_zone_bitmaps(struct gendisk *disk);
> +static inline void blk_zone_complete_request_bio(struct request *rq,
> +						 struct bio *bio)
> +{
> +	/*
> +	 * For zone append requests, the request sector indicates the location
> +	 * at which the BIO data was written. Return this value to the BIO
> +	 * issuer through the BIO iter sector.
> +	 */
> +	if (req_op(rq) == REQ_OP_ZONE_APPEND)
> +		bio->bi_iter.bi_sector = rq->__sector;
> +}
>   int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
>   		unsigned long arg);
>   int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
>   		unsigned int cmd, unsigned long arg);
>   #else /* CONFIG_BLK_DEV_ZONED */
> -static inline void disk_free_zone_bitmaps(struct gendisk *disk) {}
> +static inline void disk_free_zone_bitmaps(struct gendisk *disk)
> +{
> +}
> +static inline void blk_zone_complete_request_bio(struct request *rq,
> +						 struct bio *bio)
> +{
> +}
>   static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
>   		unsigned int cmd, unsigned long arg)
>   {

Well, it doesn't _actually_ complete the request or the bio.
What about blk_zone_update_request_bio()?
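I.e. the same body, just a name matching what it does (sketch):

	static inline void blk_zone_update_request_bio(struct request *rq,
						       struct bio *bio)
	{
		/*
		 * For zone append requests, the request sector indicates
		 * the location at which the BIO data was written. Return
		 * this value to the BIO issuer through the BIO iter sector.
		 */
		if (req_op(rq) == REQ_OP_ZONE_APPEND)
			bio->bi_iter.bi_sector = rq->__sector;
	}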

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 05/26] block: Allow using bio_attempt_back_merge() internally
  2024-02-02  7:30 ` [PATCH 05/26] block: Allow using bio_attempt_back_merge() internally Damien Le Moal
  2024-02-03  4:11   ` Bart Van Assche
@ 2024-02-04 12:00   ` Hannes Reinecke
  1 sibling, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:00 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Remove the static definition of bio_attempt_back_merge() to allow using
> this function internally from other block layer files. Add the
> definition of enum bio_merge_status and the declaration of
> bio_attempt_back_merge() to block/blk.h.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-merge.c | 8 +-------
>   block/blk.h       | 8 ++++++++
>   2 files changed, 9 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-02  7:30 ` [PATCH 06/26] block: Introduce zone write plugging Damien Le Moal
  2024-02-04  3:56   ` Ming Lei
@ 2024-02-04 12:14   ` Hannes Reinecke
  2024-02-05 17:48   ` Bart Van Assche
  2 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:14 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Zone write plugging implements a per-zone "plug" for write operations to
> tightly control the submission and execution order of writes to
> sequential write required zones of a zoned block device. Per-zone
> plugging of writes guarantees that at any time at most one write request
> per zone is in flight. This mechanism is intended to replace zone write
> locking which is controlled at the scheduler level and implemented only
> by mq-deadline.
> 
> Unlike zone write locking which operates on requests, zone write
> plugging operates on BIOs. A zone write plug is simply a BIO list that
> is atomically manipulated using a spinlock and a kblockd submission
> work. A write BIO to a zone is "plugged" to delay its execution if a
> write BIO for the same zone was already issued, that is, if a write
> request for the same zone is being executed. The next plugged BIO is
> unplugged and issued once the write request completes.
> 
> This mechanism makes it possible to:
>   - Untangle zone write ordering from block IO schedulers. This allows
>     removing the restriction on using only mq-deadline for zoned block
>     devices. Any block IO scheduler, including "none" can be used.
>   - Zone write plugging operates on BIOs instead of requests. Plugged
>     BIOs waiting for execution thus do not hold scheduling tags and
>     do not prevent other BIOs from proceeding (reads or writes to other
>     zones). Depending on the workload, this can significantly improve
>     device utilization and performance.
>   - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
>     device mapper) can use zone write plugging. It is mandatory for the
>     former but optional for the latter: BIO-based driver can use zone
>     write plugging to implement write ordering guarantees, or the drivers
>     can implement their own if needed.
>   - The code is less invasive in the block layer and is mostly limited to
>     blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
>     bio.c.
> 
> Zone write plugging is implemented using struct blk_zone_wplug. This
> structure includes a spinlock, a BIO list and a work structure to
> handle the submission of plugged BIOs.
> 
> Plugging of zone write BIOs is done using the function
> blk_zone_write_plug_bio() which returns false if a BIO execution does
> not need to be delayed and true otherwise. This function is called
> from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
> spanning multiple zones which would cause mishandling of zone write
> plugging. This enables zone write plugging by default for any mq
> request-based block device. BIO-based device drivers can also use zone
> write plugging by explicitly calling blk_zone_write_plug_bio() in their
> ->submit_bio method. For such devices, the driver must ensure that a
> BIO passed to blk_zone_write_plug_bio() is already split and not
> straddling zone boundaries.
> 
> Only write and write zeroes BIOs are plugged. Zone write plugging does
> not introduce any significant overhead for other operations. A BIO that
> is being handled through zone write plugging is flagged using the new
> BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
> this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
> The completion processing of flagged BIOs and requests respectively
> triggers calls to the functions blk_zone_write_plug_bio_endio() and
> blk_zone_write_plug_complete_request(). The latter function is used to
> trigger submission of the next plugged BIO using the zone plug work.
> blk_zone_write_plug_bio_endio() does the same for BIO-based devices.
> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) is being executed for any zone. The
> handling of zone write plugs using a per-zone spinlock maximizes
> parallelism and device usage by allowing multiple zones to be written
> simultaneously without lock contention.
> 
> Zone write plugging ignores flush BIOs without data. However, any flush
> BIO that has data is always plugged so that the write part of the flush
> sequence is serialized with other regular writes.
> 
> Given that any BIO handled through zone write plugging will be the only
> BIO in flight for the target zone when it is executed, an unplugged and
> submitted BIO will have no chance of successfully merging with other
> plugged BIOs or requests in the scheduler. To overcome this
> potential performance loss, blk_mq_submit_bio() calls the function
> blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs
> with the one just unplugged. Successful merging is signaled using
> blk_zone_write_plug_bio_merged(), called from bio_attempt_back_merge().
> Furthermore, to avoid recalculating the number of segments of plugged
> BIOs to attempt merging, the number of segments of a plugged BIO is
> saved using the new struct bio field __bi_nr_segments. To avoid growing
> the size of struct bio, this field is added as a union with the
> bi_cookie field. This is safe to do as polling is always disabled for
> plugged BIOs.
> 
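So, if I read this right, the struct bio side of it boils down to
something like (sketch, not the actual hunk):

	union {
		blk_qc_t	bi_cookie;	  /* Polling cookie. */
		unsigned int	__bi_nr_segments; /* Plugged BIOs only. */
	};

which is safe since bio_clear_polled() is called on any plugged BIO.
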
> When BIOs are plugged in a zone write plug, the device request queue
> usage counter is always incremented. This reference is kept and reused
> when the plugged BIO is unplugged and submitted again using
> submit_bio_noacct_nocheck(). For this case, the unplugged BIO is already
> flagged with BIO_ZONE_WRITE_PLUGGING and blk_mq_submit_bio() proceeds
> directly to allocating a new request for the BIO, re-using the usage
> reference count taken when the BIO was plugged. This extra reference
> count is dropped in blk_zone_write_plug_attempt_merge() for any plugged
> BIO that is successfully merged. Given that BIO-based devices will not
> take this path, the extra reference is dropped when a plugged BIO is
> unplugged and submitted.
> 
> To match the new data structures used for zoned disks, the function
> disk_free_zone_bitmaps() is renamed to the more generic
> disk_free_zone_resources().
> 
> This commit contains contributions from Christoph Hellwig <hch@lst.de>.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/bio.c               |   7 +
>   block/blk-merge.c         |  11 +
>   block/blk-mq.c            |  28 +++
>   block/blk-zoned.c         | 408 +++++++++++++++++++++++++++++++++++++-
>   block/blk.h               |  32 ++-
>   block/genhd.c             |   2 +-
>   include/linux/blk-mq.h    |   2 +
>   include/linux/blk_types.h |   8 +-
>   include/linux/blkdev.h    |   8 +
>   9 files changed, 496 insertions(+), 10 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index b9642a41f286..c8b0f7e8c713 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1581,6 +1581,13 @@ void bio_endio(struct bio *bio)
>   	if (!bio_integrity_endio(bio))
>   		return;
>   
> +	/*
> +	 * For BIOs handled through a zone write plug, signal the end of the
> +	 * BIO to the zone write plug to submit the next plugged BIO.
> +	 */
> +	if (bio_zone_write_plugging(bio))
> +		blk_zone_write_plug_bio_endio(bio);
> +
>   	rq_qos_done_bio(bio);
>   
>   	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index a1ef61b03e31..2b5489cd9c65 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -377,6 +377,7 @@ struct bio *__bio_split_to_limits(struct bio *bio,
>   		blkcg_bio_issue_init(split);
>   		bio_chain(split, bio);
>   		trace_block_split(split, bio->bi_iter.bi_sector);
> +		WARN_ON_ONCE(bio_zone_write_plugging(bio));
>   		submit_bio_noacct(bio);
>   		return split;
>   	}
> @@ -980,6 +981,9 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
>   
>   	blk_update_mixed_merge(req, bio, false);
>   
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		blk_zone_write_plug_bio_merged(bio);
> +
>   	req->biotail->bi_next = bio;
>   	req->biotail = bio;
>   	req->__data_len += bio->bi_iter.bi_size;
> @@ -995,6 +999,13 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
>   {
>   	const blk_opf_t ff = bio_failfast(bio);
>   
> +	/*
> +	 * A front merge for zone writes can happen only if the user submitted
> +	 * writes out of order. Do not attempt the merge and let the write fail.
> +	 */
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		return BIO_MERGE_FAILED;
> +
>   	if (!ll_front_merge_fn(req, bio, nr_segs))
>   		return BIO_MERGE_FAILED;
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f02e486a02ae..aa49bebf1199 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -830,6 +830,9 @@ static void blk_complete_request(struct request *req)
>   		bio = next;
>   	} while (bio);
>   
> +	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +		blk_zone_write_plug_complete_request(req);
> +
>   	/*
>   	 * Reset counters so that the request stacking driver
>   	 * can find how many bytes remain in the request
> @@ -943,6 +946,9 @@ bool blk_update_request(struct request *req, blk_status_t error,
>   	 * completely done
>   	 */
>   	if (!req->bio) {
> +		if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
> +			blk_zone_write_plug_complete_request(req);
> +
>   		/*
>   		 * Reset counters so that the request stacking driver
>   		 * can find how many bytes remain in the request
> @@ -2975,6 +2981,17 @@ void blk_mq_submit_bio(struct bio *bio)
>   	struct request *rq;
>   	blk_status_t ret;
>   
> +	/*
> +	 * A BIO that was released from a zone write plug has already been
> +	 * through the preparation in this function, already holds a reference
> +	 * on the queue usage counter, and is the only write BIO in-flight for
> +	 * the target zone. Go straight to allocating a request for it.
> +	 */
> +	if (bio_zone_write_plugging(bio)) {
> +		nr_segs = bio->__bi_nr_segments;
> +		goto new_request;
> +	}
> +
>   	bio = blk_queue_bounce(bio, q);
>   	bio_set_ioprio(bio);
>   
> @@ -3001,7 +3018,11 @@ void blk_mq_submit_bio(struct bio *bio)
>   	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>   		goto queue_exit;
>   
> +	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
> +		goto queue_exit;
> +
>   	if (!rq) {
> +new_request:
>   		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
>   		if (unlikely(!rq))
>   			goto queue_exit;
> @@ -3017,8 +3038,12 @@ void blk_mq_submit_bio(struct bio *bio)
>   
>   	ret = blk_crypto_rq_get_keyslot(rq);
>   	if (ret != BLK_STS_OK) {
> +		bool zwplugging = bio_zone_write_plugging(bio);
> +
>   		bio->bi_status = ret;
>   		bio_endio(bio);
> +		if (zwplugging)
> +			blk_zone_write_plug_complete_request(rq);
>   		blk_mq_free_request(rq);
>   		return;
>   	}
> @@ -3026,6 +3051,9 @@ void blk_mq_submit_bio(struct bio *bio)
>   	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
>   		return;
>   
> +	if (bio_zone_write_plugging(bio))
> +		blk_zone_write_plug_attempt_merge(rq);
> +
>   	if (plug) {
>   		blk_add_rq_to_plug(plug, rq);
>   		return;
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index d343e5756a9c..f6d4f511b664 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -7,11 +7,11 @@
>    *
>    * Copyright (c) 2016, Damien Le Moal
>    * Copyright (c) 2016, Western Digital
> + * Copyright (c) 2024, Western Digital Corporation or its affiliates.
>    */
>   
>   #include <linux/kernel.h>
>   #include <linux/module.h>
> -#include <linux/rbtree.h>
>   #include <linux/blkdev.h>
>   #include <linux/blk-mq.h>
>   #include <linux/mm.h>
> @@ -19,6 +19,7 @@
>   #include <linux/sched/mm.h>
>   
>   #include "blk.h"
> +#include "blk-mq-sched.h"
>   
>   #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
>   static const char *const zone_cond_name[] = {
> @@ -33,6 +34,27 @@ static const char *const zone_cond_name[] = {
>   };
>   #undef ZONE_COND_NAME
>   
> +/*
> + * Per-zone write plug.
> + */
> +struct blk_zone_wplug {
> +	spinlock_t		lock;
> +	unsigned int		flags;
> +	struct bio_list		bio_list;
> +	struct work_struct	bio_work;
> +};
> +
> +/*
> + * Zone write plug flags bits:
> + *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
> + *    to these zones are never plugged.
> + *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
> + *    that is, that write BIOs are being throttled due to a write BIO already
> + *    being executed or the zone write plug bio list is not empty.
> + */
> +#define BLK_ZONE_WPLUG_CONV	(1U << 0)
> +#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
> +
>   /**
>    * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
>    * @zone_cond: BLK_ZONE_COND_XXX.
> @@ -429,12 +451,374 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
>   	return ret;
>   }
>   
> -void disk_free_zone_bitmaps(struct gendisk *disk)
> +#define blk_zone_wplug_lock(zwplug, flags) \
> +	spin_lock_irqsave(&zwplug->lock, flags)
> +
> +#define blk_zone_wplug_unlock(zwplug, flags) \
> +	spin_unlock_irqrestore(&zwplug->lock, flags)
> +
> +static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +
> +	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +	bio_io_error(bio);
> +	blk_queue_exit(q);
> +}
> +
> +static int blk_zone_wplug_abort(struct gendisk *disk,
> +				struct blk_zone_wplug *zwplug)
> +{
> +	struct bio *bio;
> +	int nr_aborted = 0;
> +
> +	while ((bio = bio_list_pop(&zwplug->bio_list))) {
> +		blk_zone_wplug_bio_io_error(bio);
> +		nr_aborted++;
> +	}
> +
> +	return nr_aborted;
> +}
> +
> +/*
> + * Return the zone write plug for a sector in a sequential write required zone.
> + * Given that conventional zones have no write ordering constraints, NULL is
> + * returned for sectors in conventional zones, to indicate that zone write
> + * plugging is not needed.
> + */
> +static inline struct blk_zone_wplug *
> +disk_lookup_zone_wplug(struct gendisk *disk, sector_t sector)
> +{
> +	struct blk_zone_wplug *zwplug;
> +
> +	if (WARN_ON_ONCE(!disk->zone_wplugs))
> +		return NULL;
> +
> +	zwplug = &disk->zone_wplugs[disk_zone_no(disk, sector)];
> +	if (zwplug->flags & BLK_ZONE_WPLUG_CONV)
> +		return NULL;
> +	return zwplug;
> +}
> +
> +static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
> +{
> +	return disk_lookup_zone_wplug(bio->bi_bdev->bd_disk,
> +				      bio->bi_iter.bi_sector);
> +}
> +
> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> +					  struct bio *bio, unsigned int nr_segs)
> +{
> +	/*
> +	 * Keep a reference on the BIO request queue usage. This reference will
> +	 * be dropped either if the BIO is failed or after it is issued and
> +	 * completes.
> +	 */
> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> +
As discussed, wouldn't it be sufficient to increase the q_usage_counter 
only for the plug itself, and not the bios?
References for the bios are already taken, and I think we would want to
use a separate reference here as the 'plug' has a different lifetime
than the bios which are added to it.
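
Roughly something like this is what I have in mind (untested sketch,
reusing the names from this patch): take a single reference when the
plug transitions to the plugged state,

	if (!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)) {
		percpu_ref_get(&disk->queue->q_usage_counter);
		zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
	}

and drop it only once the plug runs empty,

	if (bio_list_empty(&zwplug->bio_list)) {
		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
		blk_queue_exit(disk->queue);
	}

though I do see that the per-BIO reference is also what keeps the
resubmission path for each individual BIO alive, so maybe this does
not work out.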

> +	/*
> +	 * The BIO is being plugged and thus will have to wait for the on-going
> +	 * write and for all other writes already plugged. So polling makes
> +	 * no sense.
> +	 */
> +	bio_clear_polled(bio);
> +
> +	/*
> +	 * Reuse the poll cookie field to store the number of segments when
> +	 * split to the hardware limits.
> +	 */
> +	bio->__bi_nr_segments = nr_segs;
> +
> +	/*
> +	 * We always receive BIOs after they are split and ready to be issued.
> +	 * The block layer passes the parts of a split BIO in order, and the
> +	 * user must also issue write sequentially. So simply add the new BIO
> +	 * user must also issue writes sequentially. So simply add the new BIO
> +	 */
> +	bio_list_add(&zwplug->bio_list, bio);
> +}
> +
> +/*
> + * Called from bio_attempt_back_merge() when a BIO was merged with a request.
> + */
> +void blk_zone_write_plug_bio_merged(struct bio *bio)
> +{
> +	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +}
> +
> +/*
> + * Attempt to merge plugged BIOs with a request newly formed from a BIO that
> + * went through zone write plugging (either a new BIO or one that was
> + * unplugged).
> + */
> +void blk_zone_write_plug_attempt_merge(struct request *req)
> +{
> +	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(req->bio);
> +	sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
> +	struct request_queue *q = req->q;
> +	unsigned long flags;
> +	struct bio *bio;
> +
> +	/*
> +	 * Completion of this request needs to be handled with
> +	 * blk_zone_write_plug_complete_request().
> +	 */
> +	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
> +
> +	if (blk_queue_nomerges(q))
> +		return;
> +
> +	/*
> +	 * Walk through the list of plugged BIOs to check if they can be merged
> +	 * into the back of the request.
> +	 */
> +	blk_zone_wplug_lock(zwplug, flags);
> +	while ((bio = bio_list_peek(&zwplug->bio_list))) {
> +		if (bio->bi_iter.bi_sector != req_back_sector ||
> +		    !blk_rq_merge_ok(req, bio))
> +			break;
> +
> +		WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE_ZEROES &&
> +			     !bio->__bi_nr_segments);
> +
> +		bio_list_pop(&zwplug->bio_list);
> +		if (bio_attempt_back_merge(req, bio, bio->__bi_nr_segments) !=
> +		    BIO_MERGE_OK) {
> +			bio_list_add_head(&zwplug->bio_list, bio);
> +			break;
> +		}
> +
> +		/*
> +		 * Drop the extra reference on the queue usage we got when
> +		 * plugging the BIO.
> +		 */
> +		blk_queue_exit(q);
> +
> +		req_back_sector += bio_sectors(bio);
> +	}
> +	blk_zone_wplug_unlock(zwplug, flags);
> +}

And that's the other thing with which I'm slightly uncomfortable.
We're replicating parts of the generic merging code here.
It would be far better from a maintenance standpoint if we had only one 
place where we deal with bio merging.
But I do see the challenge, so this is more of a reminder than something
which needs to be fixed.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 07/26] block: Allow zero value of max_zone_append_sectors queue limit
  2024-02-02  7:30 ` [PATCH 07/26] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
@ 2024-02-04 12:15   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:15 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> In preparation for adding a generic zone append emulation using zone
> write plugging, allow device drivers supporting zoned block devices to
> set the max_zone_append_sectors queue limit of a device to 0 to
> indicate the lack of native support for zone append operations and that
> the block layer should emulate these operations using regular write
> operations.
> 
> blk_queue_max_zone_append_sectors() is modified to allow passing 0 as
> the max_zone_append_sectors argument. The function
> queue_max_zone_append_sectors() is also modified to ensure that the
> minimum of the max_sectors and chunk_sectors limit is used whenever the
> max_zone_append_sectors limit is 0.
> 
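So, if I read this right, the effective limit becomes something like
(sketch of my understanding, not the actual hunk):

	static inline unsigned int
	queue_max_zone_append_sectors(struct request_queue *q)
	{
		unsigned int max = min(q->limits.chunk_sectors,
				       q->limits.max_sectors);

		/* 0 means emulated: fall back to the write limits. */
		return min_not_zero(q->limits.max_zone_append_sectors, max);
	}
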
> The helper functions queue_emulates_zone_append() and
> bdev_emulates_zone_append() are added to test if a queue (or block
> device) emulates zone append operations.
> 
> In order for blk_revalidate_disk_zones() to accept zoned block devices
> relying on zone append emulation, the direct check to the
> max_zone_append_sectors queue limit of the disk is replaced by a check
> using the value returned by queue_max_zone_append_sectors(). Similarly,
> queue_zone_append_max_show() is modified to use the same accessor so
> that the sysfs attribute advertises the non-zero limit that will be
> used, regardless of whether it is for native or emulated commands.
> 
> For stacking drivers, a top device should not need to care if the
> underlying devices have native or emulated zone append operations.
> blk_stack_limits() is thus modified to set the top device
> max_zone_append_sectors limit using the new accessor
> queue_limits_max_zone_append_sectors(). queue_max_zone_append_sectors()
> is modified to use this function as well. Stacking drivers that require
> zone append emulation, e.g. dm-crypt, can still request this feature by
> calling blk_queue_max_zone_append_sectors() with a 0 limit.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-core.c       |  2 +-
>   block/blk-settings.c   | 35 +++++++++++++++++++++++------------
>   block/blk-sysfs.c      |  2 +-
>   block/blk-zoned.c      |  2 +-
>   include/linux/blkdev.h | 20 +++++++++++++++++---
>   5 files changed, 43 insertions(+), 18 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/26] block: Implement zone append emulation
  2024-02-02  7:30 ` [PATCH 08/26] block: Implement zone append emulation Damien Le Moal
@ 2024-02-04 12:24   ` Hannes Reinecke
  2024-02-05  0:10     ` Damien Le Moal
  2024-02-05 17:58   ` Bart Van Assche
  1 sibling, 1 reply; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:24 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Given that zone write plugging manages all writes to zones of a zoned
> block device, we can track the write pointer position of all zones in
> order to implement zone append emulation using regular write operations.
> This is needed for devices that do not natively support the zone append
> command, e.g. SMR hard-disks.
> 
> This commit adds zone write pointer tracking similarly to how the SCSI
> disk driver (sd) does, that is, in the form of a 32-bit number of
> sectors equal to the offset within the zone of the zone write pointer.
> The wp_offset field is added to struct blk_zone_wplug for this. Write
> pointer tracking is only enabled for zoned devices that requested
> zone append emulation by setting the max_zone_append_sectors queue
> limit of the disk to 0.
> 
> For zoned devices that requested zone append emulation, wp_offset is
> managed as follows:
>   - It is incremented when a write BIO is prepared for submission or
>     merged into a new request. This is done in
>     blk_zone_wplug_prepare_bio() when a BIO is unplugged, in
>     blk_zone_write_plug_bio_merged() when a new unplugged BIO is merged
>     before zone write plugging and in blk_zone_write_plug_attempt_merge()
>     when plugged BIOs are merged into a new request.
>   - The helper functions blk_zone_wplug_handle_reset_or_finish() and
>     blk_zone_wplug_handle_reset_all() are added to set the write pointer
>     offset to 0 for the targeted zones of REQ_OP_ZONE_RESET and
>     REQ_OP_ZONE_RESET_ALL operations.
>   - blk_zone_wplug_handle_reset_or_finish() is also used to set the
>     write pointer offset to the zone size for the target zone of a
>     REQ_OP_ZONE_FINISH operation.
> 
> The function blk_zone_wplug_prepare_bio() also checks and prepares a BIO
> for submission. Preparation involves changing zone append BIOs into
> non-mergeable regular write BIOs for devices that require zone append
> emulation. Modified zone append BIOs are flagged with the new BIO flag
> BIO_EMULATES_ZONE_APPEND. This flag is checked on completion of the
> BIO in blk_zone_complete_requests_bio() to restore the original
> REQ_OP_ZONE_APPEND operation code of the BIO.
> 
> If a write error happens, the wp_offset value may become incorrect and
> out of sync with the device-managed write pointer. This is handled using
> the new zone write plug flag BLK_ZONE_WPLUG_ERROR. The function
> blk_zone_wplug_handle_error() is called from the new disk zone write
> plug work when this flag is set. This function executes a report zone to
> update the zone write pointer offset to the current value as indicated
> by the device. The disk zone write plug work is scheduled whenever a BIO
> flagged with BIO_ZONE_WRITE_PLUGGING completes with an error or when
> blk_zone_wplug_prepare_bio() detects an unaligned write. Once scheduled,
> the disk zone write plug work keeps running until all zone errors are
> handled.
> 
> The block layer internal inline helper function bio_is_zone_append() is
> added to test if a BIO is either a native zone append operation
> (REQ_OP_ZONE_APPEND operation code) or if it is flagged with
> BIO_EMULATES_ZONE_APPEND. Given that both native and emulated zone
> append BIO completion handling should be similar, the functions
> blk_update_request() and blk_zone_complete_request_bio() are modified to
> use bio_is_zone_append() to execute blk_zone_complete_request_bio() for
> both native and emulated zone append operations.
> 
> This commit contains contributions from Christoph Hellwig <hch@lst.de>.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-mq.c            |   2 +-
>   block/blk-zoned.c         | 457 ++++++++++++++++++++++++++++++++++++--
>   block/blk.h               |  14 +-
>   include/linux/blk_types.h |   1 +
>   include/linux/blkdev.h    |   3 +
>   5 files changed, 452 insertions(+), 25 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index aa49bebf1199..a112298a6541 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -909,7 +909,7 @@ bool blk_update_request(struct request *req, blk_status_t error,
>   
>   		if (bio_bytes == bio->bi_iter.bi_size) {
>   			req->bio = bio->bi_next;
> -		} else if (req_op(req) == REQ_OP_ZONE_APPEND) {
> +		} else if (bio_is_zone_append(bio)) {
>   			/*
>   			 * Partial zone append completions cannot be supported
>   			 * as the BIO fragments may end up not being written
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 661ef61ca3b1..929c28796c41 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -42,6 +42,8 @@ struct blk_zone_wplug {
>   	unsigned int		flags;
>   	struct bio_list		bio_list;
>   	struct work_struct	bio_work;
> +	unsigned int		wp_offset;
> +	unsigned int		capacity;
>   };
>   
>   /*
> @@ -51,9 +53,12 @@ struct blk_zone_wplug {
>    *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
>    *    that is, that write BIOs are being throttled due to a write BIO already
>    *    being executed or the zone write plug bio list is not empty.
> + *  - BLK_ZONE_WPLUG_ERROR: Indicate that a write error happened which will be
> + *    recovered with a report zone to update the zone write pointer offset.
>    */
>   #define BLK_ZONE_WPLUG_CONV	(1U << 0)
>   #define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
> +#define BLK_ZONE_WPLUG_ERROR	(1U << 2)
>   
>   /**
>    * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
> @@ -480,6 +485,28 @@ static int blk_zone_wplug_abort(struct gendisk *disk,
>   	return nr_aborted;
>   }
>   
> +static void blk_zone_wplug_abort_unaligned(struct gendisk *disk,
> +					   struct blk_zone_wplug *zwplug)
> +{
> +	unsigned int wp_offset = zwplug->wp_offset;
> +	struct bio_list bl = BIO_EMPTY_LIST;
> +	struct bio *bio;
> +
> +	while ((bio = bio_list_pop(&zwplug->bio_list))) {
> +		if (wp_offset >= zwplug->capacity ||
> +		    (bio_op(bio) != REQ_OP_ZONE_APPEND &&
> +		     bio_offset_from_zone_start(bio) != wp_offset)) {
> +			blk_zone_wplug_bio_io_error(bio);
> +			continue;
> +		}
> +
> +		wp_offset += bio_sectors(bio);
> +		bio_list_add(&bl, bio);
> +	}
> +
> +	bio_list_merge(&zwplug->bio_list, &bl);
> +}
> +
>   /*
>    * Return the zone write plug for a sector in a sequential write required zone.
>    * Given that conventional zones have no write ordering constraints, NULL is
> @@ -506,6 +533,87 @@ static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
>   				      bio->bi_iter.bi_sector);
>   }
>   
> +/*
> + * Set a zone write plug write pointer offset to either 0 (zone reset case)
> + * or to the zone size (zone finish case). This aborts all plugged BIOs, which
> + * is fine since a zone reset or zone finish while writes are in-flight
> + * is a user error which will most likely cause all plugged BIOs to
> + * fail anyway.
> + */
> +static void blk_zone_wplug_set_wp_offset(struct gendisk *disk,
> +					 struct blk_zone_wplug *zwplug,
> +					 unsigned int wp_offset)
> +{
> +	/*
> +	 * Updating the write pointer offset puts back the zone
> +	 * in a good state. So clear the error flag and decrement the
> +	 * error count if we were in error state.
> +	 */
> +	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
> +		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
> +		atomic_dec(&disk->zone_nr_wplugs_with_error);
> +	}
> +
> +	/* Update the zone write pointer and abort all plugged BIOs. */
> +	zwplug->wp_offset = wp_offset;
> +	blk_zone_wplug_abort(disk, zwplug);
> +}
> +
> +static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
> +						  unsigned int wp_offset)
> +{
> +	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
> +	unsigned long flags;
> +
> +	/* Conventional zones cannot be reset nor finished. */
> +	if (!zwplug) {
> +		bio_io_error(bio);
> +		return true;
> +	}
> +
> +	if (!bdev_emulates_zone_append(bio->bi_bdev))
> +		return false;
> +
> +	/*
> +	 * Set the zone write pointer offset to 0 (reset case) or to the
> +	 * zone size (finish case). This will abort all BIOs plugged for the
> +	 * target zone. It is fine as resetting or finishing zones while writes
> +	 * are still in-flight will result in the writes failing anyway.
> +	 */
> +	blk_zone_wplug_lock(zwplug, flags);
> +	blk_zone_wplug_set_wp_offset(bio->bi_bdev->bd_disk, zwplug, wp_offset);
> +	blk_zone_wplug_unlock(zwplug, flags);
> +
> +	return false;
> +}
> +
> +static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
> +{
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
> +	struct blk_zone_wplug *zwplug = &disk->zone_wplugs[0];
> +	unsigned long flags;
> +	unsigned int i;
> +
> +	if (!bdev_emulates_zone_append(bio->bi_bdev))
> +		return false;
> +
> +	/*
> +	 * Set the write pointer offset of all zones to 0. This will abort all
> +	 * plugged BIOs. It is fine as resetting zones while writes are still
> +	 * in-flight will result in the writes failing anyway.
> +	 */
> +	for (i = 0; i < disk->nr_zones; i++, zwplug++) {
> +		/* Ignore conventional zones. */
> +		if (zwplug->flags & BLK_ZONE_WPLUG_CONV)
> +			continue;
> +		blk_zone_wplug_lock(zwplug, flags);
> +		blk_zone_wplug_set_wp_offset(disk, zwplug, 0);
> +		blk_zone_wplug_unlock(zwplug, flags);
> +	}
> +
> +	return false;
> +}
> +
>   static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>   					  struct bio *bio, unsigned int nr_segs)
>   {
> @@ -543,7 +651,26 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>    */
>   void blk_zone_write_plug_bio_merged(struct bio *bio)
>   {
> +	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
> +	unsigned long flags;
> +
> +	/*
> +	 * If the BIO was already plugged, then we were called through
> +	 * blk_zone_write_plug_attempt_merge() -> blk_attempt_bio_merge().
> +	 * For this case, blk_zone_write_plug_attempt_merge() will handle the
> +	 * zone write pointer offset update.
> +	 */
> +	if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
> +		return;
> +
> +	blk_zone_wplug_lock(zwplug, flags);
> +
>   	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> +
> +	/* Advance the zone write pointer offset. */
> +	zwplug->wp_offset += bio_sectors(bio);
> +
> +	blk_zone_wplug_unlock(zwplug, flags);
>   }
>   
>   /*
> @@ -572,7 +699,8 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
>   	 * into the back of the request.
>   	 */
>   	blk_zone_wplug_lock(zwplug, flags);
> -	while ((bio = bio_list_peek(&zwplug->bio_list))) {
> +	while (zwplug->wp_offset < zwplug->capacity &&
> +	       (bio = bio_list_peek(&zwplug->bio_list))) {
>   		if (bio->bi_iter.bi_sector != req_back_sector ||
>   		    !blk_rq_merge_ok(req, bio))
>   			break;
> @@ -589,15 +717,86 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
>   
>   		/*
>   		 * Drop the extra reference on the queue usage we got when
> -		 * plugging the BIO.
> +		 * plugging the BIO and advance the write pointer offset.
>   		 */
>   		blk_queue_exit(q);
> +		zwplug->wp_offset += bio_sectors(bio);
>   
>   		req_back_sector += bio_sectors(bio);
>   	}
>   	blk_zone_wplug_unlock(zwplug, flags);
>   }
>   
> +static inline void blk_zone_wplug_set_error(struct gendisk *disk,
> +					    struct blk_zone_wplug *zwplug)
> +{
> +	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR)) {
> +		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
> +		atomic_inc(&disk->zone_nr_wplugs_with_error);
> +	}
> +}
> +
> +/*
> + * Prepare a zone write bio for submission by incrementing the write pointer and
> + * setting up the zone append emulation if needed.
> + */
> +static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
> +				       struct bio *bio)
> +{
> +	/*
> +	 * If we do not need to emulate zone append, zone write pointer offset
> +	 * tracking is not necessary and we have nothing to do.
> +	 */
> +	if (!bdev_emulates_zone_append(bio->bi_bdev))
> +		return true;
> +
> +	/*
> +	 * Check that the user is not attempting to write to a full zone.
> +	 * We know such BIO will fail, and that would potentially overflow our
> +	 * We know such a BIO will fail, and that would potentially overflow our
> +	 * directed at the following zone.
> +	 */
> +	if (zwplug->wp_offset >= zwplug->capacity)
> +		goto err;
> +
> +	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +		/*
> +		 * Use a regular write starting at the current write pointer.
> +		 * Similarly to native zone append operations, do not allow
> +		 * merging.
> +		 */
> +		bio->bi_opf &= ~REQ_OP_MASK;
> +		bio->bi_opf |= REQ_OP_WRITE | REQ_NOMERGE;
> +		bio->bi_iter.bi_sector += zwplug->wp_offset;
> +
> +		/*
> +		 * Remember that this BIO is in fact a zone append operation
> +		 * so that we can restore its operation code on completion.
> +		 */
> +		bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
> +	} else {
> +		/*
> +		 * Check for non-sequential writes early because we avoid a
> +		 * whole lot of error handling trouble if we don't send it off
> +		 * to the driver.
> +		 */
> +		if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
> +			goto err;
> +	}
> +
> +	/* Advance the zone write pointer offset. */
> +	zwplug->wp_offset += bio_sectors(bio);
> +
> +	return true;
> +
> +err:
> +	/* We detected an invalid write BIO: schedule error recovery. */
> +	blk_zone_wplug_set_error(bio->bi_bdev->bd_disk, zwplug);
> +	kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND,
> +				&bio->bi_bdev->bd_disk->zone_wplugs_work, 0);
> +	return false;
> +}
> +
>   static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>   {
>   	struct blk_zone_wplug *zwplug;
> @@ -617,8 +816,17 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>   	}
>   
>   	zwplug = bio_lookup_zone_wplug(bio);
> -	if (!zwplug)
> +	if (!zwplug) {
> +		/*
> +		 * Zone append operations to conventional zones are not
> +		 * allowed.
> +		 */
> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +			bio_io_error(bio);
> +			return true;
> +		}
>   		return false;
> +	}
>   
>   	blk_zone_wplug_lock(zwplug, flags);
>   
> @@ -626,34 +834,48 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>   	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
>   
>   	/*
> -	 * If the zone is already plugged, add the BIO to the plug BIO list.
> -	 * Otherwise, plug and let the BIO execute.
> +	 * If the zone is already plugged or has a pending error, add the BIO
> +	 * to the plug BIO list. Otherwise, plug and let the BIO execute.
>   	 */
> -	if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
> -		blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
> -		blk_zone_wplug_unlock(zwplug, flags);
> -		return true;
> -	}
> +	if (zwplug->flags & (BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR))
> +		goto plug;
> +
> +	/*
> +	 * If an error is detected when preparing the BIO, add it to the BIO
> +	 * list so that error recovery can deal with it.
> +	 */
> +	if (!blk_zone_wplug_prepare_bio(zwplug, bio))
> +		goto plug;
>   
>   	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
>   
>   	blk_zone_wplug_unlock(zwplug, flags);
>   
>   	return false;
> +
> +plug:
> +	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> +	blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
> +
> +	blk_zone_wplug_unlock(zwplug, flags);
> +
> +	return true;
>   }
>   
>   /**
>    * blk_zone_write_plug_bio - Handle a zone write BIO with zone write plugging
>    * @bio: The BIO being submitted
>    *
> - * Handle write and write zeroes operations using zone write plugging.
> - * Return true whenever @bio execution needs to be delayed through the zone
> - * write plug. Otherwise, return false to let the submission path process
> - * @bio normally.
> + * Handle write, write zeroes and zone append operations requiring emulation
> + * using zone write plugging. Return true whenever @bio execution needs to be
> + * delayed through the zone write plug. Otherwise, return false to let the
> + * submission path process @bio normally.
>    */
>   bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
>   {
> -	if (!bio->bi_bdev->bd_disk->zone_wplugs)
> +	struct block_device *bdev = bio->bi_bdev;
> +
> +	if (!bdev->bd_disk->zone_wplugs)
>   		return false;
>   
>   	/*
> @@ -682,11 +904,30 @@ bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
>   	 * machinery operates at the request level, below the plug, and
>   	 * completion of the flush sequence will go through the regular BIO
>   	 * completion, which will handle zone write plugging.
> +	 * Zone append operations that need emulation must also be plugged so
> +	 * that these operations can be changed into regular writes.
> +	 * Zone reset, reset all and finish commands need special treatment
> +	 * to correctly track the write pointer offset of zones when zone
> +	 * append emulation is needed. These commands are not plugged as we do
> +	 * not need serialization with write and append operations. It is the
> +	 * responsibility of the user to not issue reset and finish commands
> +	 * when write operations are in flight.
>   	 */
>   	switch (bio_op(bio)) {
> +	case REQ_OP_ZONE_APPEND:
> +		if (!bdev_emulates_zone_append(bdev))
> +			return false;
> +		fallthrough;
>   	case REQ_OP_WRITE:
>   	case REQ_OP_WRITE_ZEROES:
>   		return blk_zone_wplug_handle_write(bio, nr_segs);
> +	case REQ_OP_ZONE_RESET:
> +		return blk_zone_wplug_handle_reset_or_finish(bio, 0);
> +	case REQ_OP_ZONE_FINISH:
> +		return blk_zone_wplug_handle_reset_or_finish(bio,
> +						bdev_zone_sectors(bdev));
> +	case REQ_OP_ZONE_RESET_ALL:
> +		return blk_zone_wplug_handle_reset_all(bio);
>   	default:
>   		return false;
>   	}
> @@ -695,12 +936,24 @@ bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
>   }
>   EXPORT_SYMBOL_GPL(blk_zone_write_plug_bio);
>   
> -static void blk_zone_write_plug_unplug_bio(struct blk_zone_wplug *zwplug)
> +static void blk_zone_write_plug_unplug_bio(struct gendisk *disk,
> +					   struct blk_zone_wplug *zwplug)
>   {
>   	unsigned long flags;
>   
>   	blk_zone_wplug_lock(zwplug, flags);
>   
> +	/*
> +	 * If we had an error, schedule error recovery. The recovery work
> +	 * will restart submission of plugged BIOs.
> +	 */
> +	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
> +		blk_zone_wplug_unlock(zwplug, flags);
> +		kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND,
> +					    &disk->zone_wplugs_work, 0);
> +		return;
> +	}
> +
>   	/* Schedule submission of the next plugged BIO if we have one. */
>   	if (!bio_list_empty(&zwplug->bio_list))
>   		kblockd_schedule_work(&zwplug->bio_work);
> @@ -712,19 +965,35 @@ static void blk_zone_write_plug_unplug_bio(struct blk_zone_wplug *zwplug)
>   
>   void blk_zone_write_plug_bio_endio(struct bio *bio)
>   {
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
> +	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
> +
>   	/* Make sure we do not see this BIO again by clearing the plug flag. */
>   	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
>   
> +	/*
> +	 * If this is a regular write emulating a zone append operation,
> +	 * restore the original operation code.
> +	 */
> +	if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
> +		bio->bi_opf &= ~REQ_OP_MASK;
> +		bio->bi_opf |= REQ_OP_ZONE_APPEND;
> +	}
> +
> +	/*
> +	 * If the BIO failed, mark the plug as having an error to trigger
> +	 * recovery.
> +	 */
> +	if (bio->bi_status != BLK_STS_OK)
> +		blk_zone_wplug_set_error(disk, zwplug);
> +
>   	/*
>   	 * For BIO-based devices, blk_zone_write_plug_complete_request()
>   	 * is not called. So we need to schedule execution of the next
>   	 * plugged BIO here.
>   	 */
> -	if (bio->bi_bdev->bd_has_submit_bio) {
> -		struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
> -
> -		blk_zone_write_plug_unplug_bio(zwplug);
> -	}
> +	if (bio->bi_bdev->bd_has_submit_bio)
> +		blk_zone_write_plug_unplug_bio(disk, zwplug);
>   }
>   
>   void blk_zone_write_plug_complete_request(struct request *req)
> @@ -735,7 +1004,7 @@ void blk_zone_write_plug_complete_request(struct request *req)
>   
>   	req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;
>   
> -	blk_zone_write_plug_unplug_bio(zwplug);
> +	blk_zone_write_plug_unplug_bio(disk, zwplug);
>   }
>   
>   static void blk_zone_wplug_bio_work(struct work_struct *work)
> @@ -758,6 +1027,13 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
>   		return;
>   	}
>   
> +	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
> +		/* Error recovery will decide what to do with the BIO. */
> +		bio_list_add_head(&zwplug->bio_list, bio);
> +		blk_zone_wplug_unlock(zwplug, flags);
> +		return;
> +	}
> +
>   	blk_zone_wplug_unlock(zwplug, flags);
>   
>   	/*
> @@ -771,6 +1047,120 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
>   	submit_bio_noacct_nocheck(bio);
>   }
>   
> +static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
> +{
> +	switch (zone->cond) {
> +	case BLK_ZONE_COND_IMP_OPEN:
> +	case BLK_ZONE_COND_EXP_OPEN:
> +	case BLK_ZONE_COND_CLOSED:
> +		return zone->wp - zone->start;
> +	case BLK_ZONE_COND_FULL:
> +		return zone->len;
> +	case BLK_ZONE_COND_EMPTY:
> +		return 0;
> +	case BLK_ZONE_COND_NOT_WP:
> +	case BLK_ZONE_COND_OFFLINE:
> +	case BLK_ZONE_COND_READONLY:
> +	default:
> +		/*
> +		 * Conventional, offline and read-only zones do not have a valid
> +		 * write pointer.
> +		 */
> +		return UINT_MAX;
> +	}
> +}
> +
> +static int blk_zone_wplug_get_zone_cb(struct blk_zone *zone,
> +				      unsigned int idx, void *data)
> +{
> +	struct blk_zone *zonep = data;
> +
> +	*zonep = *zone;
> +	return 0;
> +}
> +
> +static void blk_zone_wplug_handle_error(struct gendisk *disk,
> +					struct blk_zone_wplug *zwplug)
> +{
> +	unsigned int zno = zwplug - disk->zone_wplugs;
> +	sector_t zone_start_sector = bdev_zone_sectors(disk->part0) * zno;
> +	unsigned int noio_flag;
> +	struct blk_zone zone;
> +	unsigned long flags;
> +	int ret;
> +
> +	/* Check if we have an error and clear it if we do. */
> +	blk_zone_wplug_lock(zwplug, flags);
> +	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
> +		goto unlock;
> +	zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
> +	atomic_dec(&disk->zone_nr_wplugs_with_error);
> +	blk_zone_wplug_unlock(zwplug, flags);
> +

Don't you need to quiesce the drive here?
After all, I/O (or a zone reset) might be executed after the call to
report zones, but before we re-take the plug lock, no?
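
If a full quiesce is too heavy, maybe a per-plug generation counter that
reset/finish bumps would do, so that the error work can detect that its
zone report is stale (hypothetical sketch; there is no 'wp_gen' field in
this patch):

	/* Under zwplug->lock, in blk_zone_wplug_set_wp_offset(): */
	zwplug->wp_gen++;

	/* In blk_zone_wplug_handle_error(), before dropping the lock: */
	gen = zwplug->wp_gen;

	/* And after re-taking the lock following report_zones(): */
	if (zwplug->wp_gen != gen)
		/* Zone state changed under us, the report is stale. */
		goto unlock;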

> +	/* Get the current zone information from the device. */
> +	noio_flag = memalloc_noio_save();
> +	ret = disk->fops->report_zones(disk, zone_start_sector, 1,
> +				       blk_zone_wplug_get_zone_cb, &zone);
> +	memalloc_noio_restore(noio_flag);
> +
> +	blk_zone_wplug_lock(zwplug, flags);
> +
> +	if (ret != 1) {
> +		/*
> +		 * We failed to get the zone information, likely meaning that
> +		 * something is really wrong with the device. Abort all
> +		 * remaining plugged BIOs as otherwise we could end up waiting
> +		 * forever on plugged BIOs to complete if there is an ongoing
> +		 * revalidate or queue freeze.
> +		 */
> +		blk_zone_wplug_abort(disk, zwplug);
> +		goto unplug;
> +	}
> +
> +	/* Update the zone capacity and write pointer offset. */
> +	zwplug->wp_offset = blk_zone_wp_offset(&zone);
> +	zwplug->capacity = zone.capacity;
> +
> +	blk_zone_wplug_abort_unaligned(disk, zwplug);
> +
> +	/* Restart BIO submission if we still have any BIO left. */
> +	if (!bio_list_empty(&zwplug->bio_list)) {
> +		WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
> +		kblockd_schedule_work(&zwplug->bio_work);
> +		goto unlock;
> +	}
> +
> +unplug:
> +	zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
> +
> +unlock:
> +	blk_zone_wplug_unlock(zwplug, flags);
> +}
> +
> +static void disk_zone_wplugs_work(struct work_struct *work)
> +{
> +	struct gendisk *disk =
> +		container_of(work, struct gendisk, zone_wplugs_work.work);
> +	struct blk_zone_wplug *zwplug;
> +	unsigned int i;
> +
> +	while (atomic_read(&disk->zone_nr_wplugs_with_error)) {
> +		/* Serialize against revalidate. */
> +		mutex_lock(&disk->zone_wplugs_mutex);
> +
> +		zwplug = disk->zone_wplugs;
> +		if (!zwplug) {
> +			mutex_unlock(&disk->zone_wplugs_mutex);
> +			return;
> +		}
> +
> +		for (i = 0; i < disk->nr_zones; i++, zwplug++)
> +			blk_zone_wplug_handle_error(disk, zwplug);
> +
> +		mutex_unlock(&disk->zone_wplugs_mutex);
> +	}
> +}
> +
>   static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
>   {
>   	struct blk_zone_wplug *zwplugs;
> @@ -794,6 +1184,7 @@ static void blk_zone_free_write_plugs(struct gendisk *disk,
>   				      unsigned int nr_zones)
>   {
>   	struct blk_zone_wplug *zwplug = zwplugs;
> +	unsigned long flags;
>   	unsigned int i, n;
>   
>   	if (!zwplug)
> @@ -801,7 +1192,13 @@ static void blk_zone_free_write_plugs(struct gendisk *disk,
>   
>   	/* Make sure we do not leak any plugged BIO. */
>   	for (i = 0; i < nr_zones; i++, zwplug++) {
> +		blk_zone_wplug_lock(zwplug, flags);
> +		if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
> +			atomic_dec(&disk->zone_nr_wplugs_with_error);
> +			zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
> +		}
>   		n = blk_zone_wplug_abort(disk, zwplug);
> +		blk_zone_wplug_unlock(zwplug, flags);
>   		if (n)
>   			pr_warn_ratelimited("%s: zone %u, %u plugged BIOs aborted\n",
>   					    disk->disk_name, i, n);
> @@ -812,6 +1209,9 @@ static void blk_zone_free_write_plugs(struct gendisk *disk,
>   
>   void disk_free_zone_resources(struct gendisk *disk)
>   {
> +	if (disk->zone_wplugs)
> +		cancel_delayed_work_sync(&disk->zone_wplugs_work);
> +
>   	kfree(disk->conv_zones_bitmap);
>   	disk->conv_zones_bitmap = NULL;
>   	kfree(disk->seq_zones_wlock);
> @@ -819,6 +1219,8 @@ void disk_free_zone_resources(struct gendisk *disk)
>   
>   	blk_zone_free_write_plugs(disk, disk->zone_wplugs, disk->nr_zones);
>   	disk->zone_wplugs = NULL;
> +
> +	mutex_destroy(&disk->zone_wplugs_mutex);
>   }
>   
>   struct blk_revalidate_zone_args {
> @@ -890,6 +1292,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>   			if (!args->seq_zones_wlock)
>   				return -ENOMEM;
>   		}
> +		args->zone_wplugs[idx].capacity = zone->capacity;
> +		args->zone_wplugs[idx].wp_offset = blk_zone_wp_offset(zone);
>   		break;
>   	case BLK_ZONE_TYPE_SEQWRITE_PREF:
>   	default:
> @@ -964,6 +1368,13 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>   	if (!args.zone_wplugs)
>   		goto out_restore_noio;
>   
> +	if (!disk->zone_wplugs) {
> +		mutex_init(&disk->zone_wplugs_mutex);
> +		atomic_set(&disk->zone_nr_wplugs_with_error, 0);
> +		INIT_DELAYED_WORK(&disk->zone_wplugs_work,
> +				  disk_zone_wplugs_work);
> +	}
> +

Same question here about device quiesce ...

>   	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
>   				       blk_revalidate_zone_cb, &args);
>   	if (!ret) {
> @@ -989,12 +1400,14 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>   	 */
>   	blk_mq_freeze_queue(q);

And this, I guess, comes too late.
We've already read the zone list, so any write I/O submitted after the
report zones call but before here will make things iffy.

>   	if (ret > 0) {
> +		mutex_lock(&disk->zone_wplugs_mutex);
>   		disk->nr_zones = args.nr_zones;
>   		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
>   		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
>   		swap(disk->zone_wplugs, args.zone_wplugs);
>   		if (update_driver_data)
>   			update_driver_data(disk);
> +		mutex_unlock(&disk->zone_wplugs_mutex);
>   		ret = 0;
>   	} else {
>   		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
> diff --git a/block/blk.h b/block/blk.h
> index d0ecd5a2002c..7fbef6bb1aee 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -408,6 +408,11 @@ static inline bool bio_zone_write_plugging(struct bio *bio)
>   {
>   	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
>   }
> +static inline bool bio_is_zone_append(struct bio *bio)
> +{
> +	return bio_op(bio) == REQ_OP_ZONE_APPEND ||
> +		bio_flagged(bio, BIO_EMULATES_ZONE_APPEND);
> +}
>   void blk_zone_write_plug_bio_merged(struct bio *bio);
>   void blk_zone_write_plug_attempt_merge(struct request *rq);
>   static inline void blk_zone_complete_request_bio(struct request *rq,
> @@ -417,8 +422,9 @@ static inline void blk_zone_complete_request_bio(struct request *rq,
>   	 * For zone append requests, the request sector indicates the location
>   	 * at which the BIO data was written. Return this value to the BIO
>   	 * issuer through the BIO iter sector.
> -	 * For plugged zone writes, we need the original BIO sector so
> -	 * that blk_zone_write_plug_bio_endio() can lookup the zone write plug.
> +	 * For plugged zone writes, which include emulated zone append, we need
> +	 * the original BIO sector so that blk_zone_write_plug_bio_endio() can
> +	 * lookup the zone write plug.
>   	 */
>   	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
>   		bio->bi_iter.bi_sector = rq->__sector;
> @@ -437,6 +443,10 @@ static inline bool bio_zone_write_plugging(struct bio *bio)
>   {
>   	return false;
>   }
> +static inline bool bio_is_zone_append(struct bio *bio)
> +{
> +	return false;
> +}
>   static inline void blk_zone_write_plug_bio_merged(struct bio *bio)
>   {
>   }
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 19839d303289..5c5343099800 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -309,6 +309,7 @@ enum {
>   	BIO_REMAPPED,
>   	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
>   	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
> +	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
>   	BIO_FLAG_LAST
>   };
>   
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 87fba5af34ba..e619e10847bd 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -195,6 +195,9 @@ struct gendisk {
>   	unsigned long		*conv_zones_bitmap;
>   	unsigned long		*seq_zones_wlock;
>   	struct blk_zone_wplug	*zone_wplugs;
> +	struct mutex		zone_wplugs_mutex;
> +	atomic_t		zone_nr_wplugs_with_error;
> +	struct delayed_work	zone_wplugs_work;
>   #endif /* CONFIG_BLK_DEV_ZONED */
>   
>   #if IS_ENABLED(CONFIG_CDROM)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 09/26] block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
  2024-02-02  7:30 ` [PATCH 09/26] block: Allow BIO-based drivers to use blk_revalidate_disk_zones() Damien Le Moal
@ 2024-02-04 12:26   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:26 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Remove the check in blk_revalidate_disk_zones() restricting the use of
> this function to mq request-based drivers, to also allow BIO-based
> drivers to use it. This is safe to do as long as the BIO-based block
> device queue is already set up and usable, as it should be, and can be
> safely frozen.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-zoned.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 929c28796c41..8bf6821735f3 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1316,8 +1316,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>    * be called within the disk ->revalidate method for blk-mq based drivers.
>    * Before calling this function, the device driver must already have set the
>    * device zone size (chunk_sector limit) and the max zone append limit.
> - * For BIO based drivers, this function cannot be used. BIO based device drivers
> - * only need to set disk->nr_zones so that the sysfs exposed value is correct.
> + * BIO based drivers can also use this function as long as the device queue
> + * can be safely frozen.
>    * If the @update_driver_data callback function is not NULL, the callback is
>    * executed with the device request queue frozen after all zones have been
>    * checked.
> @@ -1334,8 +1334,6 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>   
>   	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
>   		return -EIO;
> -	if (WARN_ON_ONCE(!queue_is_mq(q)))
> -		return -EIO;
>   
>   	if (!capacity)
>   		return -ENODEV;
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/26] scsi: sd: Use the block layer zone append emulation
  2024-02-02  7:30 ` [PATCH 11/26] scsi: sd: " Damien Le Moal
@ 2024-02-04 12:29   ` Hannes Reinecke
  2024-02-06  1:55   ` Martin K. Petersen
  1 sibling, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:29 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Set the request queue of a TYPE_ZBC device as needing zone append
> emulation by setting the device queue max_zone_append_sectors limit to
> 0. This enables the block layer generic implementation provided by zone
> write plugging. With this, the sd driver will never see a
> REQ_OP_ZONE_APPEND request and the zone append emulation code
> implemented in sd_zbc.c can be removed.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/scsi/sd.c     |   8 -
>   drivers/scsi/sd.h     |  19 ---
>   drivers/scsi/sd_zbc.c | 335 ++----------------------------------------
>   3 files changed, 10 insertions(+), 352 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/26] dm: Use the block layer zone append emulation
  2024-02-02  7:30 ` [PATCH 10/26] dm: Use the block layer zone append emulation Damien Le Moal
  2024-02-03 17:58   ` Mike Snitzer
@ 2024-02-04 12:30   ` Hannes Reinecke
  1 sibling, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:30 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> For targets requiring zone append operation emulation with regular
> writes (e.g. dm-crypt), we can use the block layer emulation provided by
> zone write plugging. Remove DM implemented zone append emulation and
> enable the block layer one.
> 
> This is done by setting the max_zone_append_sectors limit of the
> mapped device queue to 0 for mapped devices that have a target table
> that cannot support native zone append operations. These include
> mixed zoned and non-zoned targets, or targets that explicitly requested
> emulation of zone append (e.g. dm-crypt). For these mapped devices, the
> new field emulate_zone_append is set to true. dm_split_and_process_bio()
> is modified to call blk_zone_write_plug_bio() for such devices to let the
> block layer transform zone append operations into regular writes. This
> is done after ensuring that the submitted BIO is split if it straddles
> zone boundaries.
> 
> dm_revalidate_zones() is also modified to use the block layer provided
> function blk_revalidate_disk_zones() so that all zone resources needed
> for zone append emulation are allocated and initialized by the block
> layer without DM core needing to do anything. Since the device table is
> not yet live when dm_revalidate_zones() is executed, enabling the use of
> blk_revalidate_disk_zones() requires adding a pointer to the device
> table in struct mapped_device. This avoids errors in
> dm_blk_report_zones() trying to get the table with dm_get_live_table().
> The mapped device table pointer is set to the table passed as argument
> to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and
> reset to NULL after this function returns to restore the live table
> handling for user call of report zones.
> 
> All the code related to zone append emulation is removed from
> dm-zone.c. This leads to simplifications of the functions __map_bio()
> and dm_zone_endio(). The latter function now only needs to deal with
> completions of real zone append operations for targets that support it.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/md/dm-core.h |  11 +-
>   drivers/md/dm-zone.c | 470 ++++---------------------------------------
>   drivers/md/dm.c      |  44 ++--
>   drivers/md/dm.h      |   7 -
>   4 files changed, 68 insertions(+), 464 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 12/26] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-02-02  7:30 ` [PATCH 12/26] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
@ 2024-02-04 12:31   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:31 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> With zone write plugging enabled at the block layer level, any zoned
> device can only ever see at most a single write operation per zone.
> There is thus no need to request a block scheduler with strict per-zone
> sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
> feature. Removing this allows using a zoned ublk device with any
> scheduler, including "none".
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/block/ublk_drv.c | 2 --
>   1 file changed, 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 13/26] null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-02-02  7:30 ` [PATCH 13/26] null_blk: " Damien Le Moal
@ 2024-02-04 12:31   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:31 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> With zone write plugging enabled at the block layer level, any zoned
> device can only ever see at most a single write operation per zone.
> There is thus no need to request a block scheduler with strict per-zone
> sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
> feature. Removing this allows using a zoned null_blk device with any
> scheduler, including "none".
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/block/null_blk/zoned.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index 6f5e0994862e..f2cb6da0dd0d 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -161,7 +161,6 @@ int null_register_zoned_dev(struct nullb *nullb)
>   
>   	disk_set_zoned(nullb->disk);
>   	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> -	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
>   	blk_queue_chunk_sectors(q, dev->zone_size_sects);
>   	nullb->disk->nr_zones = bdev_nr_zones(nullb->disk->part0);
>   	blk_queue_max_zone_append_sectors(q, dev->zone_size_sects);
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 14/26] null_blk: Introduce zone_append_max_sectors attribute
  2024-02-02  7:30 ` [PATCH 14/26] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
@ 2024-02-04 12:32   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:32 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Add the zone_append_max_sectors configfs attribute and module parameter
> to allow configuring the maximum number of 512B sectors of zone append
> operations. This attribute is meaningful only for zoned null block
> devices.
> 
> If not specified, the default is unchanged and the zoned device max
> append sectors limit is set to the device max sectors limit.
> If a non-zero value is used for this attribute (the default), native
> support for zone append operations is enabled.
> Setting a 0 value disables native zone append support so that the
> block layer emulation is used instead.
> 
> null_submit_bio() is modified to use blk_zone_write_plug_bio() to
> handle zone append emulation if that is enabled.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/block/null_blk/main.c     | 40 +++++++++++++++++++++----------
>   drivers/block/null_blk/null_blk.h |  1 +
>   drivers/block/null_blk/zoned.c    | 31 ++++++++++++++++++------
>   3 files changed, 52 insertions(+), 20 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 15/26] null_blk: Introduce fua attribute
  2024-02-02  7:30 ` [PATCH 15/26] null_blk: Introduce fua attribute Damien Le Moal
@ 2024-02-04 12:33   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:33 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Add the fua configfs attribute and module parameter to allow
> configuring if the device supports FUA or not. Using this attribute
> has an effect on the null_blk device only if memory backing is enabled
> together with a write cache (cache_size option).
> 
> This new attribute allows configuring a null_blk device with a write
> cache but without FUA support. This is convenient to test the block
> layer flush machinery.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/block/null_blk/main.c     | 12 ++++++++++--
>   drivers/block/null_blk/null_blk.h |  1 +
>   2 files changed, 11 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 16/26] nvmet: zns: Do not reference the gendisk conv_zones_bitmap
  2024-02-02  7:30 ` [PATCH 16/26] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
@ 2024-02-04 12:34   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:34 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> The gendisk conventional zone bitmap is going away. So to check for the
> presence of conventional zones on a zoned target device, always use
> report zones.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/nvme/target/zns.c | 10 +++-------
>   1 file changed, 3 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/26] block: Remove BLK_STS_ZONE_RESOURCE
  2024-02-02  7:30 ` [PATCH 17/26] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
@ 2024-02-04 12:34   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:34 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> The zone append emulation of the scsi disk driver was the only user
> of BLK_STS_ZONE_RESOURCE. With this emulation code removed,
> BLK_STS_ZONE_RESOURCE is now unused. Remove this macro definition and
> simplify blk_mq_dispatch_rq_list() where this status code was handled.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-mq.c            | 26 --------------------------
>   drivers/scsi/scsi_lib.c   |  1 -
>   include/linux/blk_types.h | 20 ++++----------------
>   3 files changed, 4 insertions(+), 43 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 18/26] block: Simplify blk_revalidate_disk_zones() interface
  2024-02-02  7:30 ` [PATCH 18/26] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
@ 2024-02-04 12:35   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:35 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> The only user of the second argument of blk_revalidate_disk_zones() was
> SCSI disk driver (sd). Now that this driver does not require this
> update_driver_data argument, remove it to simplify the interface of
> blk_revalidate_disk_zones(). Also update the function kdoc comment to
> be more accurate (i.e. there is no gendisk ->revalidate method).
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-zoned.c              | 16 +++++-----------
>   drivers/block/null_blk/zoned.c |  2 +-
>   drivers/block/ublk_drv.c       |  2 +-
>   drivers/block/virtio_blk.c     |  2 +-
>   drivers/md/dm-zone.c           |  2 +-
>   drivers/nvme/host/zns.c        |  2 +-
>   drivers/scsi/sd_zbc.c          |  2 +-
>   include/linux/blkdev.h         |  3 +--
>   8 files changed, 12 insertions(+), 19 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 19/26] block: mq-deadline: Remove support for zone write locking
  2024-02-02  7:30 ` [PATCH 19/26] block: mq-deadline: Remove support for zone write locking Damien Le Moal
@ 2024-02-04 12:36   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:36 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> With the block layer generic plugging of write operations for zoned
> block devices, mq-deadline, or any other scheduler, can only ever
> see at most one write operation per zone at any time. There are thus no
> sequentiality requirements for these writes, and no need to tightly
> control the dispatching of write requests using zone write locking.
> 
> Remove all the code that implements this control in the mq-deadline
> scheduler and remove advertising support for the
> ELEVATOR_F_ZBD_SEQ_WRITE elevator feature.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/mq-deadline.c | 176 ++------------------------------------------
>   1 file changed, 6 insertions(+), 170 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 20/26] block: Remove elevator required features
  2024-02-02  7:30 ` [PATCH 20/26] block: Remove elevator required features Damien Le Moal
@ 2024-02-04 12:36   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:36 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> The only elevator feature ever implemented is ELEVATOR_F_ZBD_SEQ_WRITE
> for signaling that a scheduler implements zone write locking to tightly
> control the dispatching order of write operations to zoned block
> devices. With the removal of zone write locking support in mq-deadline
> and the reliance of all block device drivers on the block layer zone
> write plugging to control ordering of write operations to zones, the
> elevator feature ELEVATOR_F_ZBD_SEQ_WRITE is completely unused.
> Remove it, and also remove the now unused code for filtering the
> possible schedulers for a block device based on required features.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-settings.c   | 16 ---------------
>   block/elevator.c       | 46 +++++-------------------------------------
>   block/elevator.h       |  1 -
>   include/linux/blkdev.h | 10 ---------
>   4 files changed, 5 insertions(+), 68 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 21/26] block: Do not check zone type in blk_check_zone_append()
  2024-02-02  7:30 ` [PATCH 21/26] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
@ 2024-02-04 12:37   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:37 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:30, Damien Le Moal wrote:
> Zone append operations are only allowed to target sequential write
> required zones. blk_check_zone_append() uses bio_zone_is_seq() to check
> this. However, this check is not necessary because:
> 1) For NVMe ZNS namespace devices, only sequential write required zones
>     exist, making the zone type check useless.
> 2) For null_blk, the driver will fail the request anyway, thus notifying
>     the user that a conventional zone was targeted.
> 3) For all other zoned devices, zone append is now emulated using zone
>     write plugging, which checks that a zone append operation does not
>     target a conventional zone.
> 
> In preparation for the removal of zone write locking and its
> conventional zone bitmap (used by bio_zone_is_seq()), remove the
> bio_zone_is_seq() call from blk_check_zone_append().
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-core.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 22/26] block: Move zone related debugfs attribute to blk-zoned.c
  2024-02-02  7:31 ` [PATCH 22/26] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
@ 2024-02-04 12:38   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:38 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:31, Damien Le Moal wrote:
> block/blk-mq-debugfs-zoned.c contains a single debugfs attribute
> function. Defining this outside of block/blk-zoned.c does not really
> help in any way, so move this zone related debugfs attribute to
> block/blk-zoned.c and delete block/blk-mq-debugfs-zoned.c.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/Kconfig                |  4 ----
>   block/Makefile               |  1 -
>   block/blk-mq-debugfs-zoned.c | 22 ----------------------
>   block/blk-mq-debugfs.h       |  2 +-
>   block/blk-zoned.c            | 20 ++++++++++++++++++++
>   5 files changed, 21 insertions(+), 28 deletions(-)
>   delete mode 100644 block/blk-mq-debugfs-zoned.c
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 23/26] block: Remove zone write locking
  2024-02-02  7:31 ` [PATCH 23/26] block: Remove zone write locking Damien Le Moal
@ 2024-02-04 12:38   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:38 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:31, Damien Le Moal wrote:
> Zone write locking is now unused and replaced with zone write plugging.
> Remove all code that was implementing zone write locking, that is, the
> various helper functions controlling request zone write locking and
> the gendisk attached zone bitmaps.
> 
> The "zone_wlock" mq-debugfs entry that was listing zones that are
> write-locked is replaced with the zone_plugged_wplugs entry which lists
> the number of zones that have a zone write plug throttling write
> operations.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-mq-debugfs.c    |  3 +-
>   block/blk-mq-debugfs.h    |  4 +-
>   block/blk-zoned.c         | 98 ++++++---------------------------------
>   include/linux/blk-mq.h    | 83 ---------------------------------
>   include/linux/blk_types.h |  1 -
>   include/linux/blkdev.h    | 36 ++------------
>   6 files changed, 21 insertions(+), 204 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 24/26] block: Do not special-case plugging of zone write operations
  2024-02-02  7:31 ` [PATCH 24/26] block: Do not special-case plugging of zone write operations Damien Le Moal
@ 2024-02-04 12:39   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:39 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:31, Damien Le Moal wrote:
> With the block layer zone write plugging being automatically done for
> any write operation to a zone of a zoned block device, a regular request
> plugging handled through current->plug can only ever see at most a
> single write request per zone. In that case, any potential reordering
> of the plugged requests will be harmless. We can thus remove the special
> casing for write operations to zones and have these requests plugged as
> well. This allows removing the function blk_mq_plug and instead directly
> using current->plug where needed.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-core.c       |  6 ------
>   block/blk-merge.c      |  3 +--
>   block/blk-mq.c         |  7 +------
>   block/blk-mq.h         | 31 -------------------------------
>   include/linux/blkdev.h | 12 ------------
>   5 files changed, 2 insertions(+), 57 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-02  7:31 ` [PATCH 25/26] block: Reduce zone write plugging memory usage Damien Le Moal
@ 2024-02-04 12:42   ` Hannes Reinecke
  2024-02-05 17:51     ` Bart Van Assche
  0 siblings, 1 reply; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:42 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:31, Damien Le Moal wrote:
> With zone write plugging, each zone of a zoned block device has a
> 64B struct blk_zone_wplug. While this is not a problem for small
> capacity drives with few zones, this structure size results in large
> memory usage per device for large capacity block devices.
> E.g., for a 28 TB SMR disk with over 104,000 zones of 256 MB, the zone
> write plug array of the gendisk uses 6.6 MB of memory.
> 
> However, except for the zone write plug spinlock, flags, zone capacity
> and zone write pointer offset which all need to be always available
> (the latter two to avoid having to do too many report zones), the remaining
> fields of struct blk_zone_wplug are needed only when a zone is being
> written to.
> 
> This commit introduces struct blk_zone_active_wplug to reduce the size
> of struct blk_zone_wplug from 64B down to 16B. This is done using an
> union of a pointer to a struct blk_zone_active_wplug and of the zone
> write pointer offset and zone capacity, with the zone write plug
> spinlock and flags left as the first fields of struct blk_zone_wplug.
> 
> The flag BLK_ZONE_WPLUG_ACTIVE is introduced to indicate if the pointer
> to struct blk_zone_active_wplug of a zone write plug is valid. In that
> case, the write pointer offset and zone capacity fields are accessible
> from struct blk_zone_active_wplug. Otherwise, they can be accessed from
> struct blk_zone_wplug.
> 
> This data structure organization allows tracking the write pointer
> offset of zones regardless of the zone write state (active or not).
> Handling of zone reset, reset all and finish operations is modified
> to update a zone write pointer offset according to its state.
> 
> A zone is activated in blk_zone_wplug_handle_write() with a call to
> blk_zone_activate_wplug(). An allocated active zone write plug is
> reclaimed (freed) using blk_zone_free_active_wplug() once its zone
> becomes full or becomes empty again after a reset: either directly when
> a plugged BIO completes and the zone is full, or when resetting or
> finishing zones.
> 
> For allocating struct blk_zone_active_wplug, a mempool is created and
> sized according to the disk zone resources (maximum number of open zones
> and maximum number of active zones). For devices with no zone resource
> limits, the default BLK_ZONE_DEFAULT_ACTIVE_WPLUG_NR (128) is used.
> 
> With this mechanism, the amount of memory used per block device for zone
> write plugs is roughly reduced by a factor of 4. E.g. for a 28 TB SMR
> hard disk, memory usage is reduced to about 1.6 MB.
> 
Hmm. Wouldn't it be sufficient to tie the number of available plugs to the
number of open zones? Of course that doesn't help for drives not
reporting that, but otherwise?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 26/26] block: Add zone_active_wplugs debugfs entry
  2024-02-02  7:31 ` [PATCH 26/26] block: Add zone_active_wplugs debugfs entry Damien Le Moal
@ 2024-02-04 12:43   ` Hannes Reinecke
  0 siblings, 0 replies; 107+ messages in thread
From: Hannes Reinecke @ 2024-02-04 12:43 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/2/24 15:31, Damien Le Moal wrote:
> Add the zone_active_wplugs debugfs entry to list the zone number and
> write pointer offset of zones that have an active zone write plug.
> 
> This helps ensure that struct blk_zone_active_wplug instances are reclaimed as
> zones become empty or full and allows observing which zones are being
> written by the block device user.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-mq-debugfs.c |  1 +
>   block/blk-mq-debugfs.h |  5 +++++
>   block/blk-zoned.c      | 27 +++++++++++++++++++++++++++
>   3 files changed, 33 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-04  3:56   ` Ming Lei
@ 2024-02-04 23:57     ` Damien Le Moal
  2024-02-05  2:19       ` Ming Lei
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-04 23:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig

On 2/4/24 12:56, Ming Lei wrote:
> On Fri, Feb 02, 2024 at 04:30:44PM +0900, Damien Le Moal wrote:
>> +/*
>> + * Zone write plug flags bits:
>> + *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
>> + *    to these zones are never plugged.
>> + *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
>> + *    that is, that write BIOs are being throttled due to a write BIO already
>> + *    being executed or the zone write plug bio list is not empty.
>> + */
>> +#define BLK_ZONE_WPLUG_CONV	(1U << 0)
>> +#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
> 
> BLK_ZONE_WPLUG_PLUGGED == !bio_list_empty(&zwplug->bio_list), so it looks
> like this flag isn't necessary.

No, it is. As the description says, the flag not only indicates that there are
plugged BIOs, but it also indicates that there is a write for the zone
in-flight. And that can happen even with the BIO list being empty. E.g. for a
qd=1 workload of small BIOs, no BIO will ever be added to the BIO list, but the
zone still must be marked as "plugged" when a write BIO is issued for it.

>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>> +					  struct bio *bio, unsigned int nr_segs)
>> +{
>> +	/*
>> +	 * Keep a reference on the BIO request queue usage. This reference will
>> +	 * be dropped either if the BIO is failed or after it is issued and
>> +	 * completes.
>> +	 */
>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> 
> It is fragile to get nested usage_counter, and same with grabbing/releasing it
> from different contexts or even functions, and it could be much better to just
> let block layer maintain it.
> 
> From patch 23's change:
> 
> +	 * Zoned block device information. Reads of this information must be
> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
> 
> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
> request_queue are live, so not sure if this .q_usage_counter protection
> is needed.

Hannes also commented about this. Let me revisit this.

>> +	/*
>> +	 * blk-mq devices will reuse the reference on the request queue usage
>> +	 * we took when the BIO was plugged, but the submission path for
>> +	 * BIO-based devices will not do that. So drop this reference here.
>> +	 */
>> +	if (bio->bi_bdev->bd_has_submit_bio)
>> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
> 
> But I don't see where this reference is reused for blk-mq in this patch,
> care to point it out?

This patch modifies blk_mq_submit_bio() to add a "goto new_request" at the top
for any BIO flagged with BIO_FLAG_ZONE_WRITE_PLUGGING. So when a plugged BIO is
unplugged and submitted again, the reference that was taken in
blk_zone_wplug_add_bio() is reused for the new request for that BIO.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/26] block: Implement zone append emulation
  2024-02-04 12:24   ` Hannes Reinecke
@ 2024-02-05  0:10     ` Damien Le Moal
  0 siblings, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05  0:10 UTC (permalink / raw)
  To: Hannes Reinecke, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/4/24 21:24, Hannes Reinecke wrote:
> On 2/2/24 15:30, Damien Le Moal wrote:
>> +/*
>> + * Set a zone write plug write pointer offset to either 0 (zone reset case)
>> + * or to the zone size (zone finish case). This aborts all plugged BIOs, which
>> + * is fine to do as doing a zone reset or zone finish while writes are in-flight
>> + * is a mistake from the user which will most likely cause all plugged BIOs to
>> + * fail anyway.
>> + */
>> +static void blk_zone_wplug_set_wp_offset(struct gendisk *disk,
>> +					 struct blk_zone_wplug *zwplug,
>> +					 unsigned int wp_offset)
>> +{
>> +	/*
>> +	 * Updating the write pointer offset puts back the zone
>> +	 * in a good state. So clear the error flag and decrement the
>> +	 * error count if we were in error state.
>> +	 */
>> +	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
>> +		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
>> +		atomic_dec(&disk->zone_nr_wplugs_with_error);
>> +	}
>> +
>> +	/* Update the zone write pointer and abort all plugged BIOs. */
>> +	zwplug->wp_offset = wp_offset;
>> +	blk_zone_wplug_abort(disk, zwplug);
>> +}

[...]

>> +static void blk_zone_wplug_handle_error(struct gendisk *disk,
>> +					struct blk_zone_wplug *zwplug)
>> +{
>> +	unsigned int zno = zwplug - disk->zone_wplugs;
>> +	sector_t zone_start_sector = bdev_zone_sectors(disk->part0) * zno;
>> +	unsigned int noio_flag;
>> +	struct blk_zone zone;
>> +	unsigned long flags;
>> +	int ret;
>> +
>> +	/* Check if we have an error and clear it if we do. */
>> +	blk_zone_wplug_lock(zwplug, flags);
>> +	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
>> +		goto unlock;
>> +	zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
>> +	atomic_dec(&disk->zone_nr_wplugs_with_error);
>> +	blk_zone_wplug_unlock(zwplug, flags);
>> +
> 
> Don't you need to quiesce the drive here?
> After all, I/O (or a zone reset) might be executed after the call to 
> report zones, but before we lock the zone, no?

Indeed, this is racy with zone reset. But there is no race with IOs because when
the error flag is set, we always plug incoming BIOs.
The race with reset (and finish) is actually easy to fix though. All I need to
do is not clear the error flag above, and instead check for it after the report
zones call, with the zone write plug locked. Given that
blk_zone_wplug_set_wp_offset() clears the error flag, we end up restoring a
known good wp either from the reset or from the report zones.

>>   struct blk_revalidate_zone_args {
>> @@ -890,6 +1292,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>>   			if (!args->seq_zones_wlock)
>>   				return -ENOMEM;
>>   		}
>> +		args->zone_wplugs[idx].capacity = zone->capacity;
>> +		args->zone_wplugs[idx].wp_offset = blk_zone_wp_offset(zone);
>>   		break;
>>   	case BLK_ZONE_TYPE_SEQWRITE_PREF:
>>   	default:
>> @@ -964,6 +1368,13 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>   	if (!args.zone_wplugs)
>>   		goto out_restore_noio;
>>   
>> +	if (!disk->zone_wplugs) {
>> +		mutex_init(&disk->zone_wplugs_mutex);
>> +		atomic_set(&disk->zone_nr_wplugs_with_error, 0);
>> +		INIT_DELAYED_WORK(&disk->zone_wplugs_work,
>> +				  disk_zone_wplugs_work);
>> +	}
>> +
> 
> Same question here about device quiesce ...

Yes, I need to check this, together with revisiting the queue usage counter
handling.

>>   	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
>>   				       blk_revalidate_zone_cb, &args);
>>   	if (!ret) {
>> @@ -989,12 +1400,14 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>   	 */
>>   	blk_mq_freeze_queue(q);
> 
> And this, I guess, comes too late.
> We've already read the zone list, so any write I/O submitted after the
> report zones call but before here will make things iffy.

Yes, but the driver is supposed to guarantee that this function is being called
while there are no writes in flight. DM is OK with that. I think scsi is too,
but need to check again. Not sure about NVMe and null_blk. But Ming had a good
point about the usage ref coming from the device being open. So a quiesce may be
enough here. Need to revisit this.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-04 23:57     ` Damien Le Moal
@ 2024-02-05  2:19       ` Ming Lei
  2024-02-05  2:41         ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Ming Lei @ 2024-02-05  2:19 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig

On Mon, Feb 05, 2024 at 08:57:00AM +0900, Damien Le Moal wrote:
> On 2/4/24 12:56, Ming Lei wrote:
> > On Fri, Feb 02, 2024 at 04:30:44PM +0900, Damien Le Moal wrote:
> >> +/*
> >> + * Zone write plug flags bits:
> >> + *  - BLK_ZONE_WPLUG_CONV: Indicate that the zone is a conventional one. Writes
> >> + *    to these zones are never plugged.
> >> + *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
> >> + *    that is, that write BIOs are being throttled due to a write BIO already
> >> + *    being executed or the zone write plug bio list is not empty.
> >> + */
> >> +#define BLK_ZONE_WPLUG_CONV	(1U << 0)
> >> +#define BLK_ZONE_WPLUG_PLUGGED	(1U << 1)
> > 
> > BLK_ZONE_WPLUG_PLUGGED == !bio_list_empty(&zwplug->bio_list), so it looks
> > like this flag isn't necessary.
> 
> No, it is. As the description says, the flag not only indicates that there are
> plugged BIOs, but it also indicates that there is a write for the zone
> in-flight. And that can happen even with the BIO list being empty. E.g. for a
> qd=1 workload of small BIOs, no BIO will ever be added to the BIO list, but the
> zone still must be marked as "plugged" when a write BIO is issued for it.

OK.

> 
> >> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> >> +					  struct bio *bio, unsigned int nr_segs)
> >> +{
> >> +	/*
> >> +	 * Keep a reference on the BIO request queue usage. This reference will
> >> +	 * be dropped either if the BIO is failed or after it is issued and
> >> +	 * completes.
> >> +	 */
> >> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> > 
> > It is fragile to get nested usage_counter, and same with grabbing/releasing it
> > from different contexts or even functions, and it could be much better to just
> > let block layer maintain it.
> > 
> > From patch 23's change:
> > 
> > +	 * Zoned block device information. Reads of this information must be
> > +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
> > 
> > Anytime if there is in-flight bio, the block device is opened, so both gendisk and
> > request_queue are live, so not sure if this .q_usage_counter protection
> > is needed.
> 
> Hannes also commented about this. Let me revisit this.

I think only queue re-configuration (blk_revalidate_zone) requires the
queue usage counter. Otherwise, bdev open()/close() should work just
fine.

> 
> >> +	/*
> >> +	 * blk-mq devices will reuse the reference on the request queue usage
> >> +	 * we took when the BIO was plugged, but the submission path for
> >> +	 * BIO-based devices will not do that. So drop this reference here.
> >> +	 */
> >> +	if (bio->bi_bdev->bd_has_submit_bio)
> >> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
> > 
> > But I don't see where this reference is reused for blk-mq in this patch,
> > care to point it out?
> 
> This patch modifies blk_mq_submit_bio() to add a "goto new_request" at the top
> for any BIO flagged with BIO_FLAG_ZONE_WRITE_PLUGGING. So when a plugged BIO is
> unplugged and submitted again, the reference that was taken in
> blk_zone_wplug_add_bio() is reused for the new request for that BIO.

OK, this reference reuse may be worse, because queue freeze can't prevent new
write zoned bio from being submitted any more given only percpu_ref_get() is
called for all write zoned bios.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05  2:19       ` Ming Lei
@ 2024-02-05  2:41         ` Damien Le Moal
  2024-02-05  3:38           ` Ming Lei
                             ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05  2:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig

On 2/5/24 11:19, Ming Lei wrote:
>>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>>>> +					  struct bio *bio, unsigned int nr_segs)
>>>> +{
>>>> +	/*
>>>> +	 * Keep a reference on the BIO request queue usage. This reference will
>>>> +	 * be dropped either if the BIO is failed or after it is issued and
>>>> +	 * completes.
>>>> +	 */
>>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>>>
>>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
>>> from different contexts or even functions, and it could be much better to just
>>> let block layer maintain it.
>>>
>>> From patch 23's change:
>>>
>>> +	 * Zoned block device information. Reads of this information must be
>>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
>>>
>>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
>>> request_queue are live, so not sure if this .q_usage_counter protection
>>> is needed.
>>
>> Hannes also commented about this. Let me revisit this.
> 
> I think only queue re-configuration(blk_revalidate_zone) requires the
> queue usage counter. Otherwise, bdev open()/close() should work just
> fine.

I want to check the FS case though. Not clear if mounting a FS that supports
zones (btrfs) also uses a bdev open?

>>>> +	/*
>>>> +	 * blk-mq devices will reuse the reference on the request queue usage
>>>> +	 * we took when the BIO was plugged, but the submission path for
>>>> +	 * BIO-based devices will not do that. So drop this reference here.
>>>> +	 */
>>>> +	if (bio->bi_bdev->bd_has_submit_bio)
>>>> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
>>>
>>> But I don't see where this reference is reused for blk-mq in this patch,
>>> care to point it out?
>>
>> This patch modifies blk_mq_submit_bio() to add a "goto new_request" at the top
>> for any BIO flagged with BIO_FLAG_ZONE_WRITE_PLUGGING. So when a plugged BIO is
>> unplugged and submitted again, the reference that was taken in
>> blk_zone_wplug_add_bio() is reused for the new request for that BIO.
> 
> OK, this reference reuse may be worse, because queue freeze can't prevent new
> write zoned bio from being submitted any more given only percpu_ref_get() is
> called for all write zoned bios.

New BIOs (BIOs that have never been plugged yet) will go through the normal
blk_queue_enter() in blk_mq_submit_bio(), so they will be stopped there if
another context asked for a queue freeze. I do not think there is any issue with
how things are currently done (we tested that *a lot* with many different drives
and drive configs with DM etc). Reference counting as it is is OK, even though
it can most likely be simplified. I am looking at that now.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05  2:41         ` Damien Le Moal
@ 2024-02-05  3:38           ` Ming Lei
  2024-02-05  5:11           ` Christoph Hellwig
  2024-02-05 10:06           ` Ming Lei
  2 siblings, 0 replies; 107+ messages in thread
From: Ming Lei @ 2024-02-05  3:38 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig

On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
> On 2/5/24 11:19, Ming Lei wrote:
> >>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> >>>> +					  struct bio *bio, unsigned int nr_segs)
> >>>> +{
> >>>> +	/*
> >>>> +	 * Keep a reference on the BIO request queue usage. This reference will
> >>>> +	 * be dropped either if the BIO is failed or after it is issued and
> >>>> +	 * completes.
> >>>> +	 */
> >>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> >>>
> >>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
> >>> from different contexts or even functions, and it could be much better to just
> >>> let block layer maintain it.
> >>>
> >>> From patch 23's change:
> >>>
> >>> +	 * Zoned block device information. Reads of this information must be
> >>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
> >>>
> >>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
> >>> request_queue are live, so not sure if this .q_usage_counter protection
> >>> is needed.
> >>
> >> Hannes also commented about this. Let me revisit this.
> > 
> > I think only queue re-configuration(blk_revalidate_zone) requires the
> > queue usage counter. Otherwise, bdev open()/close() should work just
> > fine.
> 
> I want to check the FS case though. Not clear if mounting a FS that supports
> zones (btrfs) also uses bdev open?

btrfs '-O zoned' shouldn't be one exception:

mount -O zoned /dev/ublkb0 /mnt

  b'blkdev_get_whole'
  b'bdev_open_by_dev'
  b'bdev_open_by_path'
  b'btrfs_scan_one_device'
  b'btrfs_get_tree'
  b'vfs_get_tree'
  b'fc_mount'
  b'btrfs_get_tree'
  b'vfs_get_tree'
  b'path_mount'
  b'__x64_sys_mount'
  b'do_syscall_64'
  b'entry_SYSCALL_64_after_hwframe'
  b'[unknown]'
    1

> 
> >>>> +	/*
> >>>> +	 * blk-mq devices will reuse the reference on the request queue usage
> >>>> +	 * we took when the BIO was plugged, but the submission path for
> >>>> +	 * BIO-based devices will not do that. So drop this reference here.
> >>>> +	 */
> >>>> +	if (bio->bi_bdev->bd_has_submit_bio)
> >>>> +		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
> >>>
> >>> But I don't see where this reference is reused for blk-mq in this patch,
> >>> care to point it out?
> >>
> >> This patch modifies blk_mq_submit_bio() to add a "goto new_request" at the top
> >> for any BIO flagged with BIO_FLAG_ZONE_WRITE_PLUGGING. So when a plugged BIO is
> >> unplugged and submitted again, the reference that was taken in
> >> blk_zone_wplug_add_bio() is reused for the new request for that BIO.
> > 
> > OK, this reference reuse may be worse, because queue freeze can't prevent new
> > write zoned bio from being submitted any more given only percpu_ref_get() is
> > called for all write zoned bios.
> 
> New BIOs (BIOs that have never been plugged yet) will go through the normal
> blk_queue_enter() in blk_mq_submit_bio(), so they will be stopped there if
> another context asked for a queue freeze. I do not think there is any issue with
> how things are currently done (we tested that *a lot* with many different drives
> and drive configs with DM etc). Reference counting as it stands is OK, even
> though it can most likely be simplified. I am looking at that now.

Indeed, a new zoned write bio is still covered by blk-mq's queue reference, and
the trick is just played on the old plugged bio.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05  2:41         ` Damien Le Moal
  2024-02-05  3:38           ` Ming Lei
@ 2024-02-05  5:11           ` Christoph Hellwig
  2024-02-05  5:37             ` Damien Le Moal
  2024-02-05 10:06           ` Ming Lei
  2 siblings, 1 reply; 107+ messages in thread
From: Christoph Hellwig @ 2024-02-05  5:11 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Ming Lei, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, Christoph Hellwig

On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
> > I think only queue re-configuration(blk_revalidate_zone) requires the
> > queue usage counter. Otherwise, bdev open()/close() should work just
> > fine.
> 
> I want to check the FS case though. Not clear if mounting a FS that supports
> zones (btrfs) also uses bdev open?

Every file system opens the block device.  But we don't just need the
block device to be open, but we also need the block limits to not
change, and the only way to do that is to hold a q_usage_counter
reference.
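
Concretely, the pattern is the usual one (illustrative sketch):

	if (!blk_queue_enter(q, 0)) {
		/* queue limits and zone information are stable here */
		...
		blk_queue_exit(q);
	}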

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05  5:11           ` Christoph Hellwig
@ 2024-02-05  5:37             ` Damien Le Moal
  2024-02-05  5:50               ` Christoph Hellwig
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05  5:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer

On 2/5/24 14:11, Christoph Hellwig wrote:
> On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
>>> I think only queue re-configuration(blk_revalidate_zone) requires the
>>> queue usage counter. Otherwise, bdev open()/close() should work just
>>> fine.
>>
>> I want to check the FS case though. Not clear if mounting a FS that supports
>> zones (btrfs) also uses bdev open?
> 
> Every file system opens the block device.  But we don't just need the
> block device to be open, but we also need the block limits to not
> change, and the only way to do that is to hold a q_usage_counter
> reference.

OK. So I think that Hannes' idea to get/put the queue usage counter reference
based on a zone BIO plug becoming not empty (get ref) and becoming empty (put
ref) may be simpler, then. And that would also work in the same way for blk-mq
and BIO-based drivers.
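
Something like this, roughly (a sketch of the idea, not tested code):

static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
					  struct bio *bio)
{
	/* empty -> not empty transition: take the queue usage reference */
	if (bio_list_empty(&zwplug->bio_list))
		percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);

	bio_list_add(&zwplug->bio_list, bio);
}

with the matching blk_queue_exit() done when the last plugged BIO of a zone
completes and the plug BIO list becomes empty again.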

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/26] dm: Use the block layer zone append emulation
  2024-02-03 17:58   ` Mike Snitzer
@ 2024-02-05  5:38     ` Damien Le Moal
  2024-02-05 20:33       ` Mike Snitzer
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05  5:38 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Christoph Hellwig

On 2/4/24 02:58, Mike Snitzer wrote:
> Love the overall improvement to the DM core code and the broader block
> layer by switching to this bio-based ZWP approach.
> 
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>

Thanks Mike !

> But one incremental suggestion inlined below.

I made this change, but in a slightly different form, as I noticed that I was
getting compile errors when CONFIG_BLK_DEV_ZONED is disabled.
The change looks like this now:

static void dm_split_and_process_bio(struct mapped_device *md,
				     struct dm_table *map, struct bio *bio)
{
	...
	need_split = is_abnormal = is_abnormal_io(bio);
	if (static_branch_unlikely(&zoned_enabled))
		need_split = is_abnormal || dm_zone_bio_needs_split(md, bio);

	...

	/*
	 * Use the block layer zone write plugging for mapped devices that
	 * need zone append emulation (e.g. dm-crypt).
	 */
	if (static_branch_unlikely(&zoned_enabled) &&
	    dm_zone_write_plug_bio(md, bio))
		return;

	...

with these added to dm-core.h:

static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
					   struct bio *bio)
{
	return md->emulate_zone_append && bio_straddle_zones(bio);
}

static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
					  struct bio *bio)
{
	return md->emulate_zone_append && blk_zone_write_plug_bio(bio, 0);
}

These 2 helpers are defined to "return false" for !CONFIG_BLK_DEV_ZONED.
I hope this works for you. Otherwise, I will drop your review tag when posting V2.
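
For reference, the !CONFIG_BLK_DEV_ZONED stubs would presumably look something
like this (my reading of the description above, not the actual patch):

#ifdef CONFIG_BLK_DEV_ZONED
/* the two inline helpers shown above */
#else
static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
					   struct bio *bio)
{
	return false;
}

static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
					  struct bio *bio)
{
	return false;
}
#endif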

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05  5:37             ` Damien Le Moal
@ 2024-02-05  5:50               ` Christoph Hellwig
  2024-02-05  6:14                 ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Christoph Hellwig @ 2024-02-05  5:50 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Ming Lei, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer

On Mon, Feb 05, 2024 at 02:37:41PM +0900, Damien Le Moal wrote:
> OK. So I think that Hannes' idea to get/put the queue usage counter reference
> based on a zone BIO plug becoming not empty (get ref) and becoming empty (put
> ref) may be simpler, then. And that would also work in the same way for blk-mq
> and BIO-based drivers.

Maybe I'm missing something, but I'm not sure how that would even work.

We need a q_usage_counter ref when doing all the submission checks
(limits, bounce, etc) early in blk_mq_submit_bio, and that one should be
taken using the normal bio_queue_enter path to do the right thing on
nowait submissions, when the queue is already frozen, etc.  What
is the benefit of not just keeping that reference vs releasing it
for all but the first bio just so that we need to grab another
new reference at the actual submission time?


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05  5:50               ` Christoph Hellwig
@ 2024-02-05  6:14                 ` Damien Le Moal
  0 siblings, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05  6:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer

On 2/5/24 14:50, Christoph Hellwig wrote:
> On Mon, Feb 05, 2024 at 02:37:41PM +0900, Damien Le Moal wrote:
>> OK. So I think that Hannes' idea to get/put the queue usage counter reference
>> based on a zone BIO plug becoming not empty (get ref) and becoming empty (put
>> ref) may be simpler, then. And that would also work in the same way for blk-mq
>> and BIO-based drivers.
> 
> Maybe I'm missing something, but I'm not sure how that would even work.
> 
> We need a q_usage_counter ref when doing all the submission checks
> (limits, bounce, etc) early in blk_mq_submit_bio, and that one should be
> taken using the normal bio_queue_enter path to do the right thing on
> nowait submissions, when the queue is already frozen, etc.  What
> is the benefit of not just keeping that reference vs releasing it
> for all but the first bio just so that we need to grab another
> new reference at the actual submission time?

I just tried to make the change, but the code does not become easier/cleaner at
all. In fact, quite the contrary: it is very messy. I think I am going to keep
things as is regarding the ref counting.

There is one thing I need to check though: I re-ran the perf tests but this time
took the average of 10 runs to mitigate differences due to variance between runs
of the same test. And doing that, I do see a regression in performance for ZNS
4K qd=16 sequential writes (751 MB/s with rc2 vs 661 MB/s with ZWP). I need to
check if that is due to never using the cached request for plugged BIOs in
blk_mq_submit_bio(). If that is the case, we'll need to tweak the reference
dropping there for the cached request case.

I am also seeing a regression in performance with btrfs on SMR HDD (244 MB/s
with rc2 and block/for-next vs 233 MB/s with ZWP). I need to check that as well.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05  2:41         ` Damien Le Moal
  2024-02-05  3:38           ` Ming Lei
  2024-02-05  5:11           ` Christoph Hellwig
@ 2024-02-05 10:06           ` Ming Lei
  2024-02-05 12:20             ` Damien Le Moal
  2 siblings, 1 reply; 107+ messages in thread
From: Ming Lei @ 2024-02-05 10:06 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig

On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
> On 2/5/24 11:19, Ming Lei wrote:
> >>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> >>>> +					  struct bio *bio, unsigned int nr_segs)
> >>>> +{
> >>>> +	/*
> >>>> +	 * Keep a reference on the BIO request queue usage. This reference will
> >>>> +	 * be dropped either if the BIO is failed or after it is issued and
> >>>> +	 * completes.
> >>>> +	 */
> >>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> >>>
> >>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
> >>> from different contexts or even functions, and it could be much better to just
> >>> let block layer maintain it.
> >>>
> >>> From patch 23's change:
> >>>
> >>> +	 * Zoned block device information. Reads of this information must be
> >>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
> >>>
> >>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
> >>> request_queue are live, so not sure if this .q_usage_counter protection
> >>> is needed.
> >>
> >> Hannes also commented about this. Let me revisit this.
> > 
> > I think only queue re-configuration(blk_revalidate_zone) requires the
> > queue usage counter. Otherwise, bdev open()/close() should work just
> > fine.
> 
> I want to check the FS case though. Not clear if mounting a FS that supports
> zones (btrfs) also uses bdev open?

I feel the following delta change might be cleaner and easily documented:

- one IO takes single reference for both bio based and blk-mq,
- no drop & re-grab
- only grab extra reference for bio based
- two code paths share same pattern

diff --git a/block/blk-core.c b/block/blk-core.c
index 9520ccab3050..118dd789beb5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -597,6 +597,10 @@ static void __submit_bio(struct bio *bio)
 
 	if (!bio->bi_bdev->bd_has_submit_bio) {
 		blk_mq_submit_bio(bio);
+	} else if (bio_zone_write_plugging(bio)) {
+		struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+		disk->fops->submit_bio(bio);
 	} else if (likely(bio_queue_enter(bio) == 0)) {
 		struct gendisk *disk = bio->bi_bdev->bd_disk;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f0fc61a3ec81..fc6d792747dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3006,8 +3006,12 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
 		goto queue_exit;
 
+	/*
+	 * Grab one reference for plugged zoned write and it will be reused in
+	 * next real submission
+	 */
 	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
-		goto queue_exit;
+		return;
 
 	if (!rq) {
 new_request:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index f6d4f511b664..87abb3f7ef30 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -514,7 +514,8 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
 	 * be dropped either if the BIO is failed or after it is issued and
 	 * completes.
 	 */
-	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
+	if (bio->bi_bdev->bd_has_submit_bio)
+		percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
 
 	/*
 	 * The BIO is being plugged and thus will have to wait for the on-going
@@ -760,15 +761,10 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 
 	blk_zone_wplug_unlock(zwplug, flags);
 
-	/*
-	 * blk-mq devices will reuse the reference on the request queue usage
-	 * we took when the BIO was plugged, but the submission path for
-	 * BIO-based devices will not do that. So drop this reference here.
-	 */
+	submit_bio_noacct_nocheck(bio);
+
 	if (bio->bi_bdev->bd_has_submit_bio)
 		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
-
-	submit_bio_noacct_nocheck(bio);
 }
 
 static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)

Thanks,
Ming


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05 10:06           ` Ming Lei
@ 2024-02-05 12:20             ` Damien Le Moal
  2024-02-05 12:43               ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 12:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig

On 2/5/24 19:06, Ming Lei wrote:
> On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
>> On 2/5/24 11:19, Ming Lei wrote:
>>>>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>>>>>> +					  struct bio *bio, unsigned int nr_segs)
>>>>>> +{
>>>>>> +	/*
>>>>>> +	 * Keep a reference on the BIO request queue usage. This reference will
>>>>>> +	 * be dropped either if the BIO is failed or after it is issued and
>>>>>> +	 * completes.
>>>>>> +	 */
>>>>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>>>>>
>>>>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
>>>>> from different contexts or even functions, and it could be much better to just
>>>>> let block layer maintain it.
>>>>>
>>>>> From patch 23's change:
>>>>>
>>>>> +	 * Zoned block device information. Reads of this information must be
>>>>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
>>>>>
>>>>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
>>>>> request_queue are live, so not sure if this .q_usage_counter protection
>>>>> is needed.
>>>>
>>>> Hannes also commented about this. Let me revisit this.
>>>
>>> I think only queue re-configuration(blk_revalidate_zone) requires the
>>> queue usage counter. Otherwise, bdev open()/close() should work just
>>> fine.
>>
>> I want to check the FS case though. Not clear if mounting a FS that supports
>> zones (btrfs) also uses bdev open?
> 
> I feel the following delta change might be cleaner and easily documented:
> 
> - one IO takes single reference for both bio based and blk-mq,
> - no drop & re-grab
> - only grab extra reference for bio based
> - two code paths share same pattern
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9520ccab3050..118dd789beb5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -597,6 +597,10 @@ static void __submit_bio(struct bio *bio)
>  
>  	if (!bio->bi_bdev->bd_has_submit_bio) {
>  		blk_mq_submit_bio(bio);
> +	} else if (bio_zone_write_plugging(bio)) {
> +		struct gendisk *disk = bio->bi_bdev->bd_disk;
> +
> +		disk->fops->submit_bio(bio);
>  	} else if (likely(bio_queue_enter(bio) == 0)) {
>  		struct gendisk *disk = bio->bi_bdev->bd_disk;
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f0fc61a3ec81..fc6d792747dc 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3006,8 +3006,12 @@ void blk_mq_submit_bio(struct bio *bio)
>  	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>  		goto queue_exit;
>  
> +	/*
> +	 * Grab one reference for plugged zoned write and it will be reused in
> +	 * next real submission
> +	 */
>  	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
> -		goto queue_exit;
> +		return;
>  
>  	if (!rq) {
>  new_request:
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index f6d4f511b664..87abb3f7ef30 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -514,7 +514,8 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>  	 * be dropped either if the BIO is failed or after it is issued and
>  	 * completes.
>  	 */
> -	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
> +	if (bio->bi_bdev->bd_has_submit_bio)
> +		percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>  
>  	/*
>  	 * The BIO is being plugged and thus will have to wait for the on-going
> @@ -760,15 +761,10 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
>  
>  	blk_zone_wplug_unlock(zwplug, flags);
>  
> -	/*
> -	 * blk-mq devices will reuse the reference on the request queue usage
> -	 * we took when the BIO was plugged, but the submission path for
> -	 * BIO-based devices will not do that. So drop this reference here.
> -	 */
> +	submit_bio_noacct_nocheck(bio);
> +
>  	if (bio->bi_bdev->bd_has_submit_bio)
>  		blk_queue_exit(bio->bi_bdev->bd_disk->queue);

Hmm... As-is, this is a potential use-after-free of the bio. But I get the idea.
This is indeed a little better. I will integrate this.

> -
> -	submit_bio_noacct_nocheck(bio);
>  }
>  
>  static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
> 
> Thanks,
> Ming
> 
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05 12:20             ` Damien Le Moal
@ 2024-02-05 12:43               ` Damien Le Moal
  0 siblings, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 12:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig

On 2/5/24 21:20, Damien Le Moal wrote:
> On 2/5/24 19:06, Ming Lei wrote:
>> On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote:
>>> On 2/5/24 11:19, Ming Lei wrote:
>>>>>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>>>>>>> +					  struct bio *bio, unsigned int nr_segs)
>>>>>>> +{
>>>>>>> +	/*
>>>>>>> +	 * Keep a reference on the BIO request queue usage. This reference will
>>>>>>> +	 * be dropped either if the BIO is failed or after it is issued and
>>>>>>> +	 * completes.
>>>>>>> +	 */
>>>>>>> +	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>>>>>>
>>>>>> It is fragile to get nested usage_counter, and same with grabbing/releasing it
>>>>>> from different contexts or even functions, and it could be much better to just
>>>>>> let block layer maintain it.
>>>>>>
>>>>>> From patch 23's change:
>>>>>>
>>>>>> +	 * Zoned block device information. Reads of this information must be
>>>>>> +	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
>>>>>>
>>>>>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and
>>>>>> request_queue are live, so not sure if this .q_usage_counter protection
>>>>>> is needed.
>>>>>
>>>>> Hannes also commented about this. Let me revisit this.
>>>>
>>>> I think only queue re-configuration(blk_revalidate_zone) requires the
>>>> queue usage counter. Otherwise, bdev open()/close() should work just
>>>> fine.
>>>
>>> I want to check the FS case though. Not clear if mounting a FS that supports
>>> zones (btrfs) also uses bdev open?
>>
>> I feel the following delta change might be cleaner and easily documented:
>>
>> - one IO takes single reference for both bio based and blk-mq,
>> - no drop & re-grab
>> - only grab extra reference for bio based
>> - two code paths share same pattern
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 9520ccab3050..118dd789beb5 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -597,6 +597,10 @@ static void __submit_bio(struct bio *bio)
>>  
>>  	if (!bio->bi_bdev->bd_has_submit_bio) {
>>  		blk_mq_submit_bio(bio);
>> +	} else if (bio_zone_write_plugging(bio)) {
>> +		struct gendisk *disk = bio->bi_bdev->bd_disk;
>> +
>> +		disk->fops->submit_bio(bio);

Actually, no, that is not correct. This would not stop BIO submission if
blk_queue_freeze() was called by another context. So we cannot do that here
without calling blk_queue_enter()...

>>  	} else if (likely(bio_queue_enter(bio) == 0)) {
>>  		struct gendisk *disk = bio->bi_bdev->bd_disk;
>>  
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index f0fc61a3ec81..fc6d792747dc 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -3006,8 +3006,12 @@ void blk_mq_submit_bio(struct bio *bio)
>>  	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>>  		goto queue_exit;
>>  
>> +	/*
>> +	 * Grab one reference for plugged zoned write and it will be reused in
>> +	 * next real submission
>> +	 */
>>  	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
>> -		goto queue_exit;
>> +		return;

...and this one is not correct because of the cached request: if there was a
cached request, blk_mq_submit_bio() did not call blk_queue_enter() because the
cached request already had a reference. But we cannot reuse that reference as
the next BIO may be a read or a write to a zone that is not plugged, and these
would use the cached request and so need the usage counter reference. So we
would still need to grab an extra reference in such a case.

So in the end, it feels a lot simpler to keep the reference counting as it was,
since that makes things a lot less messy in blk_mq_submit_bio(). I will, though,
try to improve the comments to make it clear how this works.
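
For completeness, the cached-request corner in Ming's variant would need
something like this extra grab (hypothetical, just to illustrate the mess):

	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs)) {
		/*
		 * With a cached request, bio_queue_enter() was never called
		 * on this entry, so the plugged BIO must take its own
		 * reference; the cached request keeps the one it owns.
		 */
		if (rq)
			percpu_ref_get(&q->q_usage_counter);
		return;
	}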

>>  
>>  	if (!rq) {
>>  new_request:
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index f6d4f511b664..87abb3f7ef30 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -514,7 +514,8 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
>>  	 * be dropped either if the BIO is failed or after it is issued and
>>  	 * completes.
>>  	 */
>> -	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>> +	if (bio->bi_bdev->bd_has_submit_bio)
>> +		percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
>>  
>>  	/*
>>  	 * The BIO is being plugged and thus will have to wait for the on-going
>> @@ -760,15 +761,10 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
>>  
>>  	blk_zone_wplug_unlock(zwplug, flags);
>>  
>> -	/*
>> -	 * blk-mq devices will reuse the reference on the request queue usage
>> -	 * we took when the BIO was plugged, but the submission path for
>> -	 * BIO-based devices will not do that. So drop this reference here.
>> -	 */
>> +	submit_bio_noacct_nocheck(bio);
>> +
>>  	if (bio->bi_bdev->bd_has_submit_bio)
>>  		blk_queue_exit(bio->bi_bdev->bd_disk->queue);
> 
> Hmm... As-is, this is a potential use-after-free of the bio. But I get the idea.
> This is indeed a little better. I will integrate this.
> 
>> -
>> -	submit_bio_noacct_nocheck(bio);
>>  }
>>  
>>  static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones)
>>
>> Thanks,
>> Ming
>>
>>
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (26 preceding siblings ...)
  2024-02-02  7:37 ` [PATCH 00/26] Zone write plugging Damien Le Moal
@ 2024-02-05 17:21 ` Bart Van Assche
  2024-02-05 23:42   ` Damien Le Moal
  2024-02-05 18:18 ` Bart Van Assche
  28 siblings, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-05 17:21 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/1/24 23:30, Damien Le Moal wrote:
> The patch series introduces zone write plugging (ZWP) as the new
> mechanism to control the ordering of writes to zoned block devices.
> ZWP replaces zone write locking (ZWL) which is implemented only by
> mq-deadline today. ZWP also allows emulating zone append operations
> using regular writes for zoned devices that do not natively support this
> operation (e.g. SMR HDDs). This patch series removes the scsi disk
> driver and device mapper zone append emulation to use ZWP emulation.

How are SCSI unit attention conditions handled?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 01/26] block: Restore sector of flush requests
  2024-02-02  7:30 ` [PATCH 01/26] block: Restore sector of flush requests Damien Le Moal
  2024-02-04 11:55   ` Hannes Reinecke
@ 2024-02-05 17:22   ` Bart Van Assche
  2024-02-05 23:42     ` Damien Le Moal
  1 sibling, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-05 17:22 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/1/24 23:30, Damien Le Moal wrote:
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index b0f314f4bc14..2f58ae018464 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -130,6 +130,7 @@ static void blk_flush_restore_request(struct request *rq)
>   	 * original @rq->bio.  Restore it.
>   	 */
>   	rq->bio = rq->biotail;
> +	rq->__sector = rq->bio->bi_iter.bi_sector;

Hmm ... is it guaranteed that rq->bio != NULL in this context?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 02/26] block: Remove req_bio_endio()
  2024-02-02  7:30 ` [PATCH 02/26] block: Remove req_bio_endio() Damien Le Moal
  2024-02-04 11:57   ` Hannes Reinecke
@ 2024-02-05 17:28   ` Bart Van Assche
  2024-02-05 23:45     ` Damien Le Moal
  2024-02-09  6:53     ` Damien Le Moal
  1 sibling, 2 replies; 107+ messages in thread
From: Bart Van Assche @ 2024-02-05 17:28 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/1/24 23:30, Damien Le Moal wrote:
> @@ -916,9 +888,8 @@ bool blk_update_request(struct request *req, blk_status_t error,
>   	if (blk_crypto_rq_has_keyslot(req) && nr_bytes >= blk_rq_bytes(req))
>   		__blk_crypto_rq_put_keyslot(req);
>   
> -	if (unlikely(error && !blk_rq_is_passthrough(req) &&
> -		     !(req->rq_flags & RQF_QUIET)) &&
> -		     !test_bit(GD_DEAD, &req->q->disk->state)) {
> +	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
> +	    !test_bit(GD_DEAD, &req->q->disk->state)) {

The new indentation of !test_bit(GD_DEAD, &req->q->disk->state) looks odd to me ...

>   		blk_print_req_error(req, error);
>   		trace_block_rq_error(req, error, nr_bytes);
>   	}
> @@ -930,12 +901,37 @@ bool blk_update_request(struct request *req, blk_status_t error,
>   		struct bio *bio = req->bio;
>   		unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>   
> -		if (bio_bytes == bio->bi_iter.bi_size)
> +		if (unlikely(error))
> +			bio->bi_status = error;
> +
> +		if (bio_bytes == bio->bi_iter.bi_size) {
>   			req->bio = bio->bi_next;

The behavior has been changed compared to the original code: the original code
only tests bio_bytes if error == 0. The new code tests bio_bytes no matter what
value the 'error' variable has. Is this behavior change intentional?

Otherwise this patch looks good to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-02  7:30 ` [PATCH 06/26] block: Introduce zone write plugging Damien Le Moal
  2024-02-04  3:56   ` Ming Lei
  2024-02-04 12:14   ` Hannes Reinecke
@ 2024-02-05 17:48   ` Bart Van Assche
  2024-02-05 23:48     ` Damien Le Moal
  2 siblings, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-05 17:48 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/1/24 23:30, Damien Le Moal wrote:
> The next plugged BIO is unplugged and issued once the write request completes.

So this patch series is orthogonal to my patch series that implements zoned
write pipelining?

> This mechanism allows to:
>   - Untangles zone write ordering from block IO schedulers. This allows

Untangles -> Untangle

> Zone write plugging is implemented using struct blk_zone_wplug. This
> structurei includes a spinlock, a BIO list and a work structure to

structurei -> structure

> This ensures that at any time, at most one request (blk-mq devices) or
> one BIO (BIO-based devices) are being executed for any zone. The
> handling of zone write plug using a per-zone plug spinlock maximizes
> parrallelism and device usage by allowing multiple zones to be writen

parrallelism -> parallelism

> simultaneously without lock contention.

This is not correct. Device usage is not maximized since zone write bios
are serialized. Pipelining zoned writes results in higher device
utilization.

> +	/*
> +	 * For BIOs handled through a zone write plugs, signal the end of the

plugs -> plug

> +#define blk_zone_wplug_lock(zwplug, flags) \
> +	spin_lock_irqsave(&zwplug->lock, flags)
> +
> +#define blk_zone_wplug_unlock(zwplug, flags) \
> +	spin_unlock_irqrestore(&zwplug->lock, flags)

Hmm ... these macros may make code harder to read rather than improve
readability of the code.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-04 12:42   ` Hannes Reinecke
@ 2024-02-05 17:51     ` Bart Van Assche
  2024-02-05 23:55       ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-05 17:51 UTC (permalink / raw)
  To: Hannes Reinecke, Damien Le Moal, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/4/24 04:42, Hannes Reinecke wrote:
> On 2/2/24 15:31, Damien Le Moal wrote:
>> With this mechanism, the amount of memory used per block device for zone
>> write plugs is roughly reduced by a factor of 4. E.g. for a 28 TB SMR
>> hard disk, memory usage is reduce to about 1.6 MB.
>>
> Hmm. Wouldn't it be sufficient to tie the number of available plugs to the
> number of open zones? Of course that doesn't help for drives not reporting that, but otherwise?

I have the same question. I think the number of zones opened by filesystems
like BTRFS and F2FS is much smaller than the total number of zones supported
by zoned drives.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/26] block: Implement zone append emulation
  2024-02-02  7:30 ` [PATCH 08/26] block: Implement zone append emulation Damien Le Moal
  2024-02-04 12:24   ` Hannes Reinecke
@ 2024-02-05 17:58   ` Bart Van Assche
  2024-02-05 23:57     ` Damien Le Moal
  1 sibling, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-05 17:58 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/1/24 23:30, Damien Le Moal wrote:
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 661ef61ca3b1..929c28796c41 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -42,6 +42,8 @@ struct blk_zone_wplug {
>   	unsigned int		flags;
>   	struct bio_list		bio_list;
>   	struct work_struct	bio_work;
> +	unsigned int		wp_offset;
> +	unsigned int		capacity;
>   };

This patch increases the size of struct blk_zone_wplug for all zoned storage
use cases, including the use cases where zone append is not used at all.
Shouldn't the size of this data structure depend on whether or not zone append
is in use?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
                   ` (27 preceding siblings ...)
  2024-02-05 17:21 ` Bart Van Assche
@ 2024-02-05 18:18 ` Bart Van Assche
  2024-02-06  0:07   ` Damien Le Moal
  28 siblings, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-05 18:18 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/1/24 23:30, Damien Le Moal wrote:
>   - Zone write plugging operates on BIOs instead of requests. Plugged
>     BIOs waiting for execution thus do not hold scheduling tags and thus
>     do not prevent other BIOs from being submitted to the device (reads
>     or writes to other zones). Depending on the workload, this can
>     significantly improve the device use and the performance.

Deep queues may introduce performance problems. In Android we had to
restrict the number of pending writes to the device queue depth because
otherwise read latency is too high (e.g. to start the camera app).

I'm not convinced that queuing zoned write bios is a better approach than
queuing zoned write requests.

Are there numbers available about the performance differences (bandwidth
and latency) between plugging zoned write bios and zoned write plugging
requests?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/26] dm: Use the block layer zone append emulation
  2024-02-05  5:38     ` Damien Le Moal
@ 2024-02-05 20:33       ` Mike Snitzer
  2024-02-05 23:40         ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Mike Snitzer @ 2024-02-05 20:33 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Christoph Hellwig

On Mon, Feb 05 2024 at 12:38P -0500,
Damien Le Moal <dlemoal@kernel.org> wrote:

> On 2/4/24 02:58, Mike Snitzer wrote:
> > Love the overall improvement to the DM core code and the broader block
> > layer by switching to this bio-based ZWP approach.
> > 
> > Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> 
> Thanks Mike !
> 
> > But one incremental suggestion inlined below.
> 
> I made this change, but in a slightly different form, as I noticed that I was
> getting compile errors when CONFIG_BLK_DEV_ZONED is disabled.
> The change looks like this now:
> 
> static void dm_split_and_process_bio(struct mapped_device *md,
> 				     struct dm_table *map, struct bio *bio)
> {
> 	...
> 	need_split = is_abnormal = is_abnormal_io(bio);
> 	if (static_branch_unlikely(&zoned_enabled))
> 		need_split = is_abnormal || dm_zone_bio_needs_split(md, bio);
> 
> 	...
> 
> 	/*
> 	 * Use the block layer zone write plugging for mapped devices that
> 	 * need zone append emulation (e.g. dm-crypt).
> 	 */
> 	if (static_branch_unlikely(&zoned_enabled) &&
> 	    dm_zone_write_plug_bio(md, bio))
> 		return;
> 
> 	...
> 
> with these added to dm-core.h:
> 
> static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
> 					   struct bio *bio)
> {
> 	return md->emulate_zone_append && bio_straddle_zones(bio);
> }
> static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
> 					  struct bio *bio)
> {
> 	return md->emulate_zone_append && blk_zone_write_plug_bio(bio, 0);
> }
> 
> These 2 helpers are defined to "return false" for !CONFIG_BLK_DEV_ZONED.
> I hope this works for you. Otherwise, I will drop your review tag when posting V2.

Why expose them in dm-core.h?

Just put what you have in dm-core.h above dm_split_and_process_bio in dm.c?

And yes, you can retain my Reviewed-by.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/26] dm: Use the block layer zone append emulation
  2024-02-05 20:33       ` Mike Snitzer
@ 2024-02-05 23:40         ` Damien Le Moal
  2024-02-06 20:41           ` Mike Snitzer
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 23:40 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Christoph Hellwig

On 2/6/24 05:33, Mike Snitzer wrote:
> On Mon, Feb 05 2024 at 12:38P -0500,
> Damien Le Moal <dlemoal@kernel.org> wrote:
> 
>> On 2/4/24 02:58, Mike Snitzer wrote:
>>> Love the overall improvement to the DM core code and the broader block
>>> layer by switching to this bio-based ZWP approach.
>>>
>>> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
>>
>> Thanks Mike !
>>
>>> But one incremental suggestion inlined below.
>>
>> I made this change, but in a slightly different form, as I noticed that I was
>> getting compile errors when CONFIG_BLK_DEV_ZONED is disabled.
>> The change looks like this now:
>>
>> static void dm_split_and_process_bio(struct mapped_device *md,
>> 				     struct dm_table *map, struct bio *bio)
>> {
>> 	...
>> 	need_split = is_abnormal = is_abnormal_io(bio);
>> 	if (static_branch_unlikely(&zoned_enabled))
>> 		need_split = is_abnormal || dm_zone_bio_needs_split(md, bio);
>>
>> 	...
>>
>> 	/*
>> 	 * Use the block layer zone write plugging for mapped devices that
>> 	 * need zone append emulation (e.g. dm-crypt).
>> 	 */
>> 	if (static_branch_unlikely(&zoned_enabled) &&
>> 	    dm_zone_write_plug_bio(md, bio))
>> 		return;
>>
>> 	...
>>
>> with these added to dm-core.h:
>>
>> static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
>> 					   struct bio *bio)
>> {
>> 	return md->emulate_zone_append && bio_straddle_zones(bio);
>> }
>> static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
>> 					  struct bio *bio)
>> {
>> 	return md->emulate_zone_append && blk_zone_write_plug_bio(bio, 0);
>> }
>>
>> These 2 helpers are defined to "return false" for !CONFIG_BLK_DEV_ZONED.
>> I hope this works for you. Otherwise, I will drop your review tag when posting V2.
> 
> Why expose them in dm-core.h?
> 
> Just put what you have in dm-core.h above dm_split_and_process_bio in dm.c?

I wanted to avoid "#ifdef CONFIG_BLK_DEV_ZONED" in the .c files. But if you are
OK with that, I can move these inline functions into dm.c.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-05 17:21 ` Bart Van Assche
@ 2024-02-05 23:42   ` Damien Le Moal
  2024-02-06  0:57     ` Bart Van Assche
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 23:42 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 02:21, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> The patch series introduces zone write plugging (ZWP) as the new
>> mechanism to control the ordering of writes to zoned block devices.
>> ZWP replaces zone write locking (ZWL) which is implemented only by
>> mq-deadline today. ZWP also allows emulating zone append operations
>> using regular writes for zoned devices that do not natively support this
>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>> driver and device mapper zone append emulation to use ZWP emulation.
> 
> How are SCSI unit attention conditions handled?

???? How does that have anything to do with this series?
Whatever SCSI sd is doing with unit attention conditions remains the same. I did
not touch that.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 01/26] block: Restore sector of flush requests
  2024-02-05 17:22   ` Bart Van Assche
@ 2024-02-05 23:42     ` Damien Le Moal
  0 siblings, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 23:42 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 02:22, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>> index b0f314f4bc14..2f58ae018464 100644
>> --- a/block/blk-flush.c
>> +++ b/block/blk-flush.c
>> @@ -130,6 +130,7 @@ static void blk_flush_restore_request(struct request *rq)
>>   	 * original @rq->bio.  Restore it.
>>   	 */
>>   	rq->bio = rq->biotail;
>> +	rq->__sector = rq->bio->bi_iter.bi_sector;
> 
> Hmm ... is it guaranteed that rq->bio != NULL in this context?

Yes.

> 
> Thanks,
> 
> Bart.
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 02/26] block: Remove req_bio_endio()
  2024-02-05 17:28   ` Bart Van Assche
@ 2024-02-05 23:45     ` Damien Le Moal
  2024-02-09  6:53     ` Damien Le Moal
  1 sibling, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 23:45 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 02:28, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> @@ -916,9 +888,8 @@ bool blk_update_request(struct request *req, blk_status_t error,
>>   	if (blk_crypto_rq_has_keyslot(req) && nr_bytes >= blk_rq_bytes(req))
>>   		__blk_crypto_rq_put_keyslot(req);
>>   
>> -	if (unlikely(error && !blk_rq_is_passthrough(req) &&
>> -		     !(req->rq_flags & RQF_QUIET)) &&
>> -		     !test_bit(GD_DEAD, &req->q->disk->state)) {
>> +	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
>> +	    !test_bit(GD_DEAD, &req->q->disk->state)) {
> 
> The new indentation of !test_bit(GD_DEAD, &req->q->disk->state) looks odd to me ...
> 
>>   		blk_print_req_error(req, error);
>>   		trace_block_rq_error(req, error, nr_bytes);
>>   	}
>> @@ -930,12 +901,37 @@ bool blk_update_request(struct request *req, blk_status_t error,
>>   		struct bio *bio = req->bio;
>>   		unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>>   
>> -		if (bio_bytes == bio->bi_iter.bi_size)
>> +		if (unlikely(error))
>> +			bio->bi_status = error;
>> +
>> +		if (bio_bytes == bio->bi_iter.bi_size) {
>>   			req->bio = bio->bi_next;
> 
> The behavior has been changed compared to the original code: the original code
> only tests bio_bytes if error == 0. The new code tests bio_bytes no matter what
> value the 'error' variable has. Is this behavior change intentional?

No. I do not think it is a problem though since if there is an error, bio_bytes
will always be less than bio->bi_iter.bi_size. I will tweak this to restore the
previous behavior.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05 17:48   ` Bart Van Assche
@ 2024-02-05 23:48     ` Damien Le Moal
  2024-02-06  0:52       ` Bart Van Assche
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 23:48 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 02:48, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> The next plugged BIO is unplugged and issued once the write request completes.
> 
> So this patch series is orthogonal to my patch series that implements zoned
> write pipelining?

I don't know.

>> simultaneously without lock contention.
> 
> This is not correct. Device usage is not maximized since zone write bios
> are serialized. Pipelining zoned writes results in higher device
> utilization.

I meant to say that the locking scheme does not get in the way of maximizing
device utilization/parallelism. If you are only writing to a single zone, then
sure, it does not matter since the drive will be used at qd=1 for that case.

>> +#define blk_zone_wplug_lock(zwplug, flags) \
>> +	spin_lock_irqsave(&zwplug->lock, flags)
>> +
>> +#define blk_zone_wplug_unlock(zwplug, flags) \
>> +	spin_unlock_irqrestore(&zwplug->lock, flags)
> 
> Hmm ... these macros may make code harder to read rather than improve
> readability of the code.

I do not see how they make the code less readable. Are the macro names not
clear enough?

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-05 17:51     ` Bart Van Assche
@ 2024-02-05 23:55       ` Damien Le Moal
  2024-02-06 21:20         ` Bart Van Assche
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 23:55 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 02:51, Bart Van Assche wrote:
> On 2/4/24 04:42, Hannes Reinecke wrote:
>> On 2/2/24 15:31, Damien Le Moal wrote:
>>> With this mechanism, the amount of memory used per block device for zone
>>> write plugs is roughly reduced by a factor of 4. E.g. for a 28 TB SMR
>>> hard disk, memory usage is reduce to about 1.6 MB.
>>>
>> Hmm. Wouldn't it be sufficient to tie the number of available plugs to the
>> number of open zones? Of course that doesn't help for drives not reporting that, but otherwise?
> 
> I have the same question. I think the number of zones opened by filesystems
> like BTRFS and F2FS is much smaller than the total number of zones supported
> by zoned drives.

I am not sure what Hannes meant, nor what you mean either.
The array of struct blk_zone_wplug for the disk is sized for the total number of
zones of the drive. The reason for that is that we want to retain the wp_offset
value for all zones, even if they are not being written. Otherwise, every time we
start writing a zone, we would need to do a report zones to be able to emulate
zone append operations if the drive requested that.

Once the user/FS starts writing to a zone, we allocate a struct
blk_zone_active_wplug (64 B) from a mempool that is sized according to the drive
maximum number of active zones or maximum number of open zones. The mempool
ensures minimal overhead for that allocation, far less than what a report zones
command would cost us in lost time (we are talking about 1ms at least on SMR
drives). Also, with SMR, a report zones causes a cache flush, so this command
can be very slow and we *really* want to avoid it in the hot path.

So yes, it is somewhat a large amount of memory (16 B per zone), but even with a
super large 28 TB SMR drive, that is still a reasonable 1.6 MB. But I am of
course open to suggestions to optimize this further.
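
For reference, the mempool sizing looks conceptually like this (illustrative
only, names simplified):

	unsigned int nr_plugs = queue_max_active_zones(q) ?:
				queue_max_open_zones(q);

	/* fall back to a fixed default if the drive reports no limits */
	pool = mempool_create_kmalloc_pool(nr_plugs ?: 128,
				sizeof(struct blk_zone_active_wplug));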

> 
> Thanks,
> 
> Bart.
> 
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/26] block: Implement zone append emulation
  2024-02-05 17:58   ` Bart Van Assche
@ 2024-02-05 23:57     ` Damien Le Moal
  0 siblings, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-05 23:57 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 02:58, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 661ef61ca3b1..929c28796c41 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -42,6 +42,8 @@ struct blk_zone_wplug {
>>   	unsigned int		flags;
>>   	struct bio_list		bio_list;
>>   	struct work_struct	bio_work;
>> +	unsigned int		wp_offset;
>> +	unsigned int		capacity;
>>   };
> 
> This patch increases the size of struct blk_zone_wplug for all zoned storage
> use cases, including the use cases where zone append is not used at all.
> Shouldn't the size of this data structure depend on whether or not zone append
> is in use?

The block layer does not know what command the user/FS will be issuing.
And see patch 25 of the series for the memory optimization that mitigates the
increase of the struct size that this patch introduces.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-05 18:18 ` Bart Van Assche
@ 2024-02-06  0:07   ` Damien Le Moal
  2024-02-06  1:25     ` Bart Van Assche
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-06  0:07 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 03:18, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>>   - Zone write plugging operates on BIOs instead of requests. Plugged
>>     BIOs waiting for execution thus do not hold scheduling tags and thus
>>     do not prevent other BIOs from being submitted to the device (reads
>>     or writes to other zones). Depending on the workload, this can
>>     significantly improve the device use and the performance.
> 
> Deep queues may introduce performance problems. In Android we had to
> restrict the number of pending writes to the device queue depth because
> otherwise read latency is too high (e.g. to start the camera app).

With zone write plugging, BIOs are delayed well above the scheduler and device.
BIOs that are plugged/delayed by ZWP do not hold tags, not even a scheduler tag,
so that allows reads (which are never plugged) to proceed. That is actually
unlike zone write locking, which can hold on to all scheduler tags, thus
preventing reads from proceeding.

> I'm not convinced that queuing zoned write bios is a better approach than
> queuing zoned write requests.

Well, I do not see why not. The above point on its own is, to me, a good enough
argument. And various tests with btrfs showed that even with a slow HDD I
can see better overall throughput with ZWP compared to zone write locking.
And for fast solid state zoned devices (NVMe/UFS), you do not even need an IO
scheduler anymore.
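
For example, with ZWP there is no problem running a zoned drive with
(device name illustrative):

  echo none > /sys/block/nvme0n1/queue/scheduler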

> 
> Are there numbers available about the performance differences (bandwidth
> and latency) between plugging zoned write bios and zoned write plugging
> requests?

Finish reading the cover letter. It has lots of measurements with rc2, Jens
block/for-next and ZWP...

I actually reran all these perf tests over the weekend, but this time did 10
runs and took the average for comparison. Overall, I confirmed the results
shown in the cover letter: performance with ZWP is generally on par or better,
but there is one exception: small sequential writes at high qd. There seems to
be an issue with regular plugging (current->plug) which results in lost merging
opportunities, causing the performance regression. I am digging into that to
understand what is happening.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/26] block: Introduce zone write plugging
  2024-02-05 23:48     ` Damien Le Moal
@ 2024-02-06  0:52       ` Bart Van Assche
  0 siblings, 0 replies; 107+ messages in thread
From: Bart Van Assche @ 2024-02-06  0:52 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/5/24 15:48, Damien Le Moal wrote:
> On 2/6/24 02:48, Bart Van Assche wrote:
>> On 2/1/24 23:30, Damien Le Moal wrote:
>>> +#define blk_zone_wplug_lock(zwplug, flags) \
>>> +	spin_lock_irqsave(&zwplug->lock, flags)
>>> +
>>> +#define blk_zone_wplug_unlock(zwplug, flags) \
>>> +	spin_unlock_irqrestore(&zwplug->lock, flags)
>>
>> Hmm ... these macros may make code harder to read rather than improve
>> readability of the code.
> 
> I do not see how they make the code less readable. The macro names are not clear
> enough ?

The macro names are clear, but they are almost as long as the
macro definitions. Usually this is a sign that it's better not to
introduce macros and instead open-code the calls.
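
I.e., just this at each call site:

	spin_lock_irqsave(&zwplug->lock, flags);
	/* ... manipulate the plug's BIO list ... */
	spin_unlock_irqrestore(&zwplug->lock, flags);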

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-05 23:42   ` Damien Le Moal
@ 2024-02-06  0:57     ` Bart Van Assche
  0 siblings, 0 replies; 107+ messages in thread
From: Bart Van Assche @ 2024-02-06  0:57 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/5/24 15:42, Damien Le Moal wrote:
> On 2/6/24 02:21, Bart Van Assche wrote:
>> On 2/1/24 23:30, Damien Le Moal wrote:
>>> The patch series introduces zone write plugging (ZWP) as the new
>>> mechanism to control the ordering of writes to zoned block devices.
>>> ZWP replaces zone write locking (ZWL) which is implemented only by
>>> mq-deadline today. ZWP also allows emulating zone append operations
>>> using regular writes for zoned devices that do not natively support this
>>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>>> driver and device mapper zone append emulation to use ZWP emulation.
>>
>> How are SCSI unit attention conditions handled?
> 
> ???? How does that have anything to do with this series ?
> Whatever SCSI sd is doing with unit attention conditions remains the same. I did
> not touch that.

I wrote my question before I had realized that this patch series
restricts the number of outstanding writes to one per zone. Hence,
there is no risk of unaligned write pointer errors caused by writes
being reordered after unit attention conditions, so my question can
be ignored :-)

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-06  0:07   ` Damien Le Moal
@ 2024-02-06  1:25     ` Bart Van Assche
  2024-02-09  4:03       ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-06  1:25 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/5/24 16:07, Damien Le Moal wrote:
> On 2/6/24 03:18, Bart Van Assche wrote:
>> Are there numbers available about the performance differences (bandwidth
>> and latency) between plugging zoned write bios and zoned write plugging
>> requests?
> 
> Finish reading the cover letter. It has lots of measurements with rc2, Jens
> block/for-next and ZWP...

Hmm ... as far as I know nobody ever implemented zoned write plugging
for requests in the block layer core so these numbers can't be in the
cover letter.

Has the bio plugging approach perhaps been chosen because it works
better for bio-based device mapper drivers?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 11/26] scsi: sd: Use the block layer zone append emulation
  2024-02-02  7:30 ` [PATCH 11/26] scsi: sd: " Damien Le Moal
  2024-02-04 12:29   ` Hannes Reinecke
@ 2024-02-06  1:55   ` Martin K. Petersen
  1 sibling, 0 replies; 107+ messages in thread
From: Martin K. Petersen @ 2024-02-06  1:55 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, Christoph Hellwig


Damien,

> Set the request queue of a TYPE_ZBC device as needing zone append
> emulation by setting the device queue max_zone_append_sectors limit to
> 0. This enables the block layer generic implementation provided by zone
> write plugging. With this, the sd driver will never see a
> REQ_OP_ZONE_APPEND request and the zone append emulation code
> implemented in sd_zbc.c can be removed.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/26] dm: Use the block layer zone append emulation
  2024-02-05 23:40         ` Damien Le Moal
@ 2024-02-06 20:41           ` Mike Snitzer
  0 siblings, 0 replies; 107+ messages in thread
From: Mike Snitzer @ 2024-02-06 20:41 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Christoph Hellwig

On Mon, Feb 05 2024 at  6:40P -0500,
Damien Le Moal <dlemoal@kernel.org> wrote:

> On 2/6/24 05:33, Mike Snitzer wrote:
> > On Mon, Feb 05 2024 at 12:38P -0500,
> > Damien Le Moal <dlemoal@kernel.org> wrote:
> > 
> >> On 2/4/24 02:58, Mike Snitzer wrote:
> >>> Love the overall improvement to the DM core code and the broader block
> >>> layer by switching to this bio-based ZWP approach.
> >>>
> >>> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> >>
> >> Thanks Mike !
> >>
> >>> But one incremental suggestion inlined below.
> >>
> >> I made this change, but in a slightly different form, as I noticed that I was
> >> getting compile errors when CONFIG_BLK_DEV_ZONED is disabled.
> >> The change looks like this now:
> >>
> >> static void dm_split_and_process_bio(struct mapped_device *md,
> >> 				     struct dm_table *map, struct bio *bio)
> >> {
> >> 	...
> >> 	need_split = is_abnormal = is_abnormal_io(bio);
> >> 	if (static_branch_unlikely(&zoned_enabled))
> >> 		need_split = is_abnormal || dm_zone_bio_needs_split(md, bio);
> >>
> >> 	...
> >>
> >> 	/*
> >> 	 * Use the block layer zone write plugging for mapped devices that
> >> 	 * need zone append emulation (e.g. dm-crypt).
> >> 	 */
> >> 	if (static_branch_unlikely(&zoned_enabled) &&
> >> 	    dm_zone_write_plug_bio(md, bio))
> >> 		return;
> >>
> >> 	...
> >>
> >> with these added to dm-core.h:
> >>
> >> static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
> >> 					   struct bio *bio)
> >> {
> >> 	return md->emulate_zone_append && bio_straddle_zones(bio);
> >> }
> >>
> >> static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
> >> 					  struct bio *bio)
> >> {
> >> 	return md->emulate_zone_append && blk_zone_write_plug_bio(bio, 0);
> >> }
> >>
> >> These 2 helpers are defined to "return false" for !CONFIG_BLK_DEV_ZONED.
> >> I hope this works for you. Otherwise, I will drop your review tag when posting V2.
> > 
> > Why expose them in dm-core.h ?
> > 
> > Just have what you put in dm-core.h above dm_split_and_process_bio in dm.c ?
> 
> I wanted to avoid "#ifdef CONFIG_BLK_DEV_ZONED" in the .c files. But if you are
> OK with that, I can move these inline functions into dm.c.

I'm OK with it; dm.c already does something like this for
dm_queue_destroy_crypto_profile() by checking if
CONFIG_BLK_INLINE_ENCRYPTION is defined.
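Something like the following directly in dm.c should do. This is only
a sketch based on the helpers quoted above; the !CONFIG_BLK_DEV_ZONED
stubs are my guess at what the "return false" versions would look like:

#ifdef CONFIG_BLK_DEV_ZONED
static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
					   struct bio *bio)
{
	return md->emulate_zone_append && bio_straddle_zones(bio);
}

static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
					  struct bio *bio)
{
	return md->emulate_zone_append && blk_zone_write_plug_bio(bio, 0);
}
#else
static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
					   struct bio *bio)
{
	return false;
}

static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
					  struct bio *bio)
{
	return false;
}
#endif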

Mike

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-05 23:55       ` Damien Le Moal
@ 2024-02-06 21:20         ` Bart Van Assche
  2024-02-09  3:58           ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-06 21:20 UTC (permalink / raw)
  To: Damien Le Moal, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/5/24 15:55, Damien Le Moal wrote:
> The array of struct blk_zone_wplug for the disk is sized for the total number of
> zones of the drive. The reason for that is that we want to retain the wp_offset
> value for all zones, even if they are not being written. Otherwise, every time we
> start writing a zone, we would need to do a report zones to be able to emulate
> zone append operations if the drive requested that.

We do not need to track wp_offset for empty zones nor for full zones. The data
structure with plug information would become a lot smaller if it only tracks
information for zones that are neither empty nor full. If a zone append is
submitted to a zone and no information is being tracked for that zone, we can
initialize wp_offset to zero. That may not match the actual write pointer if
the zone is full, but that shouldn't be an issue since zone appends submitted
to a full zone fail anyway.
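Something along these lines (a rough sketch; the names and the use of
an xarray as the sparse store are my assumptions, not the series code):

struct zone_info {
	unsigned int wp_offset;
	/* ... */
};

static unsigned int zone_wp_offset_hint(struct xarray *tracked,
					unsigned int zone_no)
{
	struct zone_info *zi = xa_load(tracked, zone_no);

	/*
	 * Untracked zone: either empty (write pointer at zone start)
	 * or full (writes will fail anyway), so 0 is a safe default.
	 */
	return zi ? zi->wp_offset : 0;
}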

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-06 21:20         ` Bart Van Assche
@ 2024-02-09  3:58           ` Damien Le Moal
  2024-02-09 19:36             ` Bart Van Assche
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-09  3:58 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/7/24 06:20, Bart Van Assche wrote:
> On 2/5/24 15:55, Damien Le Moal wrote:
>> The array of struct blk_zone_wplug for the disk is sized for the total number of
>> zones of the drive. The reason for that is that we want to retain the wp_offset
>> value for all zones, even if they are not being written. Otherwise, every time we
>> start writing a zone, we would need to do a report zones to be able to emulate
>> zone append operations if the drive requested that.
> 
> We do not need to track wp_offset for empty zones nor for full zones. The data
> structure with plug information would become a lot smaller if it only tracks
> information for zones that are neither empty nor full. If a zone append is
> submitted to a zone and no information is being tracked for that zone, we can
> initialize wp_offset to zero. That may not match the actual write pointer if
> the zone is full, but that shouldn't be an issue since zone appends submitted
> to a full zone fail anyway.

We still need to keep in memory the write pointer offset of zones that are not
being actively written to but have been previously partially written. So I do
not see how excluding empty and full zones from that tracking simplifies
anything at all. And the union of wp offset+zone capacity with a pointer to the
active zone plug structure is not *that* complicated to handle...
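To illustrate, the per-zone entry would look something like this (the
field names are mine for illustration, not necessarily the series code):

struct disk_zone_entry {
	union {
		struct {			/* zone not plugged */
			unsigned int wp_offset;
			unsigned int capacity;
		};
		struct blk_zone_wplug *wplug;	/* zone being written */
	};
	/* flags (not shown) indicate which union member is valid */
};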

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-06  1:25     ` Bart Van Assche
@ 2024-02-09  4:03       ` Damien Le Moal
  0 siblings, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-09  4:03 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 10:25, Bart Van Assche wrote:
> On 2/5/24 16:07, Damien Le Moal wrote:
>> On 2/6/24 03:18, Bart Van Assche wrote:
>>> Are there numbers available about the performance differences (bandwidth
>>> and latency) between plugging zoned write bios and zoned write plugging
>>> requests?
>>
>> Finish reading the cover letter. It has lots of measurements with rc2, Jens
>> block/for-next and ZWP...
> Hmm ... as far as I know nobody ever implemented zoned write plugging
> for requests in the block layer core so these numbers can't be in the
> cover letter.

No, I have not implemented zone write plugging for requests as I believe it
would lead to very similar results as zone write locking, that is, a potential
problem with efficiently using a device under a mixed read/write workload:
having too many plugged writes can lead to read starvation (read submission
blocking on request allocation once nr_requests is reached).

> Has the bio plugging approach perhaps been chosen because it works
> better for bio-based device mapper drivers?

Not that it "works better", but rather that doing the plugging at the BIO level
allows re-using the exact same code for zone append emulation and for write
ordering (if a DM driver wants the block layer to handle that). We had zone
append emulation implemented for DM (for dm-crypt) using BIOs and in the scsi
sd driver using requests. ZWP unifies all this and will trivially allow
enabling that emulation for other device types as well (e.g. NVMe ZNS drives
that do not have native zone append support).

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/26] Zone write plugging
  2024-02-03 12:11   ` Jens Axboe
@ 2024-02-09  5:28     ` Damien Le Moal
  0 siblings, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-09  5:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/3/24 21:11, Jens Axboe wrote:
>> I forgot to mention that the patches are against Jens block/for-next
>> branch with the addition of Christoph's "clean up blk_mq_submit_bio"
>> patches [1] and my patch "null_blk: Always split BIOs to respect queue
>> limits" [2].
> 
> I figured that was the case, I'll get both of these properly setup in a
> for-6.9/block branch, just wanted -rc3 to get cut first. JFYI that they
> are coming tomorrow.

Jens,

I saw the updated rc3-based for-next branch. Thanks for that. But it seems that
you removed the mq-deadline insert optimization? Is that on purpose, or did I
mess up something?

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 02/26] block: Remove req_bio_endio()
  2024-02-05 17:28   ` Bart Van Assche
  2024-02-05 23:45     ` Damien Le Moal
@ 2024-02-09  6:53     ` Damien Le Moal
  1 sibling, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-09  6:53 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/6/24 02:28, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> @@ -916,9 +888,8 @@ bool blk_update_request(struct request *req, blk_status_t error,
>>  	if (blk_crypto_rq_has_keyslot(req) && nr_bytes >= blk_rq_bytes(req))
>>  		__blk_crypto_rq_put_keyslot(req);
>>
>> -	if (unlikely(error && !blk_rq_is_passthrough(req) &&
>> -		     !(req->rq_flags & RQF_QUIET)) &&
>> -		     !test_bit(GD_DEAD, &req->q->disk->state)) {
>> +	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
>> +	    !test_bit(GD_DEAD, &req->q->disk->state)) {
> 
> The new indentation of !test_bit(GD_DEAD, &req->q->disk->state) looks odd to me.

But it is actually correct because that test_bit() call is not part of the
unlikely(). Not sure if that is intentional though.

> ...
> 
>>  		blk_print_req_error(req, error);
>>  		trace_block_rq_error(req, error, nr_bytes);
>>  	}
>> @@ -930,12 +901,37 @@ bool blk_update_request(struct request *req, blk_status_t error,
>>  		struct bio *bio = req->bio;
>>  		unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>>
>> -		if (bio_bytes == bio->bi_iter.bi_size)
>> +		if (unlikely(error))
>> +			bio->bi_status = error;
>> +
>> +		if (bio_bytes == bio->bi_iter.bi_size) {
>>  			req->bio = bio->bi_next;
> 
> The behavior has been changed compared to the original code: the original code
> only tests bio_bytes if error == 0. The new code tests bio_bytes no matter what
> value the 'error' variable has. Is this behavior change intentional?

No change actually. The bio_bytes test was in blk_update_request() already.

> 
> Otherwise this patch looks good to me.
> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-09  3:58           ` Damien Le Moal
@ 2024-02-09 19:36             ` Bart Van Assche
  2024-02-10  0:06               ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-09 19:36 UTC (permalink / raw)
  To: Damien Le Moal, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/8/24 19:58, Damien Le Moal wrote:
> We still need to keep in memory the write pointer offset of zones that are not
> being actively written to but have been previously partially written. So I do
> not see how excluding empty and full zones from that tracking simplifies
> anything at all. And the union of wp offset+zone capacity with a pointer to the
> active zone plug structure is not *that* complicated to handle...

Multiple zoned storage devices have 1000 or more zones. The number of partially
written zones is typically less than 10. Hence, tracking the partially written
zones only will result in significantly less memory being used, fewer CPU cache
misses and fewer MMU TLB lookup misses. I expect that this will matter since the
zone information data structure will be accessed every time a zoned write bio is
processed.

Bart.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-09 19:36             ` Bart Van Assche
@ 2024-02-10  0:06               ` Damien Le Moal
  2024-02-11  3:40                 ` Bart Van Assche
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-10  0:06 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/10/24 04:36, Bart Van Assche wrote:
> On 2/8/24 19:58, Damien Le Moal wrote:
>> We still need to keep in memory the write pointer offset of zones that are not
>> being actively written to but have been previously partially written. So I do
>> not see how excluding empty and full zones from that tracking simplifies
>> anything at all. And the union of wp offset+zone capacity with a pointer to the
>> active zone plug structure is not *that* complicated to handle...
> 
> Multiple zoned storage devices have 1000 or more zones. The number of partially

Try multiplying that by 100... 28TB SMR drives have 104,000 zones.

> written zones is typically less than 10. Hence, tracking the partially written

That is far from guaranteed, especially with devices that have no active zone
limits like SMR drives.

> zones only will result in significantly less memory being used, fewer CPU cache
> misses and fewer MMU TLB lookup misses. I expect that this will matter since the
> zone information data structure will be accessed every time a zoned write bio is
> processed.

Maybe. The performance numbers I have suggest that this is not an issue.

But in any case, what exactly is your idea here? Can you actually suggest
something? Are you suggesting that a sparse array of zone plugs be used, with
an rb-tree or an xarray? If that is what you are thinking, I can already tell
you that this is the first thing I tried. Early versions of this work used
a sparse xarray of zone plugs. But the problem with such an approach is that
it is a lot more complicated and there is a need for a single lock to manage
that structure (which is really not good for performance).

Hence this series, which uses a statically allocated array of zone plugs to
simplify things. Overall, this series is a significant change to the zone write
path and I wanted something simple/reliable that is not a nightmare to debug
and test. I believe that an xarray-based optimization can be re-tried as an
incremental change on top of this series. The nice thing about it is that the
API should not need to change, meaning that all changes can be contained within
blk-zoned.c.

But I may be missing entirely your point. So clarify please.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-10  0:06               ` Damien Le Moal
@ 2024-02-11  3:40                 ` Bart Van Assche
  2024-02-12  1:09                   ` Damien Le Moal
  2024-02-12  8:23                   ` Damien Le Moal
  0 siblings, 2 replies; 107+ messages in thread
From: Bart Van Assche @ 2024-02-11  3:40 UTC (permalink / raw)
  To: Damien Le Moal, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/9/24 16:06, Damien Le Moal wrote:
> On 2/10/24 04:36, Bart Van Assche wrote:
>> written zones is typically less than 10. Hence, tracking the partially written
> 
> That is far from guaranteed, especially with devices that have no active zone
> limits like SMR drives.

Interesting. The zoned devices I'm working with try to keep data in memory
for all zones that are neither empty nor full and hence impose an upper limit
on the number of open zones.

> But in any case, what exactly is your idea here? Can you actually suggest
> something? Are you suggesting that a sparse array of zone plugs be used, with
> an rb-tree or an xarray? If that is what you are thinking, I can already tell
> you that this is the first thing I tried. Early versions of this work used
> a sparse xarray of zone plugs. But the problem with such an approach is that
> it is a lot more complicated and there is a need for a single lock to manage
> that structure (which is really not good for performance).

Hmm ... since the xarray data structure supports RCU I think that locking the
entire xarray is only required if the zone condition changes from empty into
not empty or from neither empty nor full into full?

For the use cases I'm interested in, a hash table implementation that supports
RCU lookups probably will work better than an xarray. I think that the hash
table implementation in <linux/hashtable.h> supports RCU for lookups, insertion
and removal.
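A rough sketch of what a lookup with that API could look like (all
names here are invented for illustration; a real implementation would
also need to take a reference on the entry before using it):

#include <linux/hashtable.h>

static DEFINE_HASHTABLE(zwplug_table, 8);	/* 256 buckets */

struct zwplug_entry {
	struct hlist_node	node;
	unsigned int		zone_no;
	/* plug state, refcount, ... */
};

/* Must be called under rcu_read_lock(). */
static struct zwplug_entry *zwplug_find(unsigned int zone_no)
{
	struct zwplug_entry *e;

	hash_for_each_possible_rcu(zwplug_table, e, node, zone_no)
		if (e->zone_no == zone_no)
			return e;
	return NULL;
}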

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-11  3:40                 ` Bart Van Assche
@ 2024-02-12  1:09                   ` Damien Le Moal
  2024-02-12 18:58                     ` Bart Van Assche
  2024-02-12  8:23                   ` Damien Le Moal
  1 sibling, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-12  1:09 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/11/24 12:40, Bart Van Assche wrote:
> On 2/9/24 16:06, Damien Le Moal wrote:
>> On 2/10/24 04:36, Bart Van Assche wrote:
>>> written zones is typically less than 10. Hence, tracking the partially written
>>
>> That is far from guaranteed, especially with devices that have no active zone
>> limits like SMR drives.
> 
> Interesting. The zoned devices I'm working with try to keep data in memory
> for all zones that are neither empty nor full and hence impose an upper limit
> on the number of open zones.
> 
>> But in any case, what exactly is your idea here? Can you actually suggest
>> something? Are you suggesting that a sparse array of zone plugs be used, with
>> an rb-tree or an xarray? If that is what you are thinking, I can already tell
>> you that this is the first thing I tried. Early versions of this work used
>> a sparse xarray of zone plugs. But the problem with such an approach is that
>> it is a lot more complicated and there is a need for a single lock to manage
>> that structure (which is really not good for performance).
> 
> Hmm ... since the xarray data structure supports RCU I think that locking the
> entire xarray is only required if the zone condition changes from empty into
> not empty or from neither empty nor full into full?

I will try to revisit this. But again, that could be an incremental change on
top of this series...

> For the use cases I'm interested in, a hash table implementation that supports
> RCU lookups probably will work better than an xarray. I think that the hash
> table implementation in <linux/hashtable.h> supports RCU for lookups, insertion
> and removal.

It does, but the API for it is not the easiest, and I do not see how that could
be faster than an xarray, especially as the number of zones grows with high
capacity devices (read: potentially more collisions which will slow zone plug
lookups).

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-11  3:40                 ` Bart Van Assche
  2024-02-12  1:09                   ` Damien Le Moal
@ 2024-02-12  8:23                   ` Damien Le Moal
  2024-02-12  8:47                     ` Damien Le Moal
  1 sibling, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-12  8:23 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/11/24 12:40, Bart Van Assche wrote:
> On 2/9/24 16:06, Damien Le Moal wrote:
>> On 2/10/24 04:36, Bart Van Assche wrote:
>>> written zones is typically less than 10. Hence, tracking the partially written
>>
>> That is far from guaranteed, especially with devices that have no active zone
>> limits like SMR drives.
> 
> Interesting. The zoned devices I'm working with try to keep data in memory
> for all zones that are neither empty nor full and hence impose an upper limit
> on the number of open zones.
> 
>> But in any case, what exactly is your idea here? Can you actually suggest
>> something? Are you suggesting that a sparse array of zone plugs be used, with
>> an rb-tree or an xarray? If that is what you are thinking, I can already tell
>> you that this is the first thing I tried. Early versions of this work used
>> a sparse xarray of zone plugs. But the problem with such an approach is that
>> it is a lot more complicated and there is a need for a single lock to manage
>> that structure (which is really not good for performance).
> 
> Hmm ... since the xarray data structure supports RCU I think that locking the
> entire xarray is only required if the zone condition changes from empty into
> not empty or from neither empty nor full into full?
> 
> For the use cases I'm interested in, a hash table implementation that supports
> RCU lookups probably will work better than an xarray. I think that the hash
> table implementation in <linux/hashtable.h> supports RCU for lookups, insertion
> and removal.

I spent some time digging into this and also revisiting the possibility of using
an xarray. The conclusion is that this does not work well: it is at best no
better than what I did, and most of the time much worse. The reason is that we
need at the very least to keep this information around:
1) If the zone is conventional or not
2) The zone capacity of sequential write required zones

Unless we keep this information, a report zones command would be needed before
starting to write to a zone that does not yet have a zone write plug allocated.

(1) and (2) above can be trivially combined into a single 32-bit value. But
that value must exist for all zones. So at the very least, we need nr_zones * 4B
of memory allocated at all times. For such a case (i.e. a non-sparse structure),
an xarray or hash table would be more costly in memory than a simple static
array.

Given that we want to allocate/free zone write plugs dynamically as needed, we
essentially need an array of pointers, so 8B * nr_zones for the base structure.
From there, ideally, we should be able to use RCU to safely dereference/modify
the array entries. However, static arrays are not supported by the RCU code from
what I read.

Given this, my current approach that uses 16B per zone is the next best thing I
can think of without introducing a single lock for modifying the array entries.

If you have any other idea, please share.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-12  8:23                   ` Damien Le Moal
@ 2024-02-12  8:47                     ` Damien Le Moal
  2024-02-12 18:40                       ` Bart Van Assche
  0 siblings, 1 reply; 107+ messages in thread
From: Damien Le Moal @ 2024-02-12  8:47 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/12/24 17:23, Damien Le Moal wrote:
> On 2/11/24 12:40, Bart Van Assche wrote:
>> On 2/9/24 16:06, Damien Le Moal wrote:
>>> On 2/10/24 04:36, Bart Van Assche wrote:
>>>> written zones is typically less than 10. Hence, tracking the partially written
>>>
>>> That is far from guaranteed, especially with devices that have no active zone
>>> limits like SMR drives.
>>
>> Interesting. The zoned devices I'm working with try to keep data in memory
>> for all zones that are neither empty nor full and hence impose an upper limit
>> on the number of open zones.
>>
>>> But in any case, what exactly is your idea here? Can you actually suggest
>>> something? Are you suggesting that a sparse array of zone plugs be used, with
>>> an rb-tree or an xarray? If that is what you are thinking, I can already tell
>>> you that this is the first thing I tried. Early versions of this work used
>>> a sparse xarray of zone plugs. But the problem with such an approach is that
>>> it is a lot more complicated and there is a need for a single lock to manage
>>> that structure (which is really not good for performance).
>>
>> Hmm ... since the xarray data structure supports RCU I think that locking the
>> entire xarray is only required if the zone condition changes from empty into
>> not empty or from neither empty nor full into full?
>>
>> For the use cases I'm interested in, a hash table implementation that supports
>> RCU lookups probably will work better than an xarray. I think that the hash
>> table implementation in <linux/hashtable.h> supports RCU for lookups, insertion
>> and removal.
> 
> I spent some time digging into this and also revisiting the possibility of using
> an xarray. The conclusion is that this does not work well: it is at best no
> better than what I did, and most of the time much worse. The reason is that we
> need at the very least to keep this information around:
> 1) If the zone is conventional or not
> 2) The zone capacity of sequential write required zones
> 
> Unless we keep this information, a report zones command would be needed before
> starting to write to a zone that does not yet have a zone write plug allocated.
> 
> (1) and (2) above can be trivially combined into a single 32-bit value. But
> that value must exist for all zones. So at the very least, we need nr_zones * 4B
> of memory allocated at all times. For such a case (i.e. a non-sparse structure),
> an xarray or hash table would be more costly in memory than a simple static
> array.
> 
> Given that we want to allocate/free zone write plugs dynamically as needed, we
> essentially need an array of pointers, so 8B * nr_zones for the base structure.
> From there, ideally, we should be able to use RCU to safely dereference/modify
> the array entries. However, static arrays are not supported by the RCU code from
> what I read.
> 
> Given this, my current approach that uses 16B per zone is the next best thing I
> can think of without introducing a single lock for modifying the array entries.
> 
> If you have any other idea, please share.

Replying to myself as I had an idea:
1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing
"0" as a value for a zone in that array would indicate that the zone is
conventional. No additional zone bitmap needed.
2) Use a sparse xarray for managing allocated zone write plugs: 64B per
allocated zone write plug needed, which for an SMR drive would generally be at
most 128 * 64B = 8K.

So for an SMR drive with 100,000 zones, that would be a total of 408 KB, instead
of the current 1.6 MB. Will try to prototype this to see how performance goes (I
am worried about the xarray lookup overhead in the hot path).
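In other words, something like this (a sketch only; the names are
illustrative, not the actual patch code):

struct disk_zone_info {
	unsigned int	nr_zones;
	u32		*zone_capacity;	/* 4B/zone; 0 == conventional zone */
	struct xarray	zone_wplugs;	/* zone number -> allocated plug */
};

For 100,000 zones, that is 100,000 * 4B = 400 KB for the capacity array
plus about 8 KB of allocated plugs, which is where the 408 KB figure
above comes from.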

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-12  8:47                     ` Damien Le Moal
@ 2024-02-12 18:40                       ` Bart Van Assche
  2024-02-13  0:05                         ` Damien Le Moal
  0 siblings, 1 reply; 107+ messages in thread
From: Bart Van Assche @ 2024-02-12 18:40 UTC (permalink / raw)
  To: Damien Le Moal, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/12/24 00:47, Damien Le Moal wrote:
> Replying to myself as I had an idea:
> 1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing
> "0" as a value for a zone in that array would indicate that the zone is
> conventional. No additional zone bitmap needed.
> 2) Use a sparse xarray for managing allocated zone write plugs: 64B per
> allocated zone write plug needed, which for an SMR drive would generally be at
> most 128 * 64B = 8K.
> 
> So for an SMR drive with 100,000 zones, that would be a total of 408 KB, instead
> of the current 1.6 MB. Will try to prototype this to see how performance goes (I
> am worried about the xarray lookup overhead in the hot path).

Hi Damien,

Are there any zoned devices where the sequential write required zones occur before
the conventional zones? If not, does this mean that the conventional zones always
occur before the write pointer zones and also that storing the number of conventional
zones is sufficient?

Are there zoned storage devices where each zone has a different capacity? I have
not yet encountered any such device. I'm wondering whether a single capacity
variable would be sufficient for the entire device.

Thank you,

Bart.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-12  1:09                   ` Damien Le Moal
@ 2024-02-12 18:58                     ` Bart Van Assche
  0 siblings, 0 replies; 107+ messages in thread
From: Bart Van Assche @ 2024-02-12 18:58 UTC (permalink / raw)
  To: Damien Le Moal, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/11/24 17:09, Damien Le Moal wrote:
> On 2/11/24 12:40, Bart Van Assche wrote:
>> For the use cases I'm interested in a hash table implementation that supports
>> RCU-lookups probably will work better than an xarray. I think that the hash
>> table implementation in <linux/hashtable.h> supports RCU for lookups, insertion
>> and removal.
> 
> It does, but the API for it is not the easiest, and I do not see how that could
> be faster than an xarray, especially as the number of zones grows with high
> capacity devices (read: potentially more collisions which will slow zone plug
> lookups).

From the xarray documentation: "The XArray implementation is efficient when the
indices used are densely clustered". I think we are dealing with a sparse array
and hence that an xarray may not be the best-suited data structure. How about
using a hash table and making the hash table larger if the number of open zones
equals the hash table size? That is possible as follows:
* Instead of using DEFINE_HASHTABLE() or DECLARE_HASHTABLE(), allocate the hash
   table dynamically and use the struct hlist_head __rcu * data type.
* Use rcu_assign_pointer() to modify that pointer and kfree_rcu() to free old
   versions of the hash table.
* Use rcu_dereference_protected() for hash table lookups.

For an example, see also the output of the following command:
$ git grep -nHw state_bydst
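Roughly along these lines, sketching only the pointer-swap part of the
resize (all names here are made up for illustration):

struct zwplug_htable {
	struct rcu_head   rcu;
	unsigned int      size;		/* number of buckets */
	struct hlist_head buckets[];
};

static struct zwplug_htable __rcu *zwplug_ht;
static DEFINE_SPINLOCK(zwplug_ht_lock);	/* serializes table updates */

static void zwplug_ht_resize(unsigned int new_size)
{
	struct zwplug_htable *old, *new;

	new = kzalloc(struct_size(new, buckets, new_size), GFP_KERNEL);
	if (!new)
		return;
	new->size = new_size;

	spin_lock(&zwplug_ht_lock);
	old = rcu_dereference_protected(zwplug_ht,
					lockdep_is_held(&zwplug_ht_lock));
	/* ... rehash the entries from old into new here ... */
	rcu_assign_pointer(zwplug_ht, new);
	spin_unlock(&zwplug_ht_lock);

	if (old)
		kfree_rcu(old, rcu);
}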

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
  2024-02-12 18:40                       ` Bart Van Assche
@ 2024-02-13  0:05                         ` Damien Le Moal
  0 siblings, 0 replies; 107+ messages in thread
From: Damien Le Moal @ 2024-02-13  0:05 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, linux-block, Jens Axboe,
	linux-scsi, Martin K . Petersen, dm-devel, Mike Snitzer
  Cc: Christoph Hellwig

On 2/13/24 03:40, Bart Van Assche wrote:
> On 2/12/24 00:47, Damien Le Moal wrote:
>> Replying to myself as I had an idea:
>> 1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing
>> "0" as a value for a zone in that array would indicate that the zone is
>> conventional. No additional zone bitmap needed.
>> 2) Use a sparse xarray for managing allocated zone write plugs: 64B per
>> allocated zone write plug needed, which for an SMR drive would generally be at
>> most 128 * 64B = 8K.
>>
>> So for an SMR drive with 100,000 zones, that would be a total of 408 KB, instead
>> of the current 1.6 MB. Will try to prototype this to see how performance goes (I
>> am worried about the xarray lookup overhead in the hot path).
> 
> Hi Damien,
> 
> Are there any zoned devices where the sequential write required zones occur before
> the conventional zones? If not, does this mean that the conventional zones always
> occur before the write pointer zones and also that storing the number of conventional
> zones is sufficient?

Not sure where you want to go with this... In any case, there are SMR drives
which have conventional zones both before and after the bulk of the capacity,
which consists of sequential write required zones. Conventional zones can be
anywhere.

> Are there zoned storage devices where each zone has a different capacity? I have
> not yet encountered any such device. I'm wondering whether a single capacity
> variable would be sufficient for the entire device.

Yes, I did this optimization. Right now, for the 28TB SMR disk case, I am down
to a bitmap for conventional zones (one bit per zone, 16KB) plus
max-open-zones * 64 B for the zone write plugs. Cannot go lower than that. I am
still looking at xarray vs hash table for the zone write plugs to compare
overhead/performance.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 107+ messages in thread

end of thread, other threads:[~2024-02-13  0:05 UTC | newest]

Thread overview: 107+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-02  7:30 [PATCH 00/26] Zone write plugging Damien Le Moal
2024-02-02  7:30 ` [PATCH 01/26] block: Restore sector of flush requests Damien Le Moal
2024-02-04 11:55   ` Hannes Reinecke
2024-02-05 17:22   ` Bart Van Assche
2024-02-05 23:42     ` Damien Le Moal
2024-02-02  7:30 ` [PATCH 02/26] block: Remove req_bio_endio() Damien Le Moal
2024-02-04 11:57   ` Hannes Reinecke
2024-02-05 17:28   ` Bart Van Assche
2024-02-05 23:45     ` Damien Le Moal
2024-02-09  6:53     ` Damien Le Moal
2024-02-02  7:30 ` [PATCH 03/26] block: Introduce bio_straddle_zones() and bio_offset_from_zone_start() Damien Le Moal
2024-02-03  4:09   ` Bart Van Assche
2024-02-04 11:58   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 04/26] block: Introduce blk_zone_complete_request_bio() Damien Le Moal
2024-02-04 11:59   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 05/26] block: Allow using bio_attempt_back_merge() internally Damien Le Moal
2024-02-03  4:11   ` Bart Van Assche
2024-02-04 12:00   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 06/26] block: Introduce zone write plugging Damien Le Moal
2024-02-04  3:56   ` Ming Lei
2024-02-04 23:57     ` Damien Le Moal
2024-02-05  2:19       ` Ming Lei
2024-02-05  2:41         ` Damien Le Moal
2024-02-05  3:38           ` Ming Lei
2024-02-05  5:11           ` Christoph Hellwig
2024-02-05  5:37             ` Damien Le Moal
2024-02-05  5:50               ` Christoph Hellwig
2024-02-05  6:14                 ` Damien Le Moal
2024-02-05 10:06           ` Ming Lei
2024-02-05 12:20             ` Damien Le Moal
2024-02-05 12:43               ` Damien Le Moal
2024-02-04 12:14   ` Hannes Reinecke
2024-02-05 17:48   ` Bart Van Assche
2024-02-05 23:48     ` Damien Le Moal
2024-02-06  0:52       ` Bart Van Assche
2024-02-02  7:30 ` [PATCH 07/26] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
2024-02-04 12:15   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 08/26] block: Implement zone append emulation Damien Le Moal
2024-02-04 12:24   ` Hannes Reinecke
2024-02-05  0:10     ` Damien Le Moal
2024-02-05 17:58   ` Bart Van Assche
2024-02-05 23:57     ` Damien Le Moal
2024-02-02  7:30 ` [PATCH 09/26] block: Allow BIO-based drivers to use blk_revalidate_disk_zones() Damien Le Moal
2024-02-04 12:26   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 10/26] dm: Use the block layer zone append emulation Damien Le Moal
2024-02-03 17:58   ` Mike Snitzer
2024-02-05  5:38     ` Damien Le Moal
2024-02-05 20:33       ` Mike Snitzer
2024-02-05 23:40         ` Damien Le Moal
2024-02-06 20:41           ` Mike Snitzer
2024-02-04 12:30   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 11/26] scsi: sd: " Damien Le Moal
2024-02-04 12:29   ` Hannes Reinecke
2024-02-06  1:55   ` Martin K. Petersen
2024-02-02  7:30 ` [PATCH 12/26] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
2024-02-04 12:31   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 13/26] null_blk: " Damien Le Moal
2024-02-04 12:31   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 14/26] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
2024-02-04 12:32   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 15/26] null_blk: Introduce fua attribute Damien Le Moal
2024-02-04 12:33   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 16/26] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
2024-02-04 12:34   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 17/26] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
2024-02-04 12:34   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 18/26] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
2024-02-04 12:35   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 19/26] block: mq-deadline: Remove support for zone write locking Damien Le Moal
2024-02-04 12:36   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 20/26] block: Remove elevator required features Damien Le Moal
2024-02-04 12:36   ` Hannes Reinecke
2024-02-02  7:30 ` [PATCH 21/26] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
2024-02-04 12:37   ` Hannes Reinecke
2024-02-02  7:31 ` [PATCH 22/26] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
2024-02-04 12:38   ` Hannes Reinecke
2024-02-02  7:31 ` [PATCH 23/26] block: Remove zone write locking Damien Le Moal
2024-02-04 12:38   ` Hannes Reinecke
2024-02-02  7:31 ` [PATCH 24/26] block: Do not special-case plugging of zone write operations Damien Le Moal
2024-02-04 12:39   ` Hannes Reinecke
2024-02-02  7:31 ` [PATCH 25/26] block: Reduce zone write plugging memory usage Damien Le Moal
2024-02-04 12:42   ` Hannes Reinecke
2024-02-05 17:51     ` Bart Van Assche
2024-02-05 23:55       ` Damien Le Moal
2024-02-06 21:20         ` Bart Van Assche
2024-02-09  3:58           ` Damien Le Moal
2024-02-09 19:36             ` Bart Van Assche
2024-02-10  0:06               ` Damien Le Moal
2024-02-11  3:40                 ` Bart Van Assche
2024-02-12  1:09                   ` Damien Le Moal
2024-02-12 18:58                     ` Bart Van Assche
2024-02-12  8:23                   ` Damien Le Moal
2024-02-12  8:47                     ` Damien Le Moal
2024-02-12 18:40                       ` Bart Van Assche
2024-02-13  0:05                         ` Damien Le Moal
2024-02-02  7:31 ` [PATCH 26/26] block: Add zone_active_wplugs debugfs entry Damien Le Moal
2024-02-04 12:43   ` Hannes Reinecke
2024-02-02  7:37 ` [PATCH 00/26] Zone write plugging Damien Le Moal
2024-02-03 12:11   ` Jens Axboe
2024-02-09  5:28     ` Damien Le Moal
2024-02-05 17:21 ` Bart Van Assche
2024-02-05 23:42   ` Damien Le Moal
2024-02-06  0:57     ` Bart Van Assche
2024-02-05 18:18 ` Bart Van Assche
2024-02-06  0:07   ` Damien Le Moal
2024-02-06  1:25     ` Bart Van Assche
2024-02-09  4:03       ` Damien Le Moal
