* [PATCH v3 00/30] Zone write plugging
@ 2024-03-28  0:43 Damien Le Moal
  2024-03-28  0:43 ` [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio() Damien Le Moal
                   ` (30 more replies)
  0 siblings, 31 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

This patch series introduces zone write plugging (ZWP) as the new
mechanism to control the ordering of writes to zoned block devices.
ZWP replaces zone write locking (ZWL), which is implemented only by
mq-deadline today. ZWP also allows emulating zone append operations
using regular writes for zoned devices that do not natively support this
operation (e.g. SMR HDDs). The series removes the zone append emulation
of the scsi disk driver and of device mapper and replaces it with ZWP
emulation.

Unlike ZWL which operates on requests, ZWP operates on BIOs. A zone
write plug is simply a BIO list that is atomically manipulated using a
spinlock and a kblockd submission work. A write BIO to a zone is
"plugged" to delay its execution if a write BIO for the same zone was
already issued, that is, if a write request for the same zone is being
executed. The next plugged BIO is unplugged and issued once the write
request completes.
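
For illustration, the core of the plugging decision looks like the
sketch below (simplified, with a hypothetical function name; the real
code in patch 8 also handles conventional zones, zone append emulation
and write pointer checks):

	/*
	 * Simplified sketch: delay a write BIO if a write is already
	 * in flight for its zone, otherwise mark the zone as plugged
	 * and let the BIO proceed. Returns true if the BIO was plugged.
	 */
	static bool zone_wplug_handle_write(struct blk_zone_wplug *zwplug,
					    struct bio *bio)
	{
		unsigned long flags;

		spin_lock_irqsave(&zwplug->lock, flags);
		if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) {
			bio_list_add(&zwplug->bio_list, bio);
			spin_unlock_irqrestore(&zwplug->lock, flags);
			return true;
		}
		zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
		spin_unlock_irqrestore(&zwplug->lock, flags);
		return false;
	}

When the write request in flight completes, the next BIO is popped from
bio_list and submitted from the zone write plug bio_work.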

This mechanism makes it possible to:
 - Untangle zone write ordering from the block IO schedulers. This
   allows removing the restriction on using only mq-deadline for zoned
   block devices. Any block IO scheduler, including "none", can be used.
 - Operate on BIOs instead of requests: plugged BIOs waiting for
   execution do not hold scheduling tags and so do not prevent other
   BIOs from being submitted to the device (reads, or writes to other
   zones). Depending on the workload, this can significantly improve
   device utilization and performance.
 - Support both blk-mq (request based) zoned devices and BIO-based
   devices (e.g. device mapper). ZWP is mandatory for the former but
   optional for the latter: BIO-based drivers can use zone write
   plugging to implement write ordering guarantees, or implement their
   own mechanism if needed (see the sketch after this list).
 - The code is less invasive in the block layer and in device drivers.
   ZWP implementation is mostly limited to blk-zoned.c, with some small
   changes in blk-mq.c, blk-merge.c and bio.c.
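
For BIO-based drivers, the opt-in could look like the following sketch
(hypothetical driver and helper names; per patch 8,
blk_zone_write_plug_bio() returns true if the BIO was plugged and its
submission must be delayed, and the BIO passed to it must already be
split and not straddle zone boundaries):

	static void foo_submit_bio(struct bio *bio)
	{
		/*
		 * Let zone write plugging delay this BIO if another
		 * write for the same zone is in flight. Passing 0 for
		 * the number of segments is an assumption here.
		 */
		if (blk_zone_write_plug_bio(bio, 0))
			return;

		/* Issue the BIO to the underlying device. */
		foo_issue_bio(bio);
	}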

Performance evaluation results are shown below.

The series is based on 6.9.0-rc1 and organized as follows:

 - Patches 1 to 7 are preparatory changes for patch 8.
 - Patches 8, 9 and 10 introduce ZWP.
 - Patches 11 and 12 add zone append emulation to ZWP.
 - Patches 13 to 20 modify zoned block device drivers to use ZWP and
   prepare for the removal of ZWL.
 - Patches 21 to 30 remove zone write locking.

Overall, these changes do not significantly increase the amount of code
(the higher number of additions shown by the diffstat is in fact due to
a larger amount of comments in the code).

Many thanks must go to Christoph Hellwig for comments and suggestions
he provided on earlier versions of these patches.

Performance evaluation results
==============================

Environments:
 - Intel Xeon 16-cores/32-threads, 128GB of RAM
 - Kernel:
   - ZWL (baseline): 6.9.0-rc1
   - ZWP: 6.9.0-rc1 kernel patched to add zone write plugging
   (both kernels were compiled with the same configuration, turning off
   most heavy debug features)

Workloads:
 - seqw4K1: 4KB sequential write, qd=1
 - seqw4K16: 4KB sequential write, qd=16
 - seqw1M1: 1MB sequential write, qd=1
 - seqw1M16: 1MB sequential write, qd=16
 - rndw4K16: 4KB random write, qd=16
 - rndw128K16: 128KB random write, qd=16
 - btrfs workload: Single fio job writing 128 MB files using 128 KB
   direct IOs at qd=16.
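
The exact fio command lines used are not included in this posting; a
representative invocation for the seqw4K16 workload could look like the
following (assuming fio's zoned block device mode and the io_uring
engine):

	fio --name=seqw4K16 --filename=/dev/nullb0 --direct=1 \
	    --zonemode=zbd --rw=write --bs=4k --iodepth=16 \
	    --ioengine=io_uring --numjobs=1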

Devices:
 - nullblk (zoned): 4096 zones of 256 MB, 128 max open zones.
 - NVMe ZNS drive: 1 TB ZNS drive with 2GB zone size, 14 max open and
   active zones.
 - SMR HDD: 20 TB disk with 256MB zone size, 128 max open zones.

For ZWP, the results show the performance percentage increase (or
decrease) against the ZWL (baseline) case.

1) null_blk zoned device:

             +--------+--------+-------+--------+--------+----------+
             |seqw4K1 |seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
             |(MB/s)  | (MB/s) |(MB/s) | (MB/s) | (KIOPS)| (KIOPS)  |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWL    | 961    | 835    | 18640 | 14480  | 415    | 167      |
 |mq-deadline|        |        |       |        |        |          |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 963    | 845    | 18640 | 14630  | 452    | 165      |
 |mq-deadline| (+0%)  | (+1%)  | (+0%) | (+1%)  | (+8%)  | (-1%)    |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 731    | 651    | 15780 | 12710  | 129    | 108      |
 |    bfq    | (-23%) | (-22%) | (-15%)| (-12%) | (-68%) | (-15%)   |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 2511   | 1632   | 27470 | 19340  | 336    | 150      |
 |   none    | (+161%)| (+95%) | (+47%)| (+33%) | (-19%) | (-19%)   |
 +-----------+--------+--------+-------+--------+--------+----------+

ZWP with mq-deadline gives performance very similar to zone write
locking, showing that the zone write plugging overhead is acceptable.
More importantly, ZWP's ability to run fast block devices with the none
scheduler brings out all the benefits of zone write plugging and
results in significant performance increases for all sequential write
workloads. The exception to this is the random write workloads: for
these, the faster request submission rate achieved by zone write
plugging results in higher contention on the null_blk zone spinlock,
which degrades performance.

2) NVMe ZNS drive:

             +--------+--------+-------+--------+--------+----------+
             |seqw4K1 |seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
             |(MB/s)  | (MB/s) |(MB/s) | (MB/s) | (KIOPS)|  (KIOPS) |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWL    | 183    | 707    | 1083  | 1101   | 53.6   | 14.1     |
 |mq-deadline|        |        |       |        |        |          |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 183    | 719    | 1082  | 1103   | 55.5   | 14.1     |
 |mq-deadline| (-0%)  | (+1%)  | (+0%) | (+0%)  | (+3%)  | (+0%)    |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 175    | 691    | 1078  | 1097   | 28.3   | 11.2     |
 |    bfq    | (-4%)  | (-2%)  | (-0%) | (-0%)  | (-47%) | (-20%)   |
 +-----------+--------+--------+-------+--------+--------+----------+
 |    ZWP    | 190    | 665    | 1083  | 1105   | 51.4   | 14.1     |
 |   none    | (+4%)  | (-5%)  | (+0%) | (+0%)  | (-4%)  | (+0%)    |
 +-----------+--------+--------+-------+--------+--------+----------+

Zone write plugging overhead does not significantly impact performance.
Similar to null_blk, the none scheduler can be used without any
significant performance penalty for most workloads.

3) SMR SATA HDD:

             +-------+--------+-------+--------+--------+----------+
             |seqw4K1|seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
             |(MB/s) | (MB/s) |(MB/s) | (MB/s) | (KIOPS)|  (KIOPS) |
 +-----------+-------+--------+-------+--------+--------+----------+
 |    ZWL    | 107   | 243    | 246   | 246    | 2.2    | 0.769    |
 |mq-deadline|       |        |       |        |        |          |
 +-----------+-------+--------+-------+--------+--------+----------+
 |    ZWP    | 109   | 240    | 246   | 244    | 2.2    | 0.767    |
 |mq-deadline|(+1%)  | (-1%)  | (-0%) | (-0%)  | (+0%)  | (+0%)    |
 +-----------+-------+--------+-------+--------+--------+----------+
 |    ZWP    | 104   | 240    | 247   | 244    | 2.3    | 0.765    |
 |    bfq    | (-2%) | (-1%)  | (+0%) | (-0%)  | (+0%)  | (+0%)    |
 +-----------+-------+--------+-------+--------+--------+----------+
 |    ZWP    | 115   | 235    | 246   | 243    | 2.2    | 0.771    |
 |   none    | (+7%) | (-3%)  | (+0%) | (-1%)  | (+0%)  | (+0%)    |
 +-----------+-------+--------+-------+--------+--------+----------+

Performance with purely sequential write workloads at high queue depth
decreases slightly when using zone write plugging. This is due to the
different IO pattern that ZWP generates, where the first writes to a
zone start being issued while the end of the previous zone is still
being written. Depending on how the disk handles queued commands, seeks
may be generated, slightly impacting the throughput achieved. Such
purely sequential write workloads are however rare with SMR drives.

4) Zone append tests using btrfs:

             +-------------+-------------+-----------+-------------+
             |  null_blk   |  null_blk   |    ZNS    |     SMR     |
             |  native ZA  | emulated ZA | native ZA | emulated ZA |
             |    (MB/s)   |   (MB/s)    |   (MB/s)  |    (MB/s)   |
 +-----------+-------------+-------------+-----------+-------------+
 |    ZWL    | 2434        | N/A         | 1083      | 244         |
 |mq-deadline|             |             |           |             |
 +-----------+-------------+-------------+-----------+-------------+
 |    ZWP    | 2361        | 3111        | 1087      | 239         |
 |mq-deadline| (+1%)       |             | (+0%)     | (-2%)       |
 +-----------+-------------+-------------+-----------+-------------+
 |    ZWP    | 2299        | 2840        | 1082      | 239         |
 |    bfq    | (-4%)       |             | (+0%)     | (-2%)       |
 +-----------+-------------+-------------+-----------+-------------+
 |    ZWP    | 2443        | 3152        | 1078      | 238         |
 |    none   | (+0%)       |             | (-0%)     | (-2%)       |
 +-----------+-------------+-------------+-----------+-------------+

With a more realistic use of the device through a file system, ZWP does
not introduce significant performance differences, except for SMR, for
the same reason as with the fio sequential write workloads at high
queue depth.

Changes from v2:
 - Added Patch 1 (Christoph's comment)
 - Fixed error code setup in Patch 3 (Bart's comment)
 - Split former patch 26 into patches 27 and 28
 - Modified patch 8 (zone write plugging introduction) to remove the
   kmem_cache use and address Bart's and Christoph's comments.
 - Changed from using a mempool of zone write plugs to using a simple
   free-list (patch 9)
 - Simplified patch 10 as suggested by Christoph
 - Moved common code to a helper in patch 13 as suggested by Christoph

Changes from v1:
 - Added patch 6
 - Rewrite of patch 7 to use a hash table of dynamically allocated zone
   write plugs. This results in changes in patch 11 and the addition of
   patch 8 and 9.
 - Rebased everything on 6.9.0-rc1
 - Added review tags for patches that did not change

Damien Le Moal (30):
  block: Do not force full zone append completion in req_bio_endio()
  block: Restore sector of flush requests
  block: Remove req_bio_endio()
  block: Introduce blk_zone_update_request_bio()
  block: Introduce bio_straddles_zones() and
    bio_offset_from_zone_start()
  block: Allow using bio_attempt_back_merge() internally
  block: Remember zone capacity when revalidating zones
  block: Introduce zone write plugging
  block: Pre-allocate zone write plugs
  block: Fake max open zones limit when there is no limit
  block: Allow zero value of max_zone_append_sectors queue limit
  block: Implement zone append emulation
  block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
  dm: Use the block layer zone append emulation
  scsi: sd: Use the block layer zone append emulation
  ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  null_blk: Introduce zone_append_max_sectors attribute
  null_blk: Introduce fua attribute
  nvmet: zns: Do not reference the gendisk conv_zones_bitmap
  block: Remove BLK_STS_ZONE_RESOURCE
  block: Simplify blk_revalidate_disk_zones() interface
  block: mq-deadline: Remove support for zone write locking
  block: Remove elevator required features
  block: Do not check zone type in blk_check_zone_append()
  block: Move zone related debugfs attribute to blk-zoned.c
  block: Replace zone_wlock debugfs entry with zone_wplugs entry
  block: Remove zone write locking
  block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED
  block: Do not special-case plugging of zone write operations

 block/Kconfig                     |    5 -
 block/Makefile                    |    1 -
 block/bio.c                       |    7 +
 block/blk-core.c                  |   11 +-
 block/blk-flush.c                 |    1 +
 block/blk-merge.c                 |   22 +-
 block/blk-mq-debugfs-zoned.c      |   22 -
 block/blk-mq-debugfs.c            |    3 +-
 block/blk-mq-debugfs.h            |    6 +-
 block/blk-mq.c                    |  140 ++-
 block/blk-mq.h                    |   31 -
 block/blk-settings.c              |   46 +-
 block/blk-sysfs.c                 |    2 +-
 block/blk-zoned.c                 | 1412 +++++++++++++++++++++++++++--
 block/blk.h                       |   69 +-
 block/elevator.c                  |   46 +-
 block/elevator.h                  |    1 -
 block/genhd.c                     |    3 +-
 block/mq-deadline.c               |  176 +---
 drivers/block/null_blk/main.c     |   24 +-
 drivers/block/null_blk/null_blk.h |    2 +
 drivers/block/null_blk/zoned.c    |   23 +-
 drivers/block/ublk_drv.c          |    5 +-
 drivers/block/virtio_blk.c        |    2 +-
 drivers/md/dm-core.h              |    2 +-
 drivers/md/dm-zone.c              |  476 +---------
 drivers/md/dm.c                   |   75 +-
 drivers/md/dm.h                   |    4 +-
 drivers/nvme/host/core.c          |    2 +-
 drivers/nvme/target/zns.c         |   10 +-
 drivers/scsi/scsi_lib.c           |    1 -
 drivers/scsi/sd.c                 |    8 -
 drivers/scsi/sd.h                 |   19 -
 drivers/scsi/sd_zbc.c             |  335 +------
 include/linux/blk-mq.h            |   85 +-
 include/linux/blk_types.h         |   30 +-
 include/linux/blkdev.h            |  104 +--
 37 files changed, 1745 insertions(+), 1466 deletions(-)
 delete mode 100644 block/blk-mq-debugfs-zoned.c

-- 
2.44.0



* [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio()
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:10   ` Christoph Hellwig
  2024-03-28 18:14   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 02/30] block: Restore sector of flush requests Damien Le Moal
                   ` (29 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

This reverts commit 748dc0b65ec2b4b7b3dbd7befcc4a54fdcac7988.

Partial zone append completions cannot be supported as there is no
guarantee that the fragmented data will be written sequentially in the
same manner as with a full command. Commit 748dc0b65ec2 ("block: fix
partial zone append completion handling in req_bio_endio()") changed
req_bio_endio() to always advance a partially failed BIO by its full
length, but this can lead to incorrect accounting. So revert this
change and let low-level device drivers handle this case by always
failing zone append operations entirely. With this revert, users will
still see an IO error for a partially completed zone append BIO.

Fixes: 748dc0b65ec2 ("block: fix partial zone append completion handling in req_bio_endio()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 555ada922cf0..32afb87efbd0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -770,16 +770,11 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 		/*
 		 * Partial zone append completions cannot be supported as the
 		 * BIO fragments may end up not being written sequentially.
-		 * For such case, force the completed nbytes to be equal to
-		 * the BIO size so that bio_advance() sets the BIO remaining
-		 * size to 0 and we end up calling bio_endio() before returning.
 		 */
-		if (bio->bi_iter.bi_size != nbytes) {
+		if (bio->bi_iter.bi_size != nbytes)
 			bio->bi_status = BLK_STS_IOERR;
-			nbytes = bio->bi_iter.bi_size;
-		} else {
+		else
 			bio->bi_iter.bi_sector = rq->__sector;
-		}
 	}
 
 	bio_advance(bio, nbytes);
-- 
2.44.0



* [PATCH v3 02/30] block: Restore sector of flush requests
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
  2024-03-28  0:43 ` [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio() Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  0:43 ` [PATCH v3 03/30] block: Remove req_bio_endio() Damien Le Moal
                   ` (28 subsequent siblings)
  30 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On completion of a flush sequence, blk_flush_restore_request() restores
the bio of a request to the original submitted BIO. However, the last
use of the request in the flush sequence may have been for a POSTFLUSH,
which does not have a sector. So make sure to restore the request sector
using the iter sector of the original BIO. This BIO has not changed at
that point, since the completions of the intermediate steps of the flush
sequence requeue the request until all steps are completed.

Restoring the request sector ensures that blk_mq_end_request() will see
a valid sector, as originally set when the flush BIO was submitted.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-flush.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index b0f314f4bc14..2f58ae018464 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -130,6 +130,7 @@ static void blk_flush_restore_request(struct request *rq)
 	 * original @rq->bio.  Restore it.
 	 */
 	rq->bio = rq->biotail;
+	rq->__sector = rq->bio->bi_iter.bi_sector;
 
 	/* make @rq a normal request */
 	rq->rq_flags &= ~RQF_FLUSH_SEQ;
-- 
2.44.0



* [PATCH v3 03/30] block: Remove req_bio_endio()
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
  2024-03-28  0:43 ` [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio() Damien Le Moal
  2024-03-28  0:43 ` [PATCH v3 02/30] block: Restore sector of flush requests Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:13   ` Christoph Hellwig
  2024-03-28 21:28   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio() Damien Le Moal
                   ` (27 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Move the code of req_bio_endio() into its only caller,
blk_update_request(), to reduce accesses to and tests of bio and
request fields. Also, given that partial completions of zone append
operations are not possible and that zone append operations cannot be
merged, the update of the BIO sector using the request sector for these
operations can be moved directly before the call to bio_endio().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq.c | 58 ++++++++++++++++++++++++--------------------------
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 32afb87efbd0..e55af6058cbf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -761,31 +761,6 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
 }
 EXPORT_SYMBOL(blk_dump_rq_flags);
 
-static void req_bio_endio(struct request *rq, struct bio *bio,
-			  unsigned int nbytes, blk_status_t error)
-{
-	if (unlikely(error)) {
-		bio->bi_status = error;
-	} else if (req_op(rq) == REQ_OP_ZONE_APPEND) {
-		/*
-		 * Partial zone append completions cannot be supported as the
-		 * BIO fragments may end up not being written sequentially.
-		 */
-		if (bio->bi_iter.bi_size != nbytes)
-			bio->bi_status = BLK_STS_IOERR;
-		else
-			bio->bi_iter.bi_sector = rq->__sector;
-	}
-
-	bio_advance(bio, nbytes);
-
-	if (unlikely(rq->rq_flags & RQF_QUIET))
-		bio_set_flag(bio, BIO_QUIET);
-	/* don't actually finish bio if it's part of flush sequence */
-	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
-		bio_endio(bio);
-}
-
 static void blk_account_io_completion(struct request *req, unsigned int bytes)
 {
 	if (req->part && blk_do_io_stat(req)) {
@@ -889,6 +864,8 @@ static void blk_complete_request(struct request *req)
 bool blk_update_request(struct request *req, blk_status_t error,
 		unsigned int nr_bytes)
 {
+	bool is_flush = req->rq_flags & RQF_FLUSH_SEQ;
+	bool quiet = req->rq_flags & RQF_QUIET;
 	int total_bytes;
 
 	trace_block_rq_complete(req, error, nr_bytes);
@@ -909,9 +886,8 @@ bool blk_update_request(struct request *req, blk_status_t error,
 	if (blk_crypto_rq_has_keyslot(req) && nr_bytes >= blk_rq_bytes(req))
 		__blk_crypto_rq_put_keyslot(req);
 
-	if (unlikely(error && !blk_rq_is_passthrough(req) &&
-		     !(req->rq_flags & RQF_QUIET)) &&
-		     !test_bit(GD_DEAD, &req->q->disk->state)) {
+	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
+	    !test_bit(GD_DEAD, &req->q->disk->state)) {
 		blk_print_req_error(req, error);
 		trace_block_rq_error(req, error, nr_bytes);
 	}
@@ -923,12 +899,34 @@ bool blk_update_request(struct request *req, blk_status_t error,
 		struct bio *bio = req->bio;
 		unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
 
-		if (bio_bytes == bio->bi_iter.bi_size)
+		if (unlikely(error))
+			bio->bi_status = error;
+
+		if (bio_bytes == bio->bi_iter.bi_size) {
 			req->bio = bio->bi_next;
+		} else if (req_op(req) == REQ_OP_ZONE_APPEND &&
+			   error == BLK_STS_OK) {
+			/*
+			 * Partial zone append completions cannot be supported
+			 * as the BIO fragments may end up not being written
+			 * sequentially.
+			 */
+			bio->bi_status = BLK_STS_IOERR;
+		}
 
 		/* Completion has already been traced */
 		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
-		req_bio_endio(req, bio, bio_bytes, error);
+		if (unlikely(quiet))
+			bio_set_flag(bio, BIO_QUIET);
+
+		bio_advance(bio, bio_bytes);
+
+		/* Don't actually finish bio if it's part of flush sequence */
+		if (!bio->bi_iter.bi_size && !is_flush) {
+			if (req_op(req) == REQ_OP_ZONE_APPEND)
+				bio->bi_iter.bi_sector = req->__sector;
+			bio_endio(bio);
+		}
 
 		total_bytes += bio_bytes;
 		nr_bytes -= bio_bytes;
-- 
2.44.0



* [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio()
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (2 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 03/30] block: Remove req_bio_endio() Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:14   ` Christoph Hellwig
  2024-03-28 21:31   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 05/30] block: Introduce bio_straddles_zones() and bio_offset_from_zone_start() Damien Le Moal
                   ` (26 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On completion of a zone append request, the request sector indicates the
location of the written data. This value must be returned to the user
through the BIO iter sector. This is done in 2 places: in
blk_complete_request() and in blk_update_request(). Introduce the inline
helper function blk_zone_update_request_bio() to avoid duplicating
this BIO update for zone append requests, and to compile out this
helper call when CONFIG_BLK_DEV_ZONED is not enabled.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq.c | 11 +++++------
 block/blk.h    | 19 ++++++++++++++++++-
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e55af6058cbf..70dfb4af65cf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -820,11 +820,11 @@ static void blk_complete_request(struct request *req)
 		/* Completion has already been traced */
 		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
 
-		if (req_op(req) == REQ_OP_ZONE_APPEND)
-			bio->bi_iter.bi_sector = req->__sector;
-
-		if (!is_flush)
+		if (!is_flush) {
+			blk_zone_update_request_bio(req, bio);
 			bio_endio(bio);
+		}
+
 		bio = next;
 	} while (bio);
 
@@ -923,8 +923,7 @@ bool blk_update_request(struct request *req, blk_status_t error,
 
 		/* Don't actually finish bio if it's part of flush sequence */
 		if (!bio->bi_iter.bi_size && !is_flush) {
-			if (req_op(req) == REQ_OP_ZONE_APPEND)
-				bio->bi_iter.bi_sector = req->__sector;
+			blk_zone_update_request_bio(req, bio);
 			bio_endio(bio);
 		}
 
diff --git a/block/blk.h b/block/blk.h
index 5cac4e29ae17..a12cde1d45de 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -409,12 +409,29 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
 
 #ifdef CONFIG_BLK_DEV_ZONED
 void disk_free_zone_bitmaps(struct gendisk *disk);
+static inline void blk_zone_update_request_bio(struct request *rq,
+					       struct bio *bio)
+{
+	/*
+	 * For zone append requests, the request sector indicates the location
+	 * at which the BIO data was written. Return this value to the BIO
+	 * issuer through the BIO iter sector.
+	 */
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		bio->bi_iter.bi_sector = rq->__sector;
+}
 int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 		unsigned long arg);
 int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		unsigned int cmd, unsigned long arg);
 #else /* CONFIG_BLK_DEV_ZONED */
-static inline void disk_free_zone_bitmaps(struct gendisk *disk) {}
+static inline void disk_free_zone_bitmaps(struct gendisk *disk)
+{
+}
+static inline void blk_zone_update_request_bio(struct request *rq,
+					       struct bio *bio)
+{
+}
 static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
 		unsigned int cmd, unsigned long arg)
 {
-- 
2.44.0



* [PATCH v3 05/30] block: Introduce bio_straddles_zones() and bio_offset_from_zone_start()
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (3 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio() Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28 21:32   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 06/30] block: Allow using bio_attempt_back_merge() internally Damien Le Moal
                   ` (25 subsequent siblings)
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Implement the inline helper functions bio_straddles_zones() and
bio_offset_from_zone_start() to respectively test if a BIO crosses a
zone boundary (the start sector and last sector belong to different
zones) and to obtain the offset of a BIO from the start sector of its
target zone.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/blkdev.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c3e8f7cf96be..ec7bd7091467 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -853,6 +853,13 @@ static inline unsigned int bio_zone_no(struct bio *bio)
 	return disk_zone_no(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector);
 }
 
+static inline bool bio_straddles_zones(struct bio *bio)
+{
+	return bio_sectors(bio) &&
+		bio_zone_no(bio) !=
+		disk_zone_no(bio->bi_bdev->bd_disk, bio_end_sector(bio) - 1);
+}
+
 static inline unsigned int bio_zone_is_seq(struct bio *bio)
 {
 	return disk_zone_is_seq(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector);
@@ -1328,6 +1335,12 @@ static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
 	return sector & (bdev_zone_sectors(bdev) - 1);
 }
 
+static inline sector_t bio_offset_from_zone_start(struct bio *bio)
+{
+	return bdev_offset_from_zone_start(bio->bi_bdev,
+					   bio->bi_iter.bi_sector);
+}
+
 static inline bool bdev_is_zone_start(struct block_device *bdev,
 				      sector_t sector)
 {
-- 
2.44.0



* [PATCH v3 06/30] block: Allow using bio_attempt_back_merge() internally
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (4 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 05/30] block: Introduce bio_straddles_zones() and bio_offset_from_zone_start() Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  0:43 ` [PATCH v3 07/30] block: Remember zone capacity when revalidating zones Damien Le Moal
                   ` (24 subsequent siblings)
  30 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Remove "static" from the definition of bio_attempt_back_merge() and
declare this function in block/blk.h to allow using it internally from
other block layer files. The definition of enum bio_merge_status is
also moved to block/blk.h.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-merge.c | 8 +-------
 block/blk.h       | 8 ++++++++
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2a06fd33039d..88367c10c8bc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -972,13 +972,7 @@ static void blk_account_io_merge_bio(struct request *req)
 	part_stat_unlock();
 }
 
-enum bio_merge_status {
-	BIO_MERGE_OK,
-	BIO_MERGE_NONE,
-	BIO_MERGE_FAILED,
-};
-
-static enum bio_merge_status bio_attempt_back_merge(struct request *req,
+enum bio_merge_status bio_attempt_back_merge(struct request *req,
 		struct bio *bio, unsigned int nr_segs)
 {
 	const blk_opf_t ff = bio_failfast(bio);
diff --git a/block/blk.h b/block/blk.h
index a12cde1d45de..f2a521b72f9d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -269,6 +269,14 @@ static inline void bio_integrity_free(struct bio *bio)
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
 
+enum bio_merge_status {
+	BIO_MERGE_OK,
+	BIO_MERGE_NONE,
+	BIO_MERGE_FAILED,
+};
+
+enum bio_merge_status bio_attempt_back_merge(struct request *req,
+		struct bio *bio, unsigned int nr_segs);
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs);
 bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
-- 
2.44.0



* [PATCH v3 07/30] block: Remember zone capacity when revalidating zones
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (5 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 06/30] block: Allow using bio_attempt_back_merge() internally Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28 21:38   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 08/30] block: Introduce zone write plugging Damien Le Moal
                   ` (23 subsequent siblings)
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

In preparation for adding zone write plugging, modify
blk_revalidate_disk_zones() to get the capacity of zones of a zoned
block device. This capacity value as a number of 512B sectors is stored
in the gendisk zone_capacity field.

Given that host-managed SMR disks (including zoned UFS drives) and all
known NVMe ZNS devices have the same zone capacity for all zones,
blk_revalidate_disk_zones() returns an error if different capacities
are detected for different zones.

This also adds checks to verify that the zone capacity values reported
by the device are correct, that is, that the zone capacity is never 0,
does not exceed the zone size, and is equal to the zone size for
conventional zones.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-zoned.c      | 26 ++++++++++++++++++++++++++
 include/linux/blkdev.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index da0f4b2a8fa0..23d9bb21c459 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -438,6 +438,7 @@ struct blk_revalidate_zone_args {
 	unsigned long	*conv_zones_bitmap;
 	unsigned long	*seq_zones_wlock;
 	unsigned int	nr_zones;
+	unsigned int	zone_capacity;
 	sector_t	sector;
 };
 
@@ -482,9 +483,20 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		return -ENODEV;
 	}
 
+	if (!zone->capacity || zone->capacity > zone->len) {
+		pr_warn("%s: Invalid zone capacity\n",
+			disk->disk_name);
+		return -ENODEV;
+	}
+
 	/* Check zone type */
 	switch (zone->type) {
 	case BLK_ZONE_TYPE_CONVENTIONAL:
+		if (zone->capacity != zone->len) {
+			pr_warn("%s: Invalid conventional zone capacity\n",
+				disk->disk_name);
+			return -ENODEV;
+		}
 		if (!args->conv_zones_bitmap) {
 			args->conv_zones_bitmap =
 				blk_alloc_zone_bitmap(q->node, args->nr_zones);
@@ -500,6 +512,18 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 			if (!args->seq_zones_wlock)
 				return -ENOMEM;
 		}
+
+		/*
+		 * Remember the capacity of the first sequential zone and check
+		 * if it is constant for all zones.
+		 */
+		if (!args->zone_capacity)
+			args->zone_capacity = zone->capacity;
+		if (zone->capacity != args->zone_capacity) {
+			pr_warn("%s: Invalid variable zone capacity\n",
+				disk->disk_name);
+			return -ENODEV;
+		}
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
 	default:
@@ -595,6 +619,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	blk_mq_freeze_queue(q);
 	if (ret > 0) {
 		disk->nr_zones = args.nr_zones;
+		disk->zone_capacity = args.zone_capacity;
 		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
 		if (update_driver_data)
@@ -608,6 +633,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 
 	kfree(args.seq_zones_wlock);
 	kfree(args.conv_zones_bitmap);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ec7bd7091467..4e81f714cca7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -191,6 +191,7 @@ struct gendisk {
 	 * blk_mq_unfreeze_queue().
 	 */
 	unsigned int		nr_zones;
+	unsigned int		zone_capacity;
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
 #endif /* CONFIG_BLK_DEV_ZONED */
-- 
2.44.0



* [PATCH v3 08/30] block: Introduce zone write plugging
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (6 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 07/30] block: Remember zone capacity when revalidating zones Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:48   ` Christoph Hellwig
  2024-03-28 22:20   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 09/30] block: Pre-allocate zone write plugs Damien Le Moal
                   ` (22 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most one write request
per zone being executed. This mechanism is intended to replace zone
write locking, which implements a similar per-zone write throttling at
the IO scheduler level but is implemented only by mq-deadline.

Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.

This mechanism makes it possible to:
 - Untangle zone write ordering from block IO schedulers. This allows
   removing the restriction on using mq-deadline for writing to zoned
   block devices. Any block IO scheduler, including "none", can be used.
 - Operate on BIOs instead of requests: plugged BIOs waiting for
   execution do not hold scheduling tags and so do not prevent other
   BIOs from executing (reads or writes to other zones). Depending on
   the workload, this can significantly improve device utilization
   (higher queue depth operation) and performance.
 - Support both blk-mq (request based) zoned devices and BIO-based
   zoned devices (e.g. device mapper). Zone write plugging is mandatory
   for the former but optional for the latter. BIO-based drivers can
   use zone write plugging to implement write ordering guarantees, or
   implement their own mechanism if needed.
 - The code is less invasive in the block layer and is mostly limited to
   blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
   bio.c.

Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plugs structures are
managed using a per-disk hash table.

Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio(), which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split, to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugs. This change enables zone write plugging by default for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by explicitly calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and does not
straddle zone boundaries.

Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of BIOs and requests carrying these flags triggers
calls to the functions blk_zone_write_plug_bio_endio() and
blk_zone_write_plug_complete_request() respectively. The latter
function is used to trigger submission of the next plugged BIO using
the zone plug work. blk_zone_write_plug_bio_endio() does the same for
BIO-based devices. This ensures that at any time, at most one request
(blk-mq devices) or one BIO (BIO-based devices) is being executed for
any zone. The handling of zone write plugs using a per-zone plug
spinlock maximizes parallelism and device usage by allowing multiple
zones to be written simultaneously without lock contention.

Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.

Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bi_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.

When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.

Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
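
In pseudo-code terms, and using simplified naming (zone_size here
stands for the disk zone size in 512B sectors), the write pointer
offset tracking amounts to:

	zwplug->wp_offset += bio_sectors(bio);	/* zone write */
	zwplug->wp_offset = 0;			/* zone reset */
	zwplug->wp_offset = zone_size;		/* zone finish */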

If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plug work keeps running
until all zone errors are handled.

To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plugs resources when a gendisk
is allocated.

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/bio.c               |    7 +
 block/blk-merge.c         |   11 +
 block/blk-mq.c            |   38 +-
 block/blk-zoned.c         | 1098 ++++++++++++++++++++++++++++++++++++-
 block/blk.h               |   36 +-
 block/genhd.c             |    3 +-
 include/linux/blk-mq.h    |    2 +
 include/linux/blk_types.h |    8 +-
 include/linux/blkdev.h    |   11 +
 9 files changed, 1203 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index d24420ed1c4c..4ece8cef1fbe 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1576,6 +1576,13 @@ void bio_endio(struct bio *bio)
 	if (!bio_integrity_endio(bio))
 		return;
 
+	/*
+	 * For BIOs handled through a zone write plug, signal the completion
+	 * of the BIO so that the next plugged BIO can be submitted.
+	 */
+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_bio_endio(bio);
+
 	rq_qos_done_bio(bio);
 
 	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 88367c10c8bc..b96466d2ba94 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -377,6 +377,7 @@ struct bio *__bio_split_to_limits(struct bio *bio,
 		blkcg_bio_issue_init(split);
 		bio_chain(split, bio);
 		trace_block_split(split, bio->bi_iter.bi_sector);
+		WARN_ON_ONCE(bio_zone_write_plugging(bio));
 		submit_bio_noacct(bio);
 		return split;
 	}
@@ -988,6 +989,9 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
 
 	blk_update_mixed_merge(req, bio, false);
 
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		blk_zone_write_plug_bio_merged(bio);
+
 	req->biotail->bi_next = bio;
 	req->biotail = bio;
 	req->__data_len += bio->bi_iter.bi_size;
@@ -1003,6 +1007,13 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
 {
 	const blk_opf_t ff = bio_failfast(bio);
 
+	/*
+	 * A front merge for zone writes can happen only if the user submitted
+	 * writes out of order. Do not attempt this to let the write fail.
+	 */
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		return BIO_MERGE_FAILED;
+
 	if (!ll_front_merge_fn(req, bio, nr_segs))
 		return BIO_MERGE_FAILED;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 70dfb4af65cf..4b8dd2e7b870 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -828,6 +828,9 @@ static void blk_complete_request(struct request *req)
 		bio = next;
 	} while (bio);
 
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		blk_zone_write_plug_complete_request(req);
+
 	/*
 	 * Reset counters so that the request stacking driver
 	 * can find how many bytes remain in the request
@@ -938,6 +941,9 @@ bool blk_update_request(struct request *req, blk_status_t error,
 	 * completely done
 	 */
 	if (!req->bio) {
+		if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+			blk_zone_write_plug_complete_request(req);
+
 		/*
 		 * Reset counters so that the request stacking driver
 		 * can find how many bytes remain in the request
@@ -2952,15 +2958,30 @@ void blk_mq_submit_bio(struct bio *bio)
 	struct request *rq;
 	blk_status_t ret;
 
+	/*
+	 * If the plug has a cached request for this queue, try use it.
+	 */
+	rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
+
+	/*
+	 * A BIO that was released from a zone write plug has already been
+	 * through the preparation in this function, already holds a reference
+	 * on the queue usage counter, and is the only write BIO in-flight for
+	 * the target zone. Go straight to preparing a request for it.
+	 */
+	if (bio_zone_write_plugging(bio)) {
+		nr_segs = bio->__bi_nr_segments;
+		if (rq)
+			blk_queue_exit(q);
+		goto new_request;
+	}
+
 	bio = blk_queue_bounce(bio, q);
 
 	/*
-	 * If the plug has a cached request for this queue, try use it.
-	 *
 	 * The cached request already holds a q_usage_counter reference and we
 	 * don't have to acquire a new one if we use it.
 	 */
-	rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
 	if (!rq) {
 		if (unlikely(bio_queue_enter(bio)))
 			return;
@@ -2977,6 +2998,10 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
 		goto queue_exit;
 
+	if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs))
+		goto queue_exit;
+
+new_request:
 	if (!rq) {
 		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
 		if (unlikely(!rq))
@@ -2993,8 +3018,12 @@ void blk_mq_submit_bio(struct bio *bio)
 
 	ret = blk_crypto_rq_get_keyslot(rq);
 	if (ret != BLK_STS_OK) {
+		bool zwplugging = bio_zone_write_plugging(bio);
+
 		bio->bi_status = ret;
 		bio_endio(bio);
+		if (zwplugging)
+			blk_zone_write_plug_complete_request(rq);
 		blk_mq_free_request(rq);
 		return;
 	}
@@ -3002,6 +3031,9 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
 		return;
 
+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_attempt_merge(rq);
+
 	if (plug) {
 		blk_add_rq_to_plug(plug, rq);
 		return;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 23d9bb21c459..03083522df84 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -7,6 +7,7 @@
  *
  * Copyright (c) 2016, Damien Le Moal
  * Copyright (c) 2016, Western Digital
+ * Copyright (c) 2024, Western Digital Corporation or its affiliates.
  */
 
 #include <linux/kernel.h>
@@ -16,8 +17,11 @@
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
 #include <linux/sched/mm.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
 
 #include "blk.h"
+#include "blk-mq-sched.h"
 
 #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
 static const char *const zone_cond_name[] = {
@@ -32,6 +36,62 @@ static const char *const zone_cond_name[] = {
 };
 #undef ZONE_COND_NAME
 
+/*
+ * Per-zone write plug.
+ * @node: hlist_node structure for managing the plug using a hash table.
+ * @link: To list the plug in the zone write plug error list of the disk.
+ * @ref: Zone write plug reference counter. A zone write plug reference is
+ *       always at least 1 when the plug is hashed in the disk plug hash table.
+ *       The reference is incremented whenever a new BIO needing plugging is
+ *       submitted and when a function needs to manipulate a plug. The
+ *       reference count is decremented whenever a plugged BIO completes and
+ *       when a function that referenced the plug returns. The initial
+ *       reference is dropped whenever the zone of the zone write plug is reset,
+ *       finished and when the zone becomes full (last write BIO to the zone
+ *       completes).
+ * @lock: Spinlock to atomically manipulate the plug.
+ * @flags: Flags indicating the plug state.
+ * @zone_no: The number of the zone the plug is managing.
+ * @wp_offset: The zone write pointer location relative to the start of the zone
+ *             as a number of 512B sectors.
+ * @bio_list: The list of BIOs that are currently plugged.
+ * @bio_work: Work struct to handle issuing of plugged BIOs
+ * @rcu_head: RCU head to free zone write plugs with an RCU grace period.
+ */
+struct blk_zone_wplug {
+	struct hlist_node	node;
+	struct list_head	link;
+	atomic_t		ref;
+	spinlock_t		lock;
+	unsigned int		flags;
+	unsigned int		zone_no;
+	unsigned int		wp_offset;
+	struct bio_list		bio_list;
+	struct work_struct	bio_work;
+	struct rcu_head		rcu_head;
+};
+
+/*
+ * Zone write plug flags bits:
+ *  - BLK_ZONE_WPLUG_PLUGGED: Indicate that the zone write plug is plugged,
+ *    that is, that write BIOs are being throttled due to a write BIO already
+ *    being executed or the zone write plug bio list is not empty.
+ *  - BLK_ZONE_WPLUG_ERROR: Indicate that a write error happened which will be
+ *    recovered with a report zone to update the zone write pointer offset.
+ *  - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
+ *    from the disk hash table and that the initial reference to the zone
+ *    write plug set when the plug was first added to the hash table has been
+ *    dropped. This flag is set when a zone is reset, finished or become full,
+ *    to prevent new references to the zone write plug to be taken for
+ *    newly incoming BIOs. A zone write plug flagged with this flag will be
+ *    freed once all remaining references from BIOs or functions are dropped.
+ */
+#define BLK_ZONE_WPLUG_PLUGGED		(1U << 0)
+#define BLK_ZONE_WPLUG_ERROR		(1U << 1)
+#define BLK_ZONE_WPLUG_UNHASHED		(1U << 2)
+
+#define BLK_ZONE_WPLUG_BUSY	(BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR)
+
 /**
  * blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
  * @zone_cond: BLK_ZONE_COND_XXX.
@@ -425,12 +485,1020 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 	return ret;
 }
 
-void disk_free_zone_bitmaps(struct gendisk *disk)
+static inline bool disk_zone_is_conv(struct gendisk *disk, sector_t sector)
+{
+	if (!disk->conv_zones_bitmap)
+		return false;
+	return test_bit(disk_zone_no(disk, sector), disk->conv_zones_bitmap);
+}
+
+static inline bool bio_zone_is_conv(struct bio *bio)
+{
+	return disk_zone_is_conv(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector);
+}
+
+static void blk_zone_wplug_bio_work(struct work_struct *work);
+
+static void disk_init_zone_wplug(struct gendisk *disk,
+				 struct blk_zone_wplug *zwplug,
+				 unsigned int flags, sector_t sector)
+{
+	unsigned int zno = disk_zone_no(disk, sector);
+
+	/*
+	 * Initialize the zone write plug with an extra reference so that
+	 * it is not freed when the zone write plug becomes idle without
+	 * the zone being full.
+	 */
+	INIT_HLIST_NODE(&zwplug->node);
+	INIT_LIST_HEAD(&zwplug->link);
+	atomic_set(&zwplug->ref, 2);
+	spin_lock_init(&zwplug->lock);
+	zwplug->flags = flags;
+	zwplug->zone_no = zno;
+	zwplug->wp_offset = sector & (disk->queue->limits.chunk_sectors - 1);
+	bio_list_init(&zwplug->bio_list);
+	INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
+}
+
+static struct blk_zone_wplug *disk_alloc_zone_wplug(struct gendisk *disk,
+						sector_t sector, gfp_t gfp_mask)
+{
+	struct blk_zone_wplug *zwplug;
+
+	/* Allocate a new zone write plug. */
+	zwplug = kmalloc(sizeof(struct blk_zone_wplug), gfp_mask);
+	if (!zwplug)
+		return NULL;
+
+	disk_init_zone_wplug(disk, zwplug, 0, sector);
+
+	return zwplug;
+}
+
+static bool disk_insert_zone_wplug(struct gendisk *disk,
+				   struct blk_zone_wplug *zwplug)
+{
+	struct blk_zone_wplug *zwplg;
+	unsigned long flags;
+	unsigned int idx =
+		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
+
+	/*
+	 * Add the new zone write plug to the hash table, but carefully as we
+	 * are racing with other submission context, so we may already have a
+	 * zone write plug for the same zone.
+	 */
+	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
+		if (zwplg->zone_no == zwplug->zone_no) {
+			spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+			return false;
+		}
+	}
+	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
+	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+
+	return true;
+}
+
+static void disk_remove_zone_wplug(struct gendisk *disk,
+				   struct blk_zone_wplug *zwplug)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+	zwplug->flags |= BLK_ZONE_WPLUG_UNHASHED;
+	atomic_dec(&zwplug->ref);
+	hlist_del_init_rcu(&zwplug->node);
+	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+}
+
+static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
+						 struct blk_zone_wplug *zwplug)
+{
+	/* If the zone is still busy, the plug cannot be removed. */
+	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+		return false;
+
+	/* We can remove zone write plugs for zones that are empty or full. */
+	return !zwplug->wp_offset ||
+		zwplug->wp_offset >= disk->zone_capacity;
+}
+
+static inline struct blk_zone_wplug *
+disk_lookup_zone_wplug(struct gendisk *disk, sector_t sector)
+{
+	unsigned int zno = disk_zone_no(disk, sector);
+	unsigned int idx = hash_32(zno, disk->zone_wplugs_hash_bits);
+	struct blk_zone_wplug *zwplug;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(zwplug, &disk->zone_wplugs_hash[idx], node) {
+		if (zwplug->zone_no == zno)
+			goto unlock;
+	}
+	zwplug = NULL;
+
+unlock:
+	rcu_read_unlock();
+	return zwplug;
+}
+
+static inline struct blk_zone_wplug *bio_lookup_zone_wplug(struct bio *bio)
+{
+	return disk_lookup_zone_wplug(bio->bi_bdev->bd_disk,
+				      bio->bi_iter.bi_sector);
+}
+
+static inline void blk_get_zone_wplug(struct blk_zone_wplug *zwplug)
+{
+	atomic_inc(&zwplug->ref);
+}
+
+static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
+						  sector_t sector)
+{
+	struct blk_zone_wplug *zwplug;
+
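+	/*
+	 * Zone write plugs are freed with an RCU grace period once their
+	 * reference count drops to zero, so only take a reference here if
+	 * the count has not already reached zero.
+	 */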
+	rcu_read_lock();
+	zwplug = disk_lookup_zone_wplug(disk, sector);
+	if (zwplug && !atomic_inc_not_zero(&zwplug->ref))
+		zwplug = NULL;
+	rcu_read_unlock();
+
+	return zwplug;
+}
+
+static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
+{
+	if (atomic_dec_and_test(&zwplug->ref)) {
+		WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
+		WARN_ON_ONCE(!list_empty(&zwplug->link));
+
+		kfree_rcu(zwplug, rcu_head);
+	}
+}
+
+static void blk_zone_wplug_bio_work(struct work_struct *work);
+
+/*
+ * Get a reference on the write plug for the zone containing @sector.
+ * If the plug does not exist, it is allocated and hashed.
+ * Return a pointer to the zone write plug with the plug spinlock held.
+ */
+static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
+					sector_t sector, gfp_t gfp_mask,
+					unsigned long *flags)
+{
+	struct blk_zone_wplug *zwplug;
+
+again:
+	zwplug = disk_get_zone_wplug(disk, sector);
+	if (zwplug) {
+		/*
+		 * Check that a BIO completion or a zone reset or finish
+		 * operation has not already removed the zone write plug from
+		 * the hash table and dropped its reference count. In that case,
+		 * we need to get a new plug, so start over from the beginning.
+		 */
+		spin_lock_irqsave(&zwplug->lock, *flags);
+		if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) {
+			spin_unlock_irqrestore(&zwplug->lock, *flags);
+			disk_put_zone_wplug(zwplug);
+			goto again;
+		}
+		return zwplug;
+	}
+
+	zwplug = disk_alloc_zone_wplug(disk, sector, gfp_mask);
+	if (!zwplug)
+		return NULL;
+
+	spin_lock_irqsave(&zwplug->lock, *flags);
+
+	/*
+	 * Insert the new zone write plug in the hash table. This can fail only
+	 * if another context already inserted a plug. In that case, retry
+	 * from the beginning.
+	 */
+	if (!disk_insert_zone_wplug(disk, zwplug)) {
+		spin_unlock_irqrestore(&zwplug->lock, *flags);
+		kfree(zwplug);
+		goto again;
+	}
+
+	return zwplug;
+}
+
+static struct blk_zone_wplug *bio_get_and_lock_zone_wplug(struct bio *bio,
+							  unsigned long *flags)
+{
+	gfp_t gfp_mask;
+
+	if (bio->bi_opf & REQ_NOWAIT)
+		gfp_mask = GFP_NOWAIT;
+	else
+		gfp_mask = GFP_NOIO;
+
+	return disk_get_and_lock_zone_wplug(bio->bi_bdev->bd_disk,
+				bio->bi_iter.bi_sector, gfp_mask, flags);
+}
+
+static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+
+	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+	bio_io_error(bio);
+	blk_queue_exit(q);
+}
+
+/*
+ * Abort (fail) all plugged BIOs of a zone write plug.
+ */
+static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
+{
+	struct bio *bio;
+
+	while ((bio = bio_list_pop(&zwplug->bio_list))) {
+		blk_zone_wplug_bio_io_error(bio);
+		disk_put_zone_wplug(zwplug);
+	}
+}
+
+/*
+ * Abort (fail) all plugged BIOs of a zone write plug that are not aligned
+ * with the assumed write pointer location of the zone when the BIO will
+ * be unplugged.
+ */
+static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
+					    struct blk_zone_wplug *zwplug)
+{
+	unsigned int zone_capacity = disk->zone_capacity;
+	unsigned int wp_offset = zwplug->wp_offset;
+	struct bio_list bl = BIO_EMPTY_LIST;
+	struct bio *bio;
+
+	while ((bio = bio_list_pop(&zwplug->bio_list))) {
+		if (wp_offset >= zone_capacity ||
+		     bio_offset_from_zone_start(bio) != wp_offset) {
+			blk_zone_wplug_bio_io_error(bio);
+			disk_put_zone_wplug(zwplug);
+			continue;
+		}
+
+		wp_offset += bio_sectors(bio);
+		bio_list_add(&bl, bio);
+	}
+
+	bio_list_merge(&zwplug->bio_list, &bl);
+}
+
+/*
+ * Set a zone write plug write pointer offset to either 0 (zone reset case)
+ * or to the zone size (zone finish case). This aborts all plugged BIOs, which
+ * is fine: doing a zone reset or zone finish while writes are in flight is a
+ * user error that will most likely cause all plugged BIOs to fail anyway.
+ */
+static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
+					  struct blk_zone_wplug *zwplug,
+					  unsigned int wp_offset)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&zwplug->lock, flags);
+
+	/*
+	 * Make sure that a BIO completion or another zone reset or finish
+	 * operation has not already removed the plug from the hash table.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED) {
+		spin_unlock_irqrestore(&zwplug->lock, flags);
+		return;
+	}
+
+	/* Update the zone write pointer and abort all plugged BIOs. */
+	zwplug->wp_offset = wp_offset;
+	disk_zone_wplug_abort(zwplug);
+
+	/*
+	 * Updating the write pointer offset puts the zone back in a good
+	 * state. So clear the error flag and remove the plug from the error
+	 * list if we were in the error state.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
+		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
+		spin_lock(&disk->zone_wplugs_lock);
+		list_del_init(&zwplug->link);
+		spin_unlock(&disk->zone_wplugs_lock);
+	}
+
+	/*
+	 * The zone write plug now has no BIO plugged: remove it from the
+	 * hash table so that it cannot be seen. The plug will be freed
+	 * when the last reference is dropped.
+	 */
+	if (disk_should_remove_zone_wplug(disk, zwplug))
+		disk_remove_zone_wplug(disk, zwplug);
+
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+}
+
+static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
+						  unsigned int wp_offset)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct blk_zone_wplug *zwplug;
+
+	/* Conventional zones cannot be reset or finished. */
+	if (bio_zone_is_conv(bio)) {
+		bio_io_error(bio);
+		return true;
+	}
+
+	/*
+	 * If we have a zone write plug, set its write pointer offset to 0
+	 * (reset case) or to the zone size (finish case). This will abort all
+	 * BIOs plugged for the target zone. This is fine, as resetting or
+	 * finishing zones while writes are still in flight will result in the
+	 * writes failing anyway.
+	 */
+	zwplug = disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
+	if (zwplug) {
+		disk_zone_wplug_set_wp_offset(disk, zwplug, wp_offset);
+		disk_put_zone_wplug(zwplug);
+	}
+
+	return false;
+}
+
+static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct blk_zone_wplug *zwplug;
+	sector_t sector;
+
+	/*
+	 * Set the write pointer offset of all zone write plugs to 0. This will
+	 * abort all plugged BIOs. This is fine, as resetting zones while writes
+	 * are still in flight will result in the writes failing anyway.
+	 */
+	for (sector = 0; sector < get_capacity(disk);
+	     sector += disk->queue->limits.chunk_sectors) {
+		zwplug = disk_get_zone_wplug(disk, sector);
+		if (zwplug) {
+			disk_zone_wplug_set_wp_offset(disk, zwplug, 0);
+			disk_put_zone_wplug(zwplug);
+		}
+	}
+
+	return false;
+}
+
+static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
+					  struct bio *bio, unsigned int nr_segs)
+{
+	/*
+	 * Grab an extra reference on the BIO request queue usage counter.
+	 * This reference will be reused to submit a request for the BIO in
+	 * the blk-mq case, and dropped either when the BIO is failed or,
+	 * for BIO-based devices, after the BIO is issued.
+	 */
+	percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter);
+
+	/*
+	 * The BIO is being plugged and thus will have to wait for the ongoing
+	 * write and for all other writes already plugged. So polling makes
+	 * no sense.
+	 */
+	bio_clear_polled(bio);
+
+	/*
+	 * Reuse the poll cookie field to store the number of segments when
+	 * split to the hardware limits.
+	 */
+	bio->__bi_nr_segments = nr_segs;
+
+	/*
+	 * We always receive BIOs after they are split and ready to be issued.
+	 * The block layer passes the parts of a split BIO in order, and the
+	 * user must also issue writes sequentially. So simply add the new BIO
+	 * at the tail of the list to preserve the sequential write order.
+	 */
+	bio_list_add(&zwplug->bio_list, bio);
+}
+
+/*
+ * Called from bio_attempt_back_merge() when a BIO was merged with a request.
+ */
+void blk_zone_write_plug_bio_merged(struct bio *bio)
+{
+	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
+
+	/*
+	 * If the BIO was already plugged, then we were called through
+	 * blk_zone_write_plug_attempt_merge() -> blk_attempt_bio_merge().
+	 * In this case, blk_zone_write_plug_attempt_merge() will handle the
+	 * zone write pointer offset update.
+	 */
+	if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
+		return;
+
+	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+
+	/*
+	 * Increase the plug reference count and advance the zone write
+	 * pointer offset.
+	 */
+	zwplug = bio_lookup_zone_wplug(bio);
+	spin_lock_irqsave(&zwplug->lock, flags);
+	blk_get_zone_wplug(zwplug);
+	zwplug->wp_offset += bio_sectors(bio);
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+}
+
+/*
+ * Attempt to merge plugged BIOs with a newly prepared request for a BIO that
+ * already went through zone write plugging (either a new BIO or one that was
+ * unplugged).
+ */
+void blk_zone_write_plug_attempt_merge(struct request *req)
+{
+	sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
+	struct request_queue *q = req->q;
+	struct gendisk *disk = q->disk;
+	unsigned int zone_capacity = disk->zone_capacity;
+	struct blk_zone_wplug *zwplug =
+		disk_lookup_zone_wplug(disk, blk_rq_pos(req));
+	unsigned long flags;
+	struct bio *bio;
+
+	/*
+	 * Completion of this request needs to be handled with
+	 * blk_zone_write_complete_request().
+	 */
+	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
+	blk_get_zone_wplug(zwplug);
+
+	if (blk_queue_nomerges(q))
+		return;
+
+	/*
+	 * Walk through the list of plugged BIOs to check if they can be merged
+	 * into the back of the request.
+	 */
+	spin_lock_irqsave(&zwplug->lock, flags);
+	while (zwplug->wp_offset < zone_capacity &&
+	       (bio = bio_list_peek(&zwplug->bio_list))) {
+		if (bio->bi_iter.bi_sector != req_back_sector ||
+		    !blk_rq_merge_ok(req, bio))
+			break;
+
+		WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+			     !bio->__bi_nr_segments);
+
+		bio_list_pop(&zwplug->bio_list);
+		if (bio_attempt_back_merge(req, bio, bio->__bi_nr_segments) !=
+		    BIO_MERGE_OK) {
+			bio_list_add_head(&zwplug->bio_list, bio);
+			break;
+		}
+
+		/*
+		 * Drop the extra reference on the queue usage we got when
+		 * plugging the BIO and advance the write pointer offset.
+		 */
+		blk_queue_exit(q);
+		zwplug->wp_offset += bio_sectors(bio);
+
+		req_back_sector += bio_sectors(bio);
+	}
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+}
+
+static inline void disk_zone_wplug_set_error(struct gendisk *disk,
+					     struct blk_zone_wplug *zwplug)
+{
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR)) {
+		unsigned long flags;
+
+		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
+
+		spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+		list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
+		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+	}
+}
+
+/*
+ * Check and prepare a BIO for submission by incrementing the write pointer
+ * offset of its zone write plug.
+ */
+static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
+				       struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+	/*
+	 * Check that the user is not attempting to write to a full zone.
+	 * We know such a BIO will fail, and that would potentially overflow our
+	 * write pointer offset beyond the end of the zone.
+	 */
+	if (zwplug->wp_offset >= disk->zone_capacity)
+		goto err;
+
+	/*
+	 * Check for non-sequential writes early because we avoid a
+	 * whole lot of error handling trouble if we don't send it off
+	 * to the driver.
+	 */
+	if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
+		goto err;
+
+	/* Advance the zone write pointer offset. */
+	zwplug->wp_offset += bio_sectors(bio);
+
+	return true;
+
+err:
+	/* We detected an invalid write BIO: schedule error recovery. */
+	disk_zone_wplug_set_error(disk, zwplug);
+	kblockd_schedule_work(&disk->zone_wplugs_work);
+	return false;
+}
+
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+{
+	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
+
+	/*
+	 * BIOs must be fully contained within a zone so that we use the correct
+	 * zone write plug for the entire BIO. For blk-mq devices, the block
+	 * layer should already have done any splitting required to ensure this,
+	 * so the BIO should not be straddling zone boundaries. For
+	 * BIO-based devices, it is the responsibility of the driver to split
+	 * the bio before submitting it.
+	 */
+	if (WARN_ON_ONCE(bio_straddles_zones(bio))) {
+		bio_io_error(bio);
+		return true;
+	}
+
+	/* Conventional zones do not need write plugging. */
+	if (bio_zone_is_conv(bio))
+		return false;
+
+	zwplug = bio_get_and_lock_zone_wplug(bio, &flags);
+	if (!zwplug) {
+		bio_io_error(bio);
+		return true;
+	}
+
+	/* Indicate that this BIO is being handled using zone write plugging. */
+	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+
+	/*
+	 * If the zone is already plugged or has a pending error, add the BIO
+	 * to the plug BIO list. Otherwise, plug and let the BIO execute.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+		goto plug;
+
+	/*
+	 * If an error is detected when preparing the BIO, add it to the BIO
+	 * list so that error recovery can deal with it.
+	 */
+	if (!blk_zone_wplug_prepare_bio(zwplug, bio))
+		goto plug;
+
+	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+
+	return false;
+
+plug:
+	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+	blk_zone_wplug_add_bio(zwplug, bio, nr_segs);
+
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+
+	return true;
+}
+
+/**
+ * blk_zone_write_plug_bio - Handle a zone write BIO with zone write plugging
+ * @bio: The BIO being submitted
+ * @nr_segs: The number of physical segments of @bio
+ *
+ * Handle write and write zeroes operations using zone write plugging.
+ * Return true whenever @bio execution needs to be delayed through the zone
+ * write plug. Otherwise, return false to let the submission path process
+ * @bio normally.
+ */
+bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
+{
+	struct block_device *bdev = bio->bi_bdev;
+
+	if (!bdev->bd_disk->zone_wplugs_hash)
+		return false;
+
+	/*
+	 * If the BIO already has the plugging flag set, then it was already
+	 * handled through this path and this is a submission from the zone
+	 * plug bio submit work.
+	 */
+	if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
+		return false;
+
+	/*
+	 * We do not need to do anything special for empty flush BIOs, e.g.
+	 * BIOs such as those issued by blkdev_issue_flush(). This is because
+	 * it is the responsibility of the user to first wait for the completion of
+	 * write operations for flush to have any effect on the persistence of
+	 * the written data.
+	 */
+	if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
+		return false;
+
+	/*
+	 * Regular writes and write zeroes need to be handled through the target
+	 * zone write plug. This includes writes with REQ_FUA | REQ_PREFLUSH
+	 * which may need to go through the flush machinery depending on the
+	 * target device capabilities. Plugging such writes is fine as the flush
+	 * machinery operates at the request level, below the plug, and
+	 * completion of the flush sequence will go through the regular BIO
+	 * completion, which will handle zone write plugging.
+	 * Zone reset, reset all and finish commands need special treatment
+	 * to correctly track the write pointer offset of zones. These commands
+	 * are not plugged as we do not need serialization with write
+	 * operations. It is the responsibility of the user to not issue reset
+	 * and finish commands when write operations are in flight.
+	 */
+	switch (bio_op(bio)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+		return blk_zone_wplug_handle_write(bio, nr_segs);
+	case REQ_OP_ZONE_RESET:
+		return blk_zone_wplug_handle_reset_or_finish(bio, 0);
+	case REQ_OP_ZONE_FINISH:
+		return blk_zone_wplug_handle_reset_or_finish(bio,
+						bdev_zone_sectors(bdev));
+	case REQ_OP_ZONE_RESET_ALL:
+		return blk_zone_wplug_handle_reset_all(bio);
+	default:
+		return false;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(blk_zone_write_plug_bio);
+
+static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
+				       struct blk_zone_wplug *zwplug)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&zwplug->lock, flags);
+
+	/*
+	 * If we had an error, schedule error recovery. The recovery work
+	 * will restart submission of plugged BIOs.
+	 */
+	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
+		spin_unlock_irqrestore(&zwplug->lock, flags);
+		kblockd_schedule_work(&disk->zone_wplugs_work);
+		return;
+	}
+
+	/* Schedule submission of the next plugged BIO if we have one. */
+	if (!bio_list_empty(&zwplug->bio_list)) {
+		spin_unlock_irqrestore(&zwplug->lock, flags);
+		kblockd_schedule_work(&zwplug->bio_work);
+		return;
+	}
+
+	zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+
+	/*
+	 * If the zone is full (it was fully written or finished) or empty
+	 * (it was reset), remove its zone write plug from the hash table.
+	 */
+	if (disk_should_remove_zone_wplug(disk, zwplug))
+		disk_remove_zone_wplug(disk, zwplug);
+
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+}
+
+void blk_zone_write_plug_bio_endio(struct bio *bio)
+{
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+	struct blk_zone_wplug *zwplug = bio_lookup_zone_wplug(bio);
+	unsigned long flags;
+
+	/* Make sure we do not see this BIO again by clearing the plug flag. */
+	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
+
+	/*
+	 * If the BIO failed, mark the plug as having an error to trigger
+	 * recovery.
+	 */
+	if (bio->bi_status != BLK_STS_OK) {
+		spin_lock_irqsave(&zwplug->lock, flags);
+		disk_zone_wplug_set_error(disk, zwplug);
+		spin_unlock_irqrestore(&zwplug->lock, flags);
+	}
+
+	/*
+	 * For BIO-based devices, blk_zone_write_plug_complete_request()
+	 * is not called. So we need to schedule execution of the next
+	 * plugged BIO here.
+	 */
+	if (bio->bi_bdev->bd_has_submit_bio)
+		disk_zone_wplug_unplug_bio(disk, zwplug);
+
+	disk_put_zone_wplug(zwplug);
+}
+
+void blk_zone_write_plug_complete_request(struct request *req)
+{
+	struct gendisk *disk = req->q->disk;
+	struct blk_zone_wplug *zwplug =
+		disk_lookup_zone_wplug(disk, req->__sector);
+
+	req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;
+
+	disk_zone_wplug_unplug_bio(disk, zwplug);
+
+	disk_put_zone_wplug(zwplug);
+}
+
+static void blk_zone_wplug_bio_work(struct work_struct *work)
+{
+	struct blk_zone_wplug *zwplug =
+		container_of(work, struct blk_zone_wplug, bio_work);
+	struct block_device *bdev;
+	unsigned long flags;
+	struct bio *bio;
+
+	/*
+	 * Submit the next plugged BIO. If we do not have any, clear
+	 * the plugged flag.
+	 */
+	spin_lock_irqsave(&zwplug->lock, flags);
+
+	bio = bio_list_pop(&zwplug->bio_list);
+	if (!bio) {
+		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+		spin_unlock_irqrestore(&zwplug->lock, flags);
+		return;
+	}
+
+	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
+		/* Error recovery will decide what to do with the BIO. */
+		bio_list_add_head(&zwplug->bio_list, bio);
+		spin_unlock_irqrestore(&zwplug->lock, flags);
+		return;
+	}
+
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+
+	bdev = bio->bi_bdev;
+	submit_bio_noacct_nocheck(bio);
+
+	/*
+	 * blk-mq devices will reuse the extra reference on the request queue
+	 * usage counter we took when the BIO was plugged, but the submission
+	 * path for BIO-based devices will not do that. So drop this extra
+	 * reference here.
+	 */
+	if (bdev->bd_has_submit_bio)
+		blk_queue_exit(bdev->bd_disk->queue);
+}
+
+static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
+{
+	switch (zone->cond) {
+	case BLK_ZONE_COND_IMP_OPEN:
+	case BLK_ZONE_COND_EXP_OPEN:
+	case BLK_ZONE_COND_CLOSED:
+		return zone->wp - zone->start;
+	case BLK_ZONE_COND_FULL:
+		return zone->len;
+	case BLK_ZONE_COND_EMPTY:
+		return 0;
+	case BLK_ZONE_COND_NOT_WP:
+	case BLK_ZONE_COND_OFFLINE:
+	case BLK_ZONE_COND_READONLY:
+	default:
+		/*
+		 * Conventional, offline and read-only zones do not have a valid
+		 * write pointer.
+		 */
+		return UINT_MAX;
+	}
+}
+
+static int blk_zone_wplug_report_zone_cb(struct blk_zone *zone,
+					 unsigned int idx, void *data)
+{
+	struct blk_zone *zonep = data;
+
+	*zonep = *zone;
+	return 0;
+}
+
+static void disk_zone_wplug_handle_error(struct gendisk *disk,
+					 struct blk_zone_wplug *zwplug)
+{
+	sector_t zone_start_sector =
+		bdev_zone_sectors(disk->part0) * zwplug->zone_no;
+	unsigned int noio_flag;
+	struct blk_zone zone;
+	unsigned long flags;
+	int ret;
+
+	/* Get the current zone information from the device. */
+	noio_flag = memalloc_noio_save();
+	ret = disk->fops->report_zones(disk, zone_start_sector, 1,
+				       blk_zone_wplug_report_zone_cb, &zone);
+	memalloc_noio_restore(noio_flag);
+
+	spin_lock_irqsave(&zwplug->lock, flags);
+
+	/*
+	 * A zone reset or finish may have cleared the error already. In that
+	 * case, do nothing, as the zone report may have seen the "old" write
+	 * pointer value before the reset/finish operation completed.
+	 */
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
+		goto unlock;
+
+	zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
+
+	if (ret != 1) {
+		/*
+		 * We failed to get the zone information, meaning that something
+		 * is likely really wrong with the device. Abort all remaining
+		 * plugged BIOs as otherwise we could end up waiting forever on
+		 * plugged BIOs to complete if there is an ongoing queue freeze.
+		 */
+		disk_zone_wplug_abort(zwplug);
+		goto unplug;
+	}
+
+	/* Update the zone write pointer offset. */
+	zwplug->wp_offset = blk_zone_wp_offset(&zone);
+	disk_zone_wplug_abort_unaligned(disk, zwplug);
+
+	/* Restart BIO submission if we still have any BIO left. */
+	if (!bio_list_empty(&zwplug->bio_list)) {
+		WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
+		kblockd_schedule_work(&zwplug->bio_work);
+		goto unlock;
+	}
+
+unplug:
+	zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+	if (disk_should_remove_zone_wplug(disk, zwplug))
+		disk_remove_zone_wplug(disk, zwplug);
+
+unlock:
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+}
+
+static void disk_zone_wplugs_work(struct work_struct *work)
+{
+	struct gendisk *disk =
+		container_of(work, struct gendisk, zone_wplugs_work);
+	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
+
+	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+
+	while (!list_empty(&disk->zone_wplugs_err_list)) {
+		zwplug = list_first_entry(&disk->zone_wplugs_err_list,
+					  struct blk_zone_wplug, link);
+		list_del_init(&zwplug->link);
+		blk_get_zone_wplug(zwplug);
+		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+
+		disk_zone_wplug_handle_error(disk, zwplug);
+		disk_put_zone_wplug(zwplug);
+
+		spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+	}
+
+	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+}
+
+static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
+{
+	return 1U << disk->zone_wplugs_hash_bits;
+}
+
+static void disk_free_zone_wplugs(struct gendisk *disk)
+{
+	struct blk_zone_wplug *zwplug;
+	unsigned int i;
+
+	if (!disk->zone_wplugs_hash)
+		return;
+
+	/* Free all the zone write plugs we have. */
+	for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++) {
+		while (!hlist_empty(&disk->zone_wplugs_hash[i])) {
+			zwplug = hlist_entry(disk->zone_wplugs_hash[i].first,
+					     struct blk_zone_wplug, node);
+			blk_get_zone_wplug(zwplug);
+			disk_remove_zone_wplug(disk, zwplug);
+			disk_put_zone_wplug(zwplug);
+		}
+	}
+
+	/* Wait for the zone write plugs to be RCU-freed. */
+	rcu_barrier();
+}
+
+void disk_init_zone_resources(struct gendisk *disk)
+{
+	spin_lock_init(&disk->zone_wplugs_lock);
+	INIT_LIST_HEAD(&disk->zone_wplugs_err_list);
+	INIT_WORK(&disk->zone_wplugs_work, disk_zone_wplugs_work);
+}
+
+/*
+ * For the size of a disk zone write plug hash table, use the disk maximum
+ * open zones and maximum active zones limits, but do not exceed 4KB (512 hlist
+ * head entries), that is, 9 bits. For a disk that has no limits, default to
+ * 128 zones for the number of zone write plugs to hash.
+ */
+#define BLK_ZONE_MAX_WPLUG_HASH_BITS		9
+#define BLK_ZONE_DEFAULT_MAX_NR_WPLUGS		128
+
+static int disk_alloc_zone_resources(struct gendisk *disk,
+				     unsigned int max_nr_zwplugs)
+{
+	unsigned int i;
+
+	disk->zone_wplugs_hash_bits =
+		min(ilog2(max_nr_zwplugs) + 1, BLK_ZONE_MAX_WPLUG_HASH_BITS);
+
+	disk->zone_wplugs_hash =
+		kcalloc(disk_zone_wplugs_hash_size(disk),
+			sizeof(struct hlist_head), GFP_KERNEL);
+	if (!disk->zone_wplugs_hash)
+		return -ENOMEM;
+
+	for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++)
+		INIT_HLIST_HEAD(&disk->zone_wplugs_hash[i]);
+
+	return 0;
+}
+
+void disk_free_zone_resources(struct gendisk *disk)
 {
+	cancel_work_sync(&disk->zone_wplugs_work);
+
+	disk_free_zone_wplugs(disk);
+
+	kfree(disk->zone_wplugs_hash);
+	disk->zone_wplugs_hash = NULL;
+	disk->zone_wplugs_hash_bits = 0;
+
 	kfree(disk->conv_zones_bitmap);
 	disk->conv_zones_bitmap = NULL;
 	kfree(disk->seq_zones_wlock);
 	disk->seq_zones_wlock = NULL;
+
+	disk->zone_capacity = 0;
+	disk->nr_zones = 0;
+}
+
+static int disk_revalidate_zone_resources(struct gendisk *disk,
+					  unsigned int nr_zones)
+{
+	struct queue_limits *lim = &disk->queue->limits;
+	unsigned int max_nr_zwplugs;
+
+	/*
+	 * If the device has no limit on the maximum number of open and active
+	 * zones, use BLK_ZONE_DEFAULT_MAX_NR_WPLUGS for the maximum number
+	 * of zone write plugs to hash.
+	 */
+	max_nr_zwplugs = max(lim->max_open_zones, lim->max_active_zones);
+	if (!max_nr_zwplugs)
+		max_nr_zwplugs =
+			min(BLK_ZONE_DEFAULT_MAX_NR_WPLUGS, nr_zones);
+
+	if (!disk->zone_wplugs_hash)
+		return disk_alloc_zone_resources(disk, max_nr_zwplugs);
+
+	return 0;
 }
 
 struct blk_revalidate_zone_args {
@@ -453,6 +1521,9 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	struct request_queue *q = disk->queue;
 	sector_t capacity = get_capacity(disk);
 	sector_t zone_sectors = q->limits.chunk_sectors;
+	struct blk_zone_wplug *zwplug;
+	unsigned long flags;
+	unsigned int wp_offset;
 
 	/* Check for bad zones and holes in the zone report */
 	if (zone->start != args->sector) {
@@ -524,6 +1595,22 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 				disk->disk_name);
 			return -ENODEV;
 		}
+
+		/*
+		 * We need to track the write pointer of all zones that are
+		 * neither empty nor full. So make sure we have a zone write
+		 * plug for such zones.
+		 */
+		wp_offset = blk_zone_wp_offset(zone);
+		if (wp_offset && wp_offset < zone_sectors) {
+			zwplug = disk_get_and_lock_zone_wplug(disk, zone->start,
+							      GFP_NOIO, &flags);
+			if (!zwplug)
+				return -ENOMEM;
+			spin_unlock_irqrestore(&zwplug->lock, flags);
+			disk_put_zone_wplug(zwplug);
+		}
+
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
 	default:
@@ -560,7 +1647,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	sector_t capacity = get_capacity(disk);
 	struct blk_revalidate_zone_args args = { };
 	unsigned int noio_flag;
-	int ret;
+	int ret = -ENOMEM;
 
 	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
 		return -EIO;
@@ -593,6 +1680,11 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	args.disk = disk;
 	args.nr_zones = (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
 	noio_flag = memalloc_noio_save();
+	ret = disk_revalidate_zone_resources(disk, args.nr_zones);
+	if (ret) {
+		memalloc_noio_restore(noio_flag);
+		return ret;
+	}
 	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
 				       blk_revalidate_zone_cb, &args);
 	if (!ret) {
@@ -627,7 +1719,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
-		disk_free_zone_bitmaps(disk);
+		disk_free_zone_resources(disk);
 	}
 	blk_mq_unfreeze_queue(q);
 
diff --git a/block/blk.h b/block/blk.h
index f2a521b72f9d..7c7059655c97 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -416,7 +416,14 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
 }
 
 #ifdef CONFIG_BLK_DEV_ZONED
-void disk_free_zone_bitmaps(struct gendisk *disk);
+void disk_init_zone_resources(struct gendisk *disk);
+void disk_free_zone_resources(struct gendisk *disk);
+static inline bool bio_zone_write_plugging(struct bio *bio)
+{
+	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
+}
+void blk_zone_write_plug_bio_merged(struct bio *bio);
+void blk_zone_write_plug_attempt_merge(struct request *rq);
 static inline void blk_zone_update_request_bio(struct request *rq,
 					       struct bio *bio)
 {
@@ -424,22 +431,45 @@ static inline void blk_zone_update_request_bio(struct request *rq,
 	 * For zone append requests, the request sector indicates the location
 	 * at which the BIO data was written. Return this value to the BIO
 	 * issuer through the BIO iter sector.
+	 * For plugged zone writes, we need the original BIO sector so
+	 * that blk_zone_write_plug_bio_endio() can look up the zone write plug.
 	 */
-	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
 		bio->bi_iter.bi_sector = rq->__sector;
 }
+void blk_zone_write_plug_bio_endio(struct bio *bio);
+void blk_zone_write_plug_complete_request(struct request *rq);
 int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 		unsigned long arg);
 int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		unsigned int cmd, unsigned long arg);
 #else /* CONFIG_BLK_DEV_ZONED */
-static inline void disk_free_zone_bitmaps(struct gendisk *disk)
+static inline void disk_init_zone_resources(struct gendisk *disk)
+{
+}
+static inline void disk_free_zone_resources(struct gendisk *disk)
+{
+}
+static inline bool bio_zone_write_plugging(struct bio *bio)
+{
+	return false;
+}
+static inline void blk_zone_write_plug_bio_merged(struct bio *bio)
+{
+}
+static inline void blk_zone_write_plug_attempt_merge(struct request *rq)
 {
 }
 static inline void blk_zone_update_request_bio(struct request *rq,
 					       struct bio *bio)
 {
 }
+static inline void blk_zone_write_plug_bio_endio(struct bio *bio)
+{
+}
+static inline void blk_zone_write_plug_complete_request(struct request *rq)
+{
+}
 static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
 		unsigned int cmd, unsigned long arg)
 {
diff --git a/block/genhd.c b/block/genhd.c
index bb29a68e1d67..eb893df56d51 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1182,7 +1182,7 @@ static void disk_release(struct device *dev)
 
 	disk_release_events(disk);
 	kfree(disk->random);
-	disk_free_zone_bitmaps(disk);
+	disk_free_zone_resources(disk);
 	xa_destroy(&disk->part_tbl);
 
 	disk->queue->disk = NULL;
@@ -1364,6 +1364,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	if (blkcg_init_disk(disk))
 		goto out_erase_part0;
 
+	disk_init_zone_resources(disk);
 	rand_initialize_disk(disk);
 	disk_to_dev(disk)->class = &block_class;
 	disk_to_dev(disk)->type = &disk_type;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d3d8fd8e229b..60090c8366fb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -56,6 +56,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
+/* The request completion needs to be signaled to zone write plugging. */
+#define RQF_ZONE_WRITE_PLUGGING	((__force req_flags_t)(1 << 20))
 /* ->timeout has been called, don't expire again */
 #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
 #define RQF_RESV		((__force req_flags_t)(1 << 23))
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cb1526ec44b5..ed45de07d2ef 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -234,7 +234,12 @@ struct bio {
 
 	struct bvec_iter	bi_iter;
 
-	blk_qc_t		bi_cookie;
+	union {
+		/* for polled bios: */
+		blk_qc_t		bi_cookie;
+		/* for plugged zoned writes only: */
+		unsigned int		__bi_nr_segments;
+	};
 	bio_end_io_t		*bi_end_io;
 	void			*bi_private;
 #ifdef CONFIG_BLK_CGROUP
@@ -305,6 +310,7 @@ enum {
 	BIO_QOS_MERGED,		/* but went through rq_qos merge path */
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
+	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
 	BIO_FLAG_LAST
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4e81f714cca7..6faa1abe8506 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -194,6 +194,11 @@ struct gendisk {
 	unsigned int		zone_capacity;
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
+	unsigned int            zone_wplugs_hash_bits;
+	spinlock_t              zone_wplugs_lock;
+	struct hlist_head       *zone_wplugs_hash;
+	struct list_head        zone_wplugs_err_list;
+	struct work_struct	zone_wplugs_work;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 #if IS_ENABLED(CONFIG_CDROM)
@@ -663,6 +668,7 @@ static inline unsigned int bdev_max_active_zones(struct block_device *bdev)
 	return bdev->bd_disk->queue->limits.max_active_zones;
 }
 
+bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline unsigned int bdev_nr_zones(struct block_device *bdev)
 {
@@ -690,6 +696,11 @@ static inline unsigned int bdev_max_active_zones(struct block_device *bdev)
 {
 	return 0;
 }
+static inline bool blk_zone_write_plug_bio(struct bio *bio,
+					   unsigned int nr_segs)
+{
+	return false;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline unsigned int blk_queue_depth(struct request_queue *q)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (7 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 08/30] block: Introduce zone write plugging Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:30   ` Christoph Hellwig
  2024-03-28 22:29   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 10/30] block: Fake max open zones limit when there is no limit Damien Le Moal
                   ` (21 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Allocating zone write plugs using kmalloc() does not guarantee that
enough write plugs can be allocated to simultaneously write up to
the maximum number of active zones or maximum number of open zones of
a zoned block device.

Avoid any issue with memory allocation by pre-allocating zone write
plugs up to the disk maximum number of open zones or maximum number of
active zones, whichever is larger. For zoned devices that do not have
open or active zone limits, the default 128 is used as the number of
write plugs to pre-allocate.

Pre-allocated zone write plugs are managed using a free list. If a
change to the device zone limits is detected, the disk free list is
grown if needed when blk_revalidate_disk_zones() is executed.
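
To summarize, the allocation life cycle implemented by the diff below is
as follows (a condensed sketch, not the verbatim code: pop_from_free_list()
stands in for the list_first_entry_or_null()/list_del_init() sequence done
under zone_wplugs_lock, and error handling is omitted):

	/* Allocation: prefer the free list, fall back to kmalloc(). */
	zwplug = pop_from_free_list(&disk->zone_wplugs_free_list);
	if (!zwplug) {
		zwplug = kmalloc(sizeof(*zwplug), gfp_mask);
		zwp_flags = BLK_ZONE_WPLUG_NEEDS_FREE;
	}

	/* Freeing: kfree() dynamic plugs, recycle pre-allocated ones. */
	if (zwplug->flags & BLK_ZONE_WPLUG_NEEDS_FREE)
		kfree(zwplug);
	else
		list_add_tail(&zwplug->link, &disk->zone_wplugs_free_list);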

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c      | 124 ++++++++++++++++++++++++++++++++++++-----
 include/linux/blkdev.h |   2 +
 2 files changed, 113 insertions(+), 13 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 03083522df84..3084dae5408e 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -39,7 +39,8 @@ static const char *const zone_cond_name[] = {
 /*
  * Per-zone write plug.
  * @node: hlist_node structure for managing the plug using a hash table.
- * @link: To list the plug in the zone write plug error list of the disk.
+ * @link: To list the plug in the zone write plug free list or error list of
+ *        the disk.
  * @ref: Zone write plug reference counter. A zone write plug reference is
  *       always at least 1 when the plug is hashed in the disk plug hash table.
  *       The reference is incremented whenever a new BIO needing plugging is
@@ -57,6 +58,7 @@ static const char *const zone_cond_name[] = {
  * @bio_list: The list of BIOs that are currently plugged.
  * @bio_work: Work struct to handle issuing of plugged BIOs
  * @rcu_head: RCU head to free zone write plugs with an RCU grace period.
+ * @disk: The gendisk the plug belongs to.
  */
 struct blk_zone_wplug {
 	struct hlist_node	node;
@@ -69,6 +71,7 @@ struct blk_zone_wplug {
 	struct bio_list		bio_list;
 	struct work_struct	bio_work;
 	struct rcu_head		rcu_head;
+	struct gendisk		*disk;
 };
 
 /*
@@ -85,10 +88,14 @@ struct blk_zone_wplug {
  *    to prevent new references to the zone write plug to be taken for
  *    newly incoming BIOs. A zone write plug flagged with this flag will be
  *    freed once all remaining references from BIOs or functions are dropped.
+ *  - BLK_ZONE_WPLUG_NEEDS_FREE: Indicates that the zone write plug was
+ *    dynamically allocated and needs to be freed instead of returned to the
+ *    free list of zone write plugs of the disk.
  */
 #define BLK_ZONE_WPLUG_PLUGGED		(1U << 0)
 #define BLK_ZONE_WPLUG_ERROR		(1U << 1)
 #define BLK_ZONE_WPLUG_UNHASHED		(1U << 2)
+#define BLK_ZONE_WPLUG_NEEDS_FREE	(1U << 3)
 
 #define BLK_ZONE_WPLUG_BUSY	(BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR)
 
@@ -519,23 +526,51 @@ static void disk_init_zone_wplug(struct gendisk *disk,
 	zwplug->wp_offset = sector & (disk->queue->limits.chunk_sectors - 1);
 	bio_list_init(&zwplug->bio_list);
 	INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
+	zwplug->disk = disk;
 }
 
 static struct blk_zone_wplug *disk_alloc_zone_wplug(struct gendisk *disk,
 						sector_t sector, gfp_t gfp_mask)
 {
-	struct blk_zone_wplug *zwplug;
+	struct blk_zone_wplug *zwplug = NULL;
+	unsigned int zwp_flags = 0;
+	unsigned long flags;
 
-	/* Allocate a new zone write plug. */
-	zwplug = kmalloc(sizeof(struct blk_zone_wplug), gfp_mask);
-	if (!zwplug)
-		return NULL;
+	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+	zwplug = list_first_entry_or_null(&disk->zone_wplugs_free_list,
+					  struct blk_zone_wplug, link);
+	if (zwplug)
+		list_del_init(&zwplug->link);
+	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
 
-	disk_init_zone_wplug(disk, zwplug, 0, sector);
+	if (!zwplug) {
+		/* Allocate a new zone write plug. */
+		zwplug = kmalloc(sizeof(struct blk_zone_wplug), gfp_mask);
+		if (!zwplug)
+			return NULL;
+		zwp_flags = BLK_ZONE_WPLUG_NEEDS_FREE;
+	}
+
+	disk_init_zone_wplug(disk, zwplug, zwp_flags, sector);
 
 	return zwplug;
 }
 
+static void disk_free_zone_wplug(struct blk_zone_wplug *zwplug)
+{
+	struct gendisk *disk = zwplug->disk;
+	unsigned long flags;
+
+	if (zwplug->flags & BLK_ZONE_WPLUG_NEEDS_FREE) {
+		kfree(zwplug);
+		return;
+	}
+
+	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+	list_add_tail(&zwplug->link, &disk->zone_wplugs_free_list);
+	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+}
+
 static bool disk_insert_zone_wplug(struct gendisk *disk,
 				   struct blk_zone_wplug *zwplug)
 {
@@ -630,18 +665,24 @@ static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
 	return zwplug;
 }
 
+static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
+{
+	struct blk_zone_wplug *zwplug =
+		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
+
+	disk_free_zone_wplug(zwplug);
+}
+
 static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
 {
 	if (atomic_dec_and_test(&zwplug->ref)) {
 		WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
 		WARN_ON_ONCE(!list_empty(&zwplug->link));
 
-		kfree_rcu(zwplug, rcu_head);
+		call_rcu(&zwplug->rcu_head, disk_free_zone_wplug_rcu);
 	}
 }
 
-static void blk_zone_wplug_bio_work(struct work_struct *work);
-
 /*
  * Get a reference on the write plug for the zone containing @sector.
  * If the plug does not exist, it is allocated and hashed.
@@ -684,7 +725,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
 	 */
 	if (!disk_insert_zone_wplug(disk, zwplug)) {
 		spin_unlock_irqrestore(&zwplug->lock, *flags);
-		kfree(zwplug);
+		disk_free_zone_wplug(zwplug);
 		goto again;
 	}
 
@@ -1401,6 +1442,30 @@ static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
 	return 1U << disk->zone_wplugs_hash_bits;
 }
 
+static int disk_alloc_zone_wplugs(struct gendisk *disk,
+				  unsigned int max_nr_zwplugs)
+{
+	struct blk_zone_wplug *zwplug;
+	unsigned int i;
+
+	if (!disk->zone_wplugs_hash)
+		return 0;
+
+	/* Pre-allocate zone write plugs */
+	for (i = 0; i < max_nr_zwplugs; i++) {
+		zwplug = kmalloc(sizeof(struct blk_zone_wplug), GFP_KERNEL);
+		if (!zwplug)
+			return -ENOMEM;
+		disk_init_zone_wplug(disk, zwplug, 0, 0);
+
+		list_add_tail(&zwplug->link, &disk->zone_wplugs_free_list);
+	}
+
+	disk->zone_wplugs_max_nr += max_nr_zwplugs;
+
+	return 0;
+}
+
 static void disk_free_zone_wplugs(struct gendisk *disk)
 {
 	struct blk_zone_wplug *zwplug;
@@ -1422,11 +1487,22 @@ static void disk_free_zone_wplugs(struct gendisk *disk)
 
 	/* Wait for the zone write plugs to be RCU-freed. */
 	rcu_barrier();
+
+	while (!list_empty(&disk->zone_wplugs_free_list)) {
+		zwplug = list_first_entry(&disk->zone_wplugs_free_list,
+					  struct blk_zone_wplug, link);
+		list_del_init(&zwplug->link);
+
+		kfree(zwplug);
+	}
+
+	disk->zone_wplugs_max_nr = 0;
 }
 
 void disk_init_zone_resources(struct gendisk *disk)
 {
 	spin_lock_init(&disk->zone_wplugs_lock);
+	INIT_LIST_HEAD(&disk->zone_wplugs_free_list);
 	INIT_LIST_HEAD(&disk->zone_wplugs_err_list);
 	INIT_WORK(&disk->zone_wplugs_work, disk_zone_wplugs_work);
 }
@@ -1444,6 +1520,7 @@ static int disk_alloc_zone_resources(struct gendisk *disk,
 				     unsigned int max_nr_zwplugs)
 {
 	unsigned int i;
+	int ret;
 
 	disk->zone_wplugs_hash_bits =
 		min(ilog2(max_nr_zwplugs) + 1, BLK_ZONE_MAX_WPLUG_HASH_BITS);
@@ -1457,6 +1534,15 @@ static int disk_alloc_zone_resources(struct gendisk *disk,
 	for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++)
 		INIT_HLIST_HEAD(&disk->zone_wplugs_hash[i]);
 
+	ret = disk_alloc_zone_wplugs(disk, max_nr_zwplugs);
+	if (ret) {
+		disk_free_zone_wplugs(disk);
+		kfree(disk->zone_wplugs_hash);
+		disk->zone_wplugs_hash = NULL;
+		disk->zone_wplugs_hash_bits = 0;
+		return ret;
+	}
+
 	return 0;
 }
 
@@ -1484,6 +1570,7 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
 {
 	struct queue_limits *lim = &disk->queue->limits;
 	unsigned int max_nr_zwplugs;
+	int ret;
 
 	/*
 	 * If the device has no limit on the maximum number of open and active
@@ -1495,8 +1582,19 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
 		max_nr_zwplugs =
 			min(BLK_ZONE_DEFAULT_MAX_NR_WPLUGS, nr_zones);
 
-	if (!disk->zone_wplugs_hash)
-		return disk_alloc_zone_resources(disk, max_nr_zwplugs);
+	if (!disk->zone_wplugs_hash) {
+		ret = disk_alloc_zone_resources(disk, max_nr_zwplugs);
+		if (ret)
+			return ret;
+	}
+
+	/* Grow the free list of zone write plugs if needed. */
+	if (disk->zone_wplugs_max_nr < max_nr_zwplugs) {
+		ret = disk_alloc_zone_wplugs(disk,
+				max_nr_zwplugs - disk->zone_wplugs_max_nr);
+		if (ret)
+			return ret;
+	}
 
 	return 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6faa1abe8506..962ee0496659 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -194,9 +194,11 @@ struct gendisk {
 	unsigned int		zone_capacity;
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
+	unsigned int		zone_wplugs_max_nr;
 	unsigned int            zone_wplugs_hash_bits;
 	spinlock_t              zone_wplugs_lock;
 	struct hlist_head       *zone_wplugs_hash;
+	struct list_head        zone_wplugs_free_list;
 	struct list_head        zone_wplugs_err_list;
 	struct work_struct	zone_wplugs_work;
 #endif /* CONFIG_BLK_DEV_ZONED */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 10/30] block: Fake max open zones limit when there is no limit
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (8 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 09/30] block: Pre-allocate zone write plugs Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:49   ` Christoph Hellwig
  2024-03-29 20:37   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 11/30] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
                   ` (20 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

For a zoned block device that has no limit on the number of open zones
and no limit on the number of active zones, the zone write plug free
list is initialized with 128 zone write plugs. In that case, set the
device max_open_zones queue limit to this value to indicate to the user
the potential performance penalty that may occur when simultaneously
writing to more zones than the free list size.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 3084dae5408e..8ad5d271d3f8 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1570,17 +1570,24 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
 {
 	struct queue_limits *lim = &disk->queue->limits;
 	unsigned int max_nr_zwplugs;
+	bool set_max_open = false;
 	int ret;
 
 	/*
 	 * If the device has no limit on the maximum number of open and active
 	 * zones, use BLK_ZONE_DEFAULT_MAX_NR_WPLUGS for the maximum number
-	 * of zone write plugs to hash.
+	 * of zone write plugs to hash and set the max_open_zones queue limit
+	 * of the device to indicate to the user the number of pre-allocated
+	 * zone write plugs so that the user is aware of the potential
+	 * performance penalty for simultaneously writing to more zones than
+	 * this limit.
 	 */
 	max_nr_zwplugs = max(lim->max_open_zones, lim->max_active_zones);
-	if (!max_nr_zwplugs)
+	if (!max_nr_zwplugs) {
 		max_nr_zwplugs =
 			min(BLK_ZONE_DEFAULT_MAX_NR_WPLUGS, nr_zones);
+		set_max_open = true;
+	}
 
 	if (!disk->zone_wplugs_hash) {
 		ret = disk_alloc_zone_resources(disk, max_nr_zwplugs);
@@ -1596,6 +1603,9 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
 			return ret;
 	}
 
+	if (set_max_open)
+		disk_set_max_open_zones(disk, max_nr_zwplugs);
+
 	return 0;
 }
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 11/30] block: Allow zero value of max_zone_append_sectors queue limit
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (9 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 10/30] block: Fake max open zones limit when there is no limit Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:49   ` Christoph Hellwig
  2024-03-29 20:50   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 12/30] block: Implement zone append emulation Damien Le Moal
                   ` (19 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

In preparation for adding a generic zone append emulation using zone
write plugging, allow device drivers supporting zoned block devices to
set the max_zone_append_sectors queue limit of a device to 0 to
indicate the lack of native support for zone append operations and that
the block layer should emulate these operations using regular write
operations.

blk_queue_max_zone_append_sectors() is modified to allow passing 0 as
the max_zone_append_sectors argument. The function
queue_max_zone_append_sectors() is also modified to ensure that the
minimum of the max_hw_sectors and chunk_sectors limit is used whenever
the max_zone_append_sectors limit is 0. This minimum is consistent with
the value set for the max_zone_append_sectors limit by the function
blk_validate_zoned_limits() when limits for a queue are validated.

The helper functions queue_emulates_zone_append() and
bdev_emulates_zone_append() are added to test if a queue (or block
device) emulates zone append operations.

In order for blk_revalidate_disk_zones() to accept zoned block devices
relying on zone append emulation, the direct check to the
max_zone_append_sectors queue limit of the disk is replaced by a check
using the value returned by queue_max_zone_append_sectors(). Similarly,
queue_zone_append_max_show() is modified to use the same accessor so
that the sysfs attribute advertises the non-zero limit that will be
used, regardless if it is for native or emulated commands.

For stacking drivers, a top device should not need to care if the
underlying devices have native or emulated zone append operations.
blk_stack_limits() is thus modified to set the top device
max_zone_append_sectors limit using the new accessor
queue_limits_max_zone_append_sectors(). queue_max_zone_append_sectors()
is modified to use this function as well. Stacking drivers that require
zone append emulation, e.g. dm-crypt, can still request this feature by
calling blk_queue_max_zone_append_sectors() with a 0 limit.
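
For instance, a driver without native zone append support would now do
something like the following (hypothetical driver code, shown only to
illustrate the new semantics; only functions introduced by this patch
are used):

	/* Request zone append emulation with a zero native limit. */
	blk_queue_max_zone_append_sectors(q, 0);

	/*
	 * queue_max_zone_append_sectors() then reports the effective
	 * emulated limit, min(chunk_sectors, max_hw_sectors), instead
	 * of the zero native limit.
	 */
	unsigned int max_append = queue_max_zone_append_sectors(q);

	WARN_ON_ONCE(queue_emulates_zone_append(q) && !max_append);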

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-core.c       |  2 +-
 block/blk-settings.c   | 30 +++++++++++++++++++-----------
 block/blk-sysfs.c      |  2 +-
 block/blk-zoned.c      |  2 +-
 include/linux/blkdev.h | 23 ++++++++++++++++++++---
 5 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a16b5abdbbf5..3bf28149e104 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -602,7 +602,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_IOERR;
 
 	/* Make sure the BIO is small enough and will not get split */
-	if (nr_sectors > q->limits.max_zone_append_sectors)
+	if (nr_sectors > queue_max_zone_append_sectors(q))
 		return BLK_STS_IOERR;
 
 	bio->bi_opf |= REQ_NOMERGE;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3c7d8d638ab5..82c61d2e4bb8 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -413,24 +413,32 @@ EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
  * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
  * @q:  the request queue for the device
  * @max_zone_append_sectors: maximum number of sectors to write per command
+ *
+ * Sets the maximum number of sectors allowed for zone append commands.
+ * Specifying 0 for @max_zone_append_sectors indicates that the queue does
+ * not natively support zone append operations and that the block layer must
+ * emulate these operations using regular writes.
  **/
 void blk_queue_max_zone_append_sectors(struct request_queue *q,
 		unsigned int max_zone_append_sectors)
 {
-	unsigned int max_sectors;
+	unsigned int max_sectors = 0;
 
 	if (WARN_ON(!blk_queue_is_zoned(q)))
 		return;
 
-	max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors);
-	max_sectors = min(q->limits.chunk_sectors, max_sectors);
+	if (max_zone_append_sectors) {
+		max_sectors = min(q->limits.max_hw_sectors,
+				  max_zone_append_sectors);
+		max_sectors = min(q->limits.chunk_sectors, max_sectors);
 
-	/*
-	 * Signal eventual driver bugs resulting in the max_zone_append sectors limit
-	 * being 0 due to a 0 argument, the chunk_sectors limit (zone size) not set,
-	 * or the max_hw_sectors limit not set.
-	 */
-	WARN_ON(!max_sectors);
+		/*
+		 * Signal eventual driver bugs resulting in the max_zone_append
+		 * sectors limit being 0 due to the chunk_sectors limit (zone
+		 * size) not set or the max_hw_sectors limit not set.
+		 */
+		WARN_ON_ONCE(!max_sectors);
+	}
 
 	q->limits.max_zone_append_sectors = max_sectors;
 }
@@ -757,8 +765,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
-	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
-					b->max_zone_append_sectors);
+	t->max_zone_append_sectors = min(queue_limits_max_zone_append_sectors(t),
+					 queue_limits_max_zone_append_sectors(b));
 	t->bounce = max(t->bounce, b->bounce);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 8c8f69d8ba48..e3ed5a921aff 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -224,7 +224,7 @@ static ssize_t queue_zone_write_granularity_show(struct request_queue *q,
 
 static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
 {
-	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+	unsigned long long max_sectors = queue_max_zone_append_sectors(q);
 
 	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
 }
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 8ad5d271d3f8..0615a73df26b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1775,7 +1775,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 		return -ENODEV;
 	}
 
-	if (!q->limits.max_zone_append_sectors) {
+	if (!queue_max_zone_append_sectors(q)) {
 		pr_warn("%s: Invalid 0 maximum zone append limit\n",
 			disk->disk_name);
 		return -ENODEV;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 962ee0496659..45def924f7c1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1175,12 +1175,29 @@ static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
-static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+static inline unsigned int queue_limits_max_zone_append_sectors(struct queue_limits *l)
 {
+	unsigned int max_sectors = min(l->chunk_sectors, l->max_hw_sectors);
 
-	const struct queue_limits *l = &q->limits;
+	return min_not_zero(l->max_zone_append_sectors, max_sectors);
+}
+
+static inline unsigned int queue_max_zone_append_sectors(struct request_queue *q)
+{
+	if (!blk_queue_is_zoned(q))
+		return 0;
 
-	return min(l->max_zone_append_sectors, l->max_sectors);
+	return queue_limits_max_zone_append_sectors(&q->limits);
+}
+
+static inline bool queue_emulates_zone_append(struct request_queue *q)
+{
+	return blk_queue_is_zoned(q) && !q->limits.max_zone_append_sectors;
+}
+
+static inline bool bdev_emulates_zone_append(struct block_device *bdev)
+{
+	return queue_emulates_zone_append(bdev_get_queue(bdev));
 }
 
 static inline unsigned int
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 12/30] block: Implement zone append emulation
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (10 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 11/30] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:50   ` Christoph Hellwig
                     ` (2 more replies)
  2024-03-28  0:43 ` [PATCH v3 13/30] block: Allow BIO-based drivers to use blk_revalidate_disk_zones() Damien Le Moal
                   ` (18 subsequent siblings)
  30 siblings, 3 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Given that zone write plugging manages all writes to zones of a zoned
block device and tracks the write pointer position of all zones,
emulating zone append operations using regular writes can be
implemented generically, without relying on the underlying device driver
to implement such emulation. This is needed for devices that do not
natively support the zone append command, e.g. SMR hard-disks.

A device may request zone append emulation by setting its
max_zone_append_sectors queue limit to 0. For such devices, the function
blk_zone_wplug_prepare_bio() changes zone append BIOs into
non-mergeable regular write BIOs. Modified zone append BIOs are flagged
with the new BIO flag BIO_EMULATES_ZONE_APPEND. This flag is checked
on completion of the BIO in blk_zone_write_plug_bio_endio() to restore
the original REQ_OP_ZONE_APPEND operation code of the BIO.

The block layer internal inline helper function bio_is_zone_append() is
added to test if a BIO is either a native zone append operation
(REQ_OP_ZONE_APPEND operation code) or a regular write flagged with
BIO_EMULATES_ZONE_APPEND. Given that native and emulated zone append
BIO completion handling should be similar, the functions
blk_update_request() and blk_zone_complete_request_bio() are modified to
use bio_is_zone_append() to execute blk_zone_update_request_bio() for
both native and emulated zone append operations.
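
To illustrate what BIO issuers see, here is a minimal sketch (not part
of this series; it only uses existing block layer APIs) of a synchronous
zone append submission. The calling convention and the returned write
location are identical whether the device executes a native zone append
or the block layer emulates it as described above:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Illustrative only: synchronously append one page to a zone. */
static int example_zone_append(struct block_device *bdev, struct page *page,
			       sector_t zone_start_sector, sector_t *written)
{
	struct bio_vec bvec;
	struct bio bio;
	int ret;

	bio_init(&bio, bdev, &bvec, 1, REQ_OP_ZONE_APPEND);
	/* Zone append BIOs target the zone start, not the write pointer. */
	bio.bi_iter.bi_sector = zone_start_sector;
	__bio_add_page(&bio, page, PAGE_SIZE, 0);

	ret = submit_bio_wait(&bio);
	if (!ret)
		/* bi_sector now holds the first sector actually written. */
		*written = bio.bi_iter.bi_sector;
	bio_uninit(&bio);

	return ret;
}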

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq.c            |  3 +-
 block/blk-zoned.c         | 69 +++++++++++++++++++++++++++++++--------
 block/blk.h               | 14 ++++++--
 include/linux/blk_types.h |  1 +
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4b8dd2e7b870..5299a7ed8fec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -907,8 +907,7 @@ bool blk_update_request(struct request *req, blk_status_t error,
 
 		if (bio_bytes == bio->bi_iter.bi_size) {
 			req->bio = bio->bi_next;
-		} else if (req_op(req) == REQ_OP_ZONE_APPEND &&
-			   error == BLK_STS_OK) {
+		} else if (bio_is_zone_append(bio) && error == BLK_STS_OK) {
 			/*
 			 * Partial zone append completions cannot be supported
 			 * as the BIO fragments may end up not being written
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 0615a73df26b..eec6a392ae31 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -783,7 +783,8 @@ static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
 
 	while ((bio = bio_list_pop(&zwplug->bio_list))) {
 		if (wp_offset >= zone_capacity ||
-		     bio_offset_from_zone_start(bio) != wp_offset) {
+		    (bio_op(bio) != REQ_OP_ZONE_APPEND &&
+		     bio_offset_from_zone_start(bio) != wp_offset)) {
 			blk_zone_wplug_bio_io_error(bio);
 			disk_put_zone_wplug(zwplug);
 			continue;
@@ -1036,7 +1037,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
 
 /*
  * Check and prepare a BIO for submission by incrementing the write pointer
- * offset of its zone write plug.
+ * offset of its zone write plug and changing zone append operations into
+ * regular writes when zone append emulation is needed.
  */
 static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 				       struct bio *bio)
@@ -1051,13 +1053,30 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 	if (zwplug->wp_offset >= disk->zone_capacity)
 		goto err;
 
-	/*
-	 * Check for non-sequential writes early because we avoid a
-	 * whole lot of error handling trouble if we don't send it off
-	 * to the driver.
-	 */
-	if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
-		goto err;
+	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+		/*
+		 * Use a regular write starting at the current write pointer.
+		 * Similarly to native zone append operations, do not allow
+		 * merging.
+		 */
+		bio->bi_opf &= ~REQ_OP_MASK;
+		bio->bi_opf |= REQ_OP_WRITE | REQ_NOMERGE;
+		bio->bi_iter.bi_sector += zwplug->wp_offset;
+
+		/*
+		 * Remember that this BIO is in fact a zone append operation
+		 * so that we can restore its operation code on completion.
+		 */
+		bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
+	} else {
+		/*
+		 * Check for non-sequential writes early because we avoid a
+		 * whole lot of error handling trouble if we don't send it off
+		 * to the driver.
+		 */
+		if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
+			goto err;
+	}
 
 	/* Advance the zone write pointer offset. */
 	zwplug->wp_offset += bio_sectors(bio);
@@ -1090,8 +1109,14 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 	}
 
 	/* Conventional zones do not need write plugging. */
-	if (bio_zone_is_conv(bio))
+	if (bio_zone_is_conv(bio)) {
+		/* Zone append to conventional zones is not allowed. */
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			bio_io_error(bio);
+			return true;
+		}
 		return false;
+	}
 
 	zwplug = bio_get_and_lock_zone_wplug(bio, &flags);
 	if (!zwplug) {
@@ -1136,10 +1161,10 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
  * @bio: The BIO being submitted
  * @nr_segs: The number of physical segments of @bio
  *
- * Handle write and write zeroes operations using zone write plugging.
- * Return true whenever @bio execution needs to be delayed through the zone
- * write plug. Otherwise, return false to let the submission path process
- * @bio normally.
+ * Handle write, write zeroes and zone append operations requiring emulation
+ * using zone write plugging. Return true whenever @bio execution needs to be
+ * delayed through the zone write plug. Otherwise, return false to let the
+ * submission path process @bio normally.
  */
 bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
 {
@@ -1174,6 +1199,9 @@ bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
 	 * machinery operates at the request level, below the plug, and
 	 * completion of the flush sequence will go through the regular BIO
 	 * completion, which will handle zone write plugging.
+	 * Zone append operations for devices that requested emulation must
+	 * also be plugged so that these BIOs can be changed into regular
+	 * write BIOs.
 	 * Zone reset, reset all and finish commands need special treatment
 	 * to correctly track the write pointer offset of zones. These commands
 	 * are not plugged as we do not need serialization with write
@@ -1181,6 +1209,10 @@ bool blk_zone_write_plug_bio(struct bio *bio, unsigned int nr_segs)
 	 * and finish commands when write operations are in flight.
 	 */
 	switch (bio_op(bio)) {
+	case REQ_OP_ZONE_APPEND:
+		if (!bdev_emulates_zone_append(bdev))
+			return false;
+		fallthrough;
 	case REQ_OP_WRITE:
 	case REQ_OP_WRITE_ZEROES:
 		return blk_zone_wplug_handle_write(bio, nr_segs);
@@ -1244,6 +1276,15 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 	/* Make sure we do not see this BIO again by clearing the plug flag. */
 	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
 
+	/*
+	 * If this is a regular write emulating a zone append operation,
+	 * restore the original operation code.
+	 */
+	if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
+		bio->bi_opf &= ~REQ_OP_MASK;
+		bio->bi_opf |= REQ_OP_ZONE_APPEND;
+	}
+
 	/*
 	 * If the BIO failed, mark the plug as having an error to trigger
 	 * recovery.
diff --git a/block/blk.h b/block/blk.h
index 7c7059655c97..72e9cf700e3f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -422,6 +422,11 @@ static inline bool bio_zone_write_plugging(struct bio *bio)
 {
 	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
 }
+static inline bool bio_is_zone_append(struct bio *bio)
+{
+	return bio_op(bio) == REQ_OP_ZONE_APPEND ||
+		bio_flagged(bio, BIO_EMULATES_ZONE_APPEND);
+}
 void blk_zone_write_plug_bio_merged(struct bio *bio);
 void blk_zone_write_plug_attempt_merge(struct request *rq);
 static inline void blk_zone_update_request_bio(struct request *rq,
@@ -431,8 +436,9 @@ static inline void blk_zone_update_request_bio(struct request *rq,
 	 * For zone append requests, the request sector indicates the location
 	 * at which the BIO data was written. Return this value to the BIO
 	 * issuer through the BIO iter sector.
-	 * For plugged zone writes, we need the original BIO sector so
-	 * that blk_zone_write_plug_bio_endio() can lookup the zone write plug.
+	 * For plugged zone writes, which include emulated zone append, we need
+	 * the original BIO sector so that blk_zone_write_plug_bio_endio() can
+	 * lookup the zone write plug.
 	 */
 	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
 		bio->bi_iter.bi_sector = rq->__sector;
@@ -454,6 +460,10 @@ static inline bool bio_zone_write_plugging(struct bio *bio)
 {
 	return false;
 }
+static inline bool bio_is_zone_append(struct bio *bio)
+{
+	return false;
+}
 static inline void blk_zone_write_plug_bio_merged(struct bio *bio)
 {
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ed45de07d2ef..29b3170431e7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -311,6 +311,7 @@ enum {
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
 	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
+	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
 	BIO_FLAG_LAST
 };
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 13/30] block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (11 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 12/30] block: Implement zone append emulation Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  0:43 ` [PATCH v3 14/30] dm: Use the block layer zone append emulation Damien Le Moal
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

In preparation for allowing BIO-based device drivers to use zone write
plugging and its zone append emulation, allow these drivers to call
blk_revalidate_disk_zones() so that all zone resources necessary for
zone write plugging can be initialized.

To do so, remove the check in blk_revalidate_disk_zones() restricting
the use of this function to mq (request-based) drivers so that BIO-based
drivers can use it as well. This is safe to do as long as the BIO-based
block device queue is already set up and usable, as it should be, and
can be safely frozen.

The helper function disk_need_zone_resources() is added to control the
allocation and initialization of the zone write plug hash table and
of the conventional zone bitmap only for mq devices and for BIO-based
devices that require zone append emulation.
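
For illustration, a BIO-based zoned driver wanting the block layer zone
append emulation could now initialize its zone resources with something
like the following sketch (the helper name is hypothetical; this assumes
the queue is already set up and can be safely frozen, and relies on the
zero zone append limit allowed by patch 11):

#include <linux/blkdev.h>

/* Hypothetical setup helper for a BIO-based zoned driver. */
static int example_bio_driver_init_zones(struct gendisk *disk,
					 unsigned int zone_sectors)
{
	struct request_queue *q = disk->queue;

	/* Set the zone size and request zone append emulation. */
	blk_queue_chunk_sectors(q, zone_sectors);
	blk_queue_max_zone_append_sectors(q, 0);

	/* Allocate the zone write plug resources and check all zones. */
	return blk_revalidate_disk_zones(disk, NULL);
}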

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-zoned.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index eec6a392ae31..c41ac1519818 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1606,6 +1606,19 @@ void disk_free_zone_resources(struct gendisk *disk)
 	disk->nr_zones = 0;
 }
 
+static inline bool disk_need_zone_resources(struct gendisk *disk)
+{
+	/*
+	 * All mq zoned devices need zone resources so that the block layer
+	 * can automatically handle write BIO plugging. BIO-based device drivers
+	 * (e.g. DM devices) are normally responsible for handling zone write
+	 * ordering and do not need zone resources, unless the driver requires
+	 * zone append emulation.
+	 */
+	return queue_is_mq(disk->queue) ||
+		queue_emulates_zone_append(disk->queue);
+}
+
 static int disk_revalidate_zone_resources(struct gendisk *disk,
 					  unsigned int nr_zones)
 {
@@ -1614,6 +1627,9 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
 	bool set_max_open = false;
 	int ret;
 
+	if (!disk_need_zone_resources(disk))
+		return 0;
+
 	/*
 	 * If the device has no limit on the maximum number of open and active
 	 * zones, use BLK_ZONE_DEFAULT_MAX_NR_WPLUGS for the maximum number
@@ -1717,6 +1733,9 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 				disk->disk_name);
 			return -ENODEV;
 		}
+
+		if (!disk_need_zone_resources(disk))
+			break;
 		if (!args->conv_zones_bitmap) {
 			args->conv_zones_bitmap =
 				blk_alloc_zone_bitmap(q->node, args->nr_zones);
@@ -1748,10 +1767,11 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		/*
 		 * We need to track the write pointer of all zones that are not
 		 * empty nor full. So make sure we have a zone write plug for
-		 * such zone.
+		 * such zone if the device has a zone write plug hash table.
 		 */
 		wp_offset = blk_zone_wp_offset(zone);
-		if (wp_offset && wp_offset < zone_sectors) {
+		if (disk->zone_wplugs_hash &&
+		    wp_offset && wp_offset < zone_sectors) {
 			zwplug = disk_get_and_lock_zone_wplug(disk, zone->start,
 							      GFP_NOIO, &flags);
 			if (!zwplug)
@@ -1782,8 +1802,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
  * be called within the disk ->revalidate method for blk-mq based drivers.
  * Before calling this function, the device driver must already have set the
  * device zone size (chunk_sector limit) and the max zone append limit.
- * For BIO based drivers, this function cannot be used. BIO based device drivers
- * only need to set disk->nr_zones so that the sysfs exposed value is correct.
+ * BIO based drivers can also use this function as long as the device queue
+ * can be safely frozen.
  * If the @update_driver_data callback function is not NULL, the callback is
  * executed with the device request queue frozen after all zones have been
  * checked.
@@ -1800,8 +1820,6 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 
 	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
 		return -EIO;
-	if (WARN_ON_ONCE(!queue_is_mq(q)))
-		return -EIO;
 
 	if (!capacity)
 		return -ENODEV;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 14/30] dm: Use the block layer zone append emulation
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (12 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 13/30] block: Allow BIO-based drivers to use blk_revalidate_disk_zones() Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  0:43 ` [PATCH v3 15/30] scsi: sd: " Damien Le Moal
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

For targets requiring zone append operation emulation with regular
writes (e.g. dm-crypt), we can use the block layer emulation provided by
zone write plugging. Remove the DM-implemented zone append emulation and
enable the block layer one.

This is done by setting the max_zone_append_sectors limit of the
mapped device queue to 0 for mapped devices that have a target table
that cannot support native zone append operations (e.g. dm-crypt).
Such mapped devices are flagged with the DMF_EMULATE_ZONE_APPEND flag.
dm_split_and_process_bio() is modified to execute
blk_zone_write_plug_bio() for such devices to let the block layer
transform zone append operations into regular writes. This is done
after ensuring that the submitted BIO is split if it straddles zone
boundaries. Both changes are implemented using the inline helpers
dm_zone_write_plug_bio() and dm_zone_bio_needs_split() respectively.

dm_revalidate_zones() is also modified to use the block layer provided
function blk_revalidate_disk_zones() so that all zone resources needed
for zone append emulation are initialized by the block layer without DM
core needing to do anything. Since the device table is not yet live when
dm_revalidate_zones() is executed, enabling the use of
blk_revalidate_disk_zones() requires adding a pointer to the device
table in struct mapped_device. This avoids errors in
dm_blk_report_zones() trying to get the table with dm_get_live_table().
The mapped device table pointer is set to the table passed as argument
to dm_revalidate_zones() before calling blk_revalidate_disk_zones() and
reset to NULL after this function returns, restoring the live table
handling for user calls to report zones.
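
Condensed, the pattern implemented by the hunks below is (excerpted,
with surrounding code elided):

	/* dm_revalidate_zones(): make the not-yet-live table reachable. */
	md->zone_revalidate_map = t;
	ret = blk_revalidate_disk_zones(disk, NULL);
	md->zone_revalidate_map = NULL;

	/* dm_blk_report_zones(): prefer the table being bound, if set. */
	if (md->zone_revalidate_map)
		map = md->zone_revalidate_map;
	else
		map = dm_get_live_table(md, &srcu_idx);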

All the code related to zone append emulation is removed from
dm-zone.c. This leads to simplifications of the functions __map_bio()
and dm_zone_endio(). The latter function now only needs to deal with
completions of native zone append operations for targets that support
them.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/dm-core.h |   2 +-
 drivers/md/dm-zone.c | 476 ++++---------------------------------------
 drivers/md/dm.c      |  75 ++++---
 drivers/md/dm.h      |   4 +-
 4 files changed, 97 insertions(+), 460 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index e6757a30dcca..08700bfc3e23 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -140,7 +140,7 @@ struct mapped_device {
 
 #ifdef CONFIG_BLK_DEV_ZONED
 	unsigned int nr_zones;
-	unsigned int *zwp_offset;
+	void *zone_revalidate_map;
 #endif
 
 #ifdef CONFIG_IMA
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index eb9832b22b14..174fda0a301c 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -60,16 +60,23 @@ int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 	struct dm_table *map;
 	int srcu_idx, ret;
 
-	if (dm_suspended_md(md))
-		return -EAGAIN;
+	if (!md->zone_revalidate_map) {
+		/* Regular user context */
+		if (dm_suspended_md(md))
+			return -EAGAIN;
 
-	map = dm_get_live_table(md, &srcu_idx);
-	if (!map)
-		return -EIO;
+		map = dm_get_live_table(md, &srcu_idx);
+		if (!map)
+			return -EIO;
+	} else {
+		/* Zone revalidation during __bind() */
+		map = md->zone_revalidate_map;
+	}
 
 	ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
 
-	dm_put_live_table(md, srcu_idx);
+	if (!md->zone_revalidate_map)
+		dm_put_live_table(md, srcu_idx);
 
 	return ret;
 }
@@ -138,85 +145,6 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
 	}
 }
 
-void dm_cleanup_zoned_dev(struct mapped_device *md)
-{
-	if (md->disk) {
-		bitmap_free(md->disk->conv_zones_bitmap);
-		md->disk->conv_zones_bitmap = NULL;
-		bitmap_free(md->disk->seq_zones_wlock);
-		md->disk->seq_zones_wlock = NULL;
-	}
-
-	kvfree(md->zwp_offset);
-	md->zwp_offset = NULL;
-	md->nr_zones = 0;
-}
-
-static unsigned int dm_get_zone_wp_offset(struct blk_zone *zone)
-{
-	switch (zone->cond) {
-	case BLK_ZONE_COND_IMP_OPEN:
-	case BLK_ZONE_COND_EXP_OPEN:
-	case BLK_ZONE_COND_CLOSED:
-		return zone->wp - zone->start;
-	case BLK_ZONE_COND_FULL:
-		return zone->len;
-	case BLK_ZONE_COND_EMPTY:
-	case BLK_ZONE_COND_NOT_WP:
-	case BLK_ZONE_COND_OFFLINE:
-	case BLK_ZONE_COND_READONLY:
-	default:
-		/*
-		 * Conventional, offline and read-only zones do not have a valid
-		 * write pointer. Use 0 as for an empty zone.
-		 */
-		return 0;
-	}
-}
-
-static int dm_zone_revalidate_cb(struct blk_zone *zone, unsigned int idx,
-				 void *data)
-{
-	struct mapped_device *md = data;
-	struct gendisk *disk = md->disk;
-
-	switch (zone->type) {
-	case BLK_ZONE_TYPE_CONVENTIONAL:
-		if (!disk->conv_zones_bitmap) {
-			disk->conv_zones_bitmap = bitmap_zalloc(disk->nr_zones,
-								GFP_NOIO);
-			if (!disk->conv_zones_bitmap)
-				return -ENOMEM;
-		}
-		set_bit(idx, disk->conv_zones_bitmap);
-		break;
-	case BLK_ZONE_TYPE_SEQWRITE_REQ:
-	case BLK_ZONE_TYPE_SEQWRITE_PREF:
-		if (!disk->seq_zones_wlock) {
-			disk->seq_zones_wlock = bitmap_zalloc(disk->nr_zones,
-							      GFP_NOIO);
-			if (!disk->seq_zones_wlock)
-				return -ENOMEM;
-		}
-		if (!md->zwp_offset) {
-			md->zwp_offset =
-				kvcalloc(disk->nr_zones, sizeof(unsigned int),
-					 GFP_KERNEL);
-			if (!md->zwp_offset)
-				return -ENOMEM;
-		}
-		md->zwp_offset[idx] = dm_get_zone_wp_offset(zone);
-
-		break;
-	default:
-		DMERR("Invalid zone type 0x%x at sectors %llu",
-		      (int)zone->type, zone->start);
-		return -ENODEV;
-	}
-
-	return 0;
-}
-
 /*
  * Revalidate the zones of a mapped device to initialize resource necessary
  * for zone append emulation. Note that we cannot simply use the block layer
@@ -226,41 +154,32 @@ static int dm_zone_revalidate_cb(struct blk_zone *zone, unsigned int idx,
 static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
 {
 	struct gendisk *disk = md->disk;
-	unsigned int noio_flag;
 	int ret;
 
-	/*
-	 * Check if something changed. If yes, cleanup the current resources
-	 * and reallocate everything.
-	 */
+	/* Revalidate only if something changed. */
 	if (!disk->nr_zones || disk->nr_zones != md->nr_zones)
-		dm_cleanup_zoned_dev(md);
+		md->nr_zones = 0;
+
 	if (md->nr_zones)
 		return 0;
 
 	/*
-	 * Scan all zones to initialize everything. Ensure that all vmalloc
-	 * operations in this context are done as if GFP_NOIO was specified.
+	 * Our table is not live yet. So the call to dm_get_live_table()
+	 * in dm_blk_report_zones() will fail. So set a temporary pointer to
+	 * our table for dm_blk_report_zones() to use directly.
 	 */
-	noio_flag = memalloc_noio_save();
-	ret = dm_blk_do_report_zones(md, t, 0, disk->nr_zones,
-				     dm_zone_revalidate_cb, md);
-	memalloc_noio_restore(noio_flag);
-	if (ret < 0)
-		goto err;
-	if (ret != disk->nr_zones) {
-		ret = -EIO;
-		goto err;
+	md->zone_revalidate_map = t;
+	ret = blk_revalidate_disk_zones(disk, NULL);
+	md->zone_revalidate_map = NULL;
+
+	if (ret) {
+		DMERR("Revalidate zones failed %d", ret);
+		return ret;
 	}
 
 	md->nr_zones = disk->nr_zones;
 
 	return 0;
-
-err:
-	DMERR("Revalidate zones failed %d", ret);
-	dm_cleanup_zoned_dev(md);
-	return ret;
 }
 
 static int device_not_zone_append_capable(struct dm_target *ti,
@@ -291,292 +210,27 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 	struct mapped_device *md = t->md;
 
 	/*
-	 * For a zoned target, the number of zones should be updated for the
-	 * correct value to be exposed in sysfs queue/nr_zones.
+	 * Check if zone append is natively supported, and if not, set the
+	 * mapped device queue as needing zone append emulation.
 	 */
 	WARN_ON_ONCE(queue_is_mq(q));
-	md->disk->nr_zones = bdev_nr_zones(md->disk->part0);
-
-	/* Check if zone append is natively supported */
 	if (dm_table_supports_zone_append(t)) {
 		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
-		dm_cleanup_zoned_dev(md);
-		return 0;
+	} else {
+		set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+		blk_queue_max_zone_append_sectors(q, 0);
 	}
 
-	/*
-	 * Mark the mapped device as needing zone append emulation and
-	 * initialize the emulation resources once the capacity is set.
-	 */
-	set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
 	if (!get_capacity(md->disk))
 		return 0;
 
-	return dm_revalidate_zones(md, t);
-}
-
-static int dm_update_zone_wp_offset_cb(struct blk_zone *zone, unsigned int idx,
-				       void *data)
-{
-	unsigned int *wp_offset = data;
-
-	*wp_offset = dm_get_zone_wp_offset(zone);
-
-	return 0;
-}
-
-static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
-				    unsigned int *wp_ofst)
-{
-	sector_t sector = zno * bdev_zone_sectors(md->disk->part0);
-	unsigned int noio_flag;
-	struct dm_table *t;
-	int srcu_idx, ret;
-
-	t = dm_get_live_table(md, &srcu_idx);
-	if (!t)
-		return -EIO;
-
-	/*
-	 * Ensure that all memory allocations in this context are done as if
-	 * GFP_NOIO was specified.
-	 */
-	noio_flag = memalloc_noio_save();
-	ret = dm_blk_do_report_zones(md, t, sector, 1,
-				     dm_update_zone_wp_offset_cb, wp_ofst);
-	memalloc_noio_restore(noio_flag);
-
-	dm_put_live_table(md, srcu_idx);
-
-	if (ret != 1)
-		return -EIO;
-
-	return 0;
-}
-
-struct orig_bio_details {
-	enum req_op op;
-	unsigned int nr_sectors;
-};
-
-/*
- * First phase of BIO mapping for targets with zone append emulation:
- * check all BIO that change a zone writer pointer and change zone
- * append operations into regular write operations.
- */
-static bool dm_zone_map_bio_begin(struct mapped_device *md,
-				  unsigned int zno, struct bio *clone)
-{
-	sector_t zsectors = bdev_zone_sectors(md->disk->part0);
-	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
-
-	/*
-	 * If the target zone is in an error state, recover by inspecting the
-	 * zone to get its current write pointer position. Note that since the
-	 * target zone is already locked, a BIO issuing context should never
-	 * see the zone write in the DM_ZONE_UPDATING_WP_OFST state.
-	 */
-	if (zwp_offset == DM_ZONE_INVALID_WP_OFST) {
-		if (dm_update_zone_wp_offset(md, zno, &zwp_offset))
-			return false;
-		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
-	}
-
-	switch (bio_op(clone)) {
-	case REQ_OP_ZONE_RESET:
-	case REQ_OP_ZONE_FINISH:
-		return true;
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-		/* Writes must be aligned to the zone write pointer */
-		if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
-			return false;
-		break;
-	case REQ_OP_ZONE_APPEND:
-		/*
-		 * Change zone append operations into a non-mergeable regular
-		 * writes directed at the current write pointer position of the
-		 * target zone.
-		 */
-		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
-			(clone->bi_opf & (~REQ_OP_MASK));
-		clone->bi_iter.bi_sector += zwp_offset;
-		break;
-	default:
-		DMWARN_LIMIT("Invalid BIO operation");
-		return false;
-	}
-
-	/* Cannot write to a full zone */
-	if (zwp_offset >= zsectors)
-		return false;
-
-	return true;
-}
-
-/*
- * Second phase of BIO mapping for targets with zone append emulation:
- * update the zone write pointer offset array to account for the additional
- * data written to a zone. Note that at this point, the remapped clone BIO
- * may already have completed, so we do not touch it.
- */
-static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
-					struct orig_bio_details *orig_bio_details,
-					unsigned int nr_sectors)
-{
-	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
-
-	/* The clone BIO may already have been completed and failed */
-	if (zwp_offset == DM_ZONE_INVALID_WP_OFST)
-		return BLK_STS_IOERR;
-
-	/* Update the zone wp offset */
-	switch (orig_bio_details->op) {
-	case REQ_OP_ZONE_RESET:
-		WRITE_ONCE(md->zwp_offset[zno], 0);
-		return BLK_STS_OK;
-	case REQ_OP_ZONE_FINISH:
-		WRITE_ONCE(md->zwp_offset[zno],
-			   bdev_zone_sectors(md->disk->part0));
-		return BLK_STS_OK;
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-		WRITE_ONCE(md->zwp_offset[zno], zwp_offset + nr_sectors);
-		return BLK_STS_OK;
-	case REQ_OP_ZONE_APPEND:
-		/*
-		 * Check that the target did not truncate the write operation
-		 * emulating a zone append.
-		 */
-		if (nr_sectors != orig_bio_details->nr_sectors) {
-			DMWARN_LIMIT("Truncated write for zone append");
-			return BLK_STS_IOERR;
-		}
-		WRITE_ONCE(md->zwp_offset[zno], zwp_offset + nr_sectors);
-		return BLK_STS_OK;
-	default:
-		DMWARN_LIMIT("Invalid BIO operation");
-		return BLK_STS_IOERR;
+	if (!md->disk->nr_zones) {
+		DMINFO("%s using %s zone append",
+		       md->disk->disk_name,
+		       queue_emulates_zone_append(q) ? "emulated" : "native");
 	}
-}
-
-static inline void dm_zone_lock(struct gendisk *disk, unsigned int zno,
-				struct bio *clone)
-{
-	if (WARN_ON_ONCE(bio_flagged(clone, BIO_ZONE_WRITE_LOCKED)))
-		return;
-
-	wait_on_bit_lock_io(disk->seq_zones_wlock, zno, TASK_UNINTERRUPTIBLE);
-	bio_set_flag(clone, BIO_ZONE_WRITE_LOCKED);
-}
-
-static inline void dm_zone_unlock(struct gendisk *disk, unsigned int zno,
-				  struct bio *clone)
-{
-	if (!bio_flagged(clone, BIO_ZONE_WRITE_LOCKED))
-		return;
-
-	WARN_ON_ONCE(!test_bit(zno, disk->seq_zones_wlock));
-	clear_bit_unlock(zno, disk->seq_zones_wlock);
-	smp_mb__after_atomic();
-	wake_up_bit(disk->seq_zones_wlock, zno);
-
-	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
-}
 
-static bool dm_need_zone_wp_tracking(struct bio *bio)
-{
-	/*
-	 * Special processing is not needed for operations that do not need the
-	 * zone write lock, that is, all operations that target conventional
-	 * zones and all operations that do not modify directly a sequential
-	 * zone write pointer.
-	 */
-	if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
-		return false;
-	switch (bio_op(bio)) {
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-	case REQ_OP_ZONE_RESET:
-	case REQ_OP_ZONE_FINISH:
-	case REQ_OP_ZONE_APPEND:
-		return bio_zone_is_seq(bio);
-	default:
-		return false;
-	}
-}
-
-/*
- * Special IO mapping for targets needing zone append emulation.
- */
-int dm_zone_map_bio(struct dm_target_io *tio)
-{
-	struct dm_io *io = tio->io;
-	struct dm_target *ti = tio->ti;
-	struct mapped_device *md = io->md;
-	struct bio *clone = &tio->clone;
-	struct orig_bio_details orig_bio_details;
-	unsigned int zno;
-	blk_status_t sts;
-	int r;
-
-	/*
-	 * IOs that do not change a zone write pointer do not need
-	 * any additional special processing.
-	 */
-	if (!dm_need_zone_wp_tracking(clone))
-		return ti->type->map(ti, clone);
-
-	/* Lock the target zone */
-	zno = bio_zone_no(clone);
-	dm_zone_lock(md->disk, zno, clone);
-
-	orig_bio_details.nr_sectors = bio_sectors(clone);
-	orig_bio_details.op = bio_op(clone);
-
-	/*
-	 * Check that the bio and the target zone write pointer offset are
-	 * both valid, and if the bio is a zone append, remap it to a write.
-	 */
-	if (!dm_zone_map_bio_begin(md, zno, clone)) {
-		dm_zone_unlock(md->disk, zno, clone);
-		return DM_MAPIO_KILL;
-	}
-
-	/* Let the target do its work */
-	r = ti->type->map(ti, clone);
-	switch (r) {
-	case DM_MAPIO_SUBMITTED:
-		/*
-		 * The target submitted the clone BIO. The target zone will
-		 * be unlocked on completion of the clone.
-		 */
-		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
-					  *tio->len_ptr);
-		break;
-	case DM_MAPIO_REMAPPED:
-		/*
-		 * The target only remapped the clone BIO. In case of error,
-		 * unlock the target zone here as the clone will not be
-		 * submitted.
-		 */
-		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
-					  *tio->len_ptr);
-		if (sts != BLK_STS_OK)
-			dm_zone_unlock(md->disk, zno, clone);
-		break;
-	case DM_MAPIO_REQUEUE:
-	case DM_MAPIO_KILL:
-	default:
-		dm_zone_unlock(md->disk, zno, clone);
-		sts = BLK_STS_IOERR;
-		break;
-	}
-
-	if (sts != BLK_STS_OK)
-		return DM_MAPIO_KILL;
-
-	return r;
+	return dm_revalidate_zones(md, t);
 }
 
 /*
@@ -587,61 +241,17 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
 	struct mapped_device *md = io->md;
 	struct gendisk *disk = md->disk;
 	struct bio *orig_bio = io->orig_bio;
-	unsigned int zwp_offset;
-	unsigned int zno;
 
 	/*
-	 * For targets that do not emulate zone append, we only need to
-	 * handle native zone-append bios.
+	 * Get the offset within the zone of the written sector
+	 * and add that to the original bio sector position.
 	 */
-	if (!dm_emulate_zone_append(md)) {
-		/*
-		 * Get the offset within the zone of the written sector
-		 * and add that to the original bio sector position.
-		 */
-		if (clone->bi_status == BLK_STS_OK &&
-		    bio_op(clone) == REQ_OP_ZONE_APPEND) {
-			sector_t mask =
-				(sector_t)bdev_zone_sectors(disk->part0) - 1;
-
-			orig_bio->bi_iter.bi_sector +=
-				clone->bi_iter.bi_sector & mask;
-		}
-
-		return;
-	}
+	if (clone->bi_status == BLK_STS_OK &&
+	    bio_op(clone) == REQ_OP_ZONE_APPEND) {
+		sector_t mask = bdev_zone_sectors(disk->part0) - 1;
 
-	/*
-	 * For targets that do emulate zone append, if the clone BIO does not
-	 * own the target zone write lock, we have nothing to do.
-	 */
-	if (!bio_flagged(clone, BIO_ZONE_WRITE_LOCKED))
-		return;
-
-	zno = bio_zone_no(orig_bio);
-
-	if (clone->bi_status != BLK_STS_OK) {
-		/*
-		 * BIOs that modify a zone write pointer may leave the zone
-		 * in an unknown state in case of failure (e.g. the write
-		 * pointer was only partially advanced). In this case, set
-		 * the target zone write pointer as invalid unless it is
-		 * already being updated.
-		 */
-		WRITE_ONCE(md->zwp_offset[zno], DM_ZONE_INVALID_WP_OFST);
-	} else if (bio_op(orig_bio) == REQ_OP_ZONE_APPEND) {
-		/*
-		 * Get the written sector for zone append operation that were
-		 * emulated using regular write operations.
-		 */
-		zwp_offset = READ_ONCE(md->zwp_offset[zno]);
-		if (WARN_ON_ONCE(zwp_offset < bio_sectors(orig_bio)))
-			WRITE_ONCE(md->zwp_offset[zno],
-				   DM_ZONE_INVALID_WP_OFST);
-		else
-			orig_bio->bi_iter.bi_sector +=
-				zwp_offset - bio_sectors(orig_bio);
+		orig_bio->bi_iter.bi_sector += clone->bi_iter.bi_sector & mask;
 	}
 
-	dm_zone_unlock(disk, zno, clone);
+	return;
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 56aa2a8b9d71..18abff72f206 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1422,25 +1422,12 @@ static void __map_bio(struct bio *clone)
 		down(&md->swap_bios_semaphore);
 	}
 
-	if (static_branch_unlikely(&zoned_enabled)) {
-		/*
-		 * Check if the IO needs a special mapping due to zone append
-		 * emulation on zoned target. In this case, dm_zone_map_bio()
-		 * calls the target map operation.
-		 */
-		if (unlikely(dm_emulate_zone_append(md)))
-			r = dm_zone_map_bio(tio);
-		else
-			goto do_map;
-	} else {
-do_map:
-		if (likely(ti->type->map == linear_map))
-			r = linear_map(ti, clone);
-		else if (ti->type->map == stripe_map)
-			r = stripe_map(ti, clone);
-		else
-			r = ti->type->map(ti, clone);
-	}
+	if (likely(ti->type->map == linear_map))
+		r = linear_map(ti, clone);
+	else if (ti->type->map == stripe_map)
+		r = stripe_map(ti, clone);
+	else
+		r = ti->type->map(ti, clone);
 
 	switch (r) {
 	case DM_MAPIO_SUBMITTED:
@@ -1768,6 +1755,35 @@ static void init_clone_info(struct clone_info *ci, struct dm_io *io,
 		ci->sector_count = 0;
 }
 
+#ifdef CONFIG_BLK_DEV_ZONED
+static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
+					   struct bio *bio)
+{
+	/*
+	 * For mapped devices that need zone append emulation, we must
+	 * split any large BIO that straddles zone boundaries.
+	 */
+	return dm_emulate_zone_append(md) && bio_straddles_zones(bio) &&
+		!bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
+}
+static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
+					  struct bio *bio)
+{
+	return dm_emulate_zone_append(md) && blk_zone_write_plug_bio(bio, 0);
+}
+#else
+static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
+					   struct bio *bio)
+{
+	return false;
+}
+static inline bool dm_zone_write_plug_bio(struct mapped_device *md,
+					  struct bio *bio)
+{
+	return false;
+}
+#endif
+
 /*
  * Entry point to split a bio into clones and submit them to the targets.
  */
@@ -1777,19 +1793,33 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	struct clone_info ci;
 	struct dm_io *io;
 	blk_status_t error = BLK_STS_OK;
-	bool is_abnormal;
+	bool is_abnormal, need_split;
 
-	is_abnormal = is_abnormal_io(bio);
-	if (unlikely(is_abnormal)) {
+	need_split = is_abnormal = is_abnormal_io(bio);
+	if (static_branch_unlikely(&zoned_enabled))
+		need_split = is_abnormal || dm_zone_bio_needs_split(md, bio);
+
+	if (unlikely(need_split)) {
 		/*
 		 * Use bio_split_to_limits() for abnormal IO (e.g. discard, etc)
 		 * otherwise associated queue_limits won't be imposed.
+		 * Also split the BIO for mapped devices needing zone append
+		 * emulation to ensure that the BIO does not cross zone
+		 * boundaries.
 		 */
 		bio = bio_split_to_limits(bio);
 		if (!bio)
 			return;
 	}
 
+	/*
+	 * Use the block layer zone write plugging for mapped devices that
+	 * need zone append emulation (e.g. dm-crypt).
+	 */
+	if (static_branch_unlikely(&zoned_enabled) &&
+	    dm_zone_write_plug_bio(md, bio))
+		return;
+
 	/* Only support nowait for normal IO */
 	if (unlikely(bio->bi_opf & REQ_NOWAIT) && !is_abnormal) {
 		io = alloc_io(md, bio, GFP_NOWAIT);
@@ -2010,7 +2040,6 @@ static void cleanup_mapped_device(struct mapped_device *md)
 		md->dax_dev = NULL;
 	}
 
-	dm_cleanup_zoned_dev(md);
 	if (md->disk) {
 		spin_lock(&_minor_lock);
 		md->disk->private_data = NULL;
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 7f1acbf6bd9e..6e951cd42074 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -104,14 +104,12 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
 int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q);
 void dm_zone_endio(struct dm_io *io, struct bio *clone);
 #ifdef CONFIG_BLK_DEV_ZONED
-void dm_cleanup_zoned_dev(struct mapped_device *md);
 int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
 bool dm_is_zone_write(struct mapped_device *md, struct bio *bio);
 int dm_zone_map_bio(struct dm_target_io *io);
 #else
-static inline void dm_cleanup_zoned_dev(struct mapped_device *md) {}
-#define dm_blk_report_zones	NULL
+#define dm_blk_report_zones    NULL
 static inline bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
 {
 	return false;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 15/30] scsi: sd: Use the block layer zone append emulation
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (13 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 14/30] dm: Use the block layer zone append emulation Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:50   ` Christoph Hellwig
                     ` (2 more replies)
  2024-03-28  0:43 ` [PATCH v3 16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
                   ` (15 subsequent siblings)
  30 siblings, 3 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Set the request queue of a TYPE_ZBC device as needing zone append
emulation by setting the device queue max_zone_append_sectors limit to
0. This enables the block layer generic implementation provided by zone
write plugging. With this, the sd driver will never see a
REQ_OP_ZONE_APPEND request and the zone append emulation code
implemented in sd_zbc.c can be removed.
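
The resulting revalidation flow in sd_zbc_revalidate_zones() reduces to
the following (condensed from the hunk below, error handling elided):

	blk_queue_chunk_sectors(q,
			logical_to_sectors(sdkp->device, zone_blocks));

	/* A zero limit enables the block layer zone append emulation. */
	blk_queue_max_zone_append_sectors(q, 0);

	flags = memalloc_noio_save();
	ret = blk_revalidate_disk_zones(disk, NULL);
	memalloc_noio_restore(flags);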

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/scsi/sd.c     |   8 -
 drivers/scsi/sd.h     |  19 ---
 drivers/scsi/sd_zbc.c | 335 ++----------------------------------------
 3 files changed, 10 insertions(+), 352 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ccff8f2e2e75..35a486232a76 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1260,12 +1260,6 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 		}
 	}
 
-	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
-		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
-		if (ret)
-			goto fail;
-	}
-
 	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
 	dix = scsi_prot_sg_count(cmd);
 	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
@@ -1348,7 +1342,6 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
 		return sd_setup_flush_cmnd(cmd);
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
-	case REQ_OP_ZONE_APPEND:
 		return sd_setup_read_write_cmnd(cmd);
 	case REQ_OP_ZONE_RESET:
 		return sd_zbc_setup_zone_mgmt_cmnd(cmd, ZO_RESET_WRITE_POINTER,
@@ -3979,7 +3972,6 @@ static void scsi_disk_release(struct device *dev)
 	struct scsi_disk *sdkp = to_scsi_disk(dev);
 
 	ida_free(&sd_index_ida, sdkp->index);
-	sd_zbc_free_zone_info(sdkp);
 	put_device(&sdkp->device->sdev_gendev);
 	free_opal_dev(sdkp->opal_dev);
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 5c4285a582b2..49dd600bfa48 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -104,12 +104,6 @@ struct scsi_disk {
 	 * between zone starting LBAs is constant.
 	 */
 	u32		zone_starting_lba_gran;
-	u32		*zones_wp_offset;
-	spinlock_t	zones_wp_offset_lock;
-	u32		*rev_wp_offset;
-	struct mutex	rev_mutex;
-	struct work_struct zone_wp_offset_work;
-	char		*zone_wp_update_buf;
 #endif
 	atomic_t	openers;
 	sector_t	capacity;	/* size in logical blocks */
@@ -245,7 +239,6 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 
 #ifdef CONFIG_BLK_DEV_ZONED
 
-void sd_zbc_free_zone_info(struct scsi_disk *sdkp);
 int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE]);
 int sd_zbc_revalidate_zones(struct scsi_disk *sdkp);
 blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
@@ -255,13 +248,8 @@ unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
 
-blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
-				        unsigned int nr_blocks);
-
 #else /* CONFIG_BLK_DEV_ZONED */
 
-static inline void sd_zbc_free_zone_info(struct scsi_disk *sdkp) {}
-
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE])
 {
 	return 0;
@@ -285,13 +273,6 @@ static inline unsigned int sd_zbc_complete(struct scsi_cmnd *cmd,
 	return good_bytes;
 }
 
-static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
-						      sector_t *lba,
-						      unsigned int nr_blocks)
-{
-	return BLK_STS_TARGET;
-}
-
 #define sd_zbc_report_zones NULL
 
 #endif /* CONFIG_BLK_DEV_ZONED */
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 26af5ab7d7c1..d0ead9858954 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -23,36 +23,6 @@
 #define CREATE_TRACE_POINTS
 #include "sd_trace.h"
 
-/**
- * sd_zbc_get_zone_wp_offset - Get zone write pointer offset.
- * @zone: Zone for which to return the write pointer offset.
- *
- * Return: offset of the write pointer from the start of the zone.
- */
-static unsigned int sd_zbc_get_zone_wp_offset(struct blk_zone *zone)
-{
-	if (zone->type == ZBC_ZONE_TYPE_CONV)
-		return 0;
-
-	switch (zone->cond) {
-	case BLK_ZONE_COND_IMP_OPEN:
-	case BLK_ZONE_COND_EXP_OPEN:
-	case BLK_ZONE_COND_CLOSED:
-		return zone->wp - zone->start;
-	case BLK_ZONE_COND_FULL:
-		return zone->len;
-	case BLK_ZONE_COND_EMPTY:
-	case BLK_ZONE_COND_OFFLINE:
-	case BLK_ZONE_COND_READONLY:
-	default:
-		/*
-		 * Offline and read-only zones do not have a valid
-		 * write pointer. Use 0 as for an empty zone.
-		 */
-		return 0;
-	}
-}
-
 /* Whether or not a SCSI zone descriptor describes a gap zone. */
 static bool sd_zbc_is_gap_zone(const u8 buf[64])
 {
@@ -121,9 +91,6 @@ static int sd_zbc_parse_report(struct scsi_disk *sdkp, const u8 buf[64],
 	if (ret)
 		return ret;
 
-	if (sdkp->rev_wp_offset)
-		sdkp->rev_wp_offset[idx] = sd_zbc_get_zone_wp_offset(&zone);
-
 	return 0;
 }
 
@@ -347,123 +314,6 @@ static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd)
 	return BLK_STS_OK;
 }
 
-#define SD_ZBC_INVALID_WP_OFST	(~0u)
-#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)
-
-static int sd_zbc_update_wp_offset_cb(struct blk_zone *zone, unsigned int idx,
-				    void *data)
-{
-	struct scsi_disk *sdkp = data;
-
-	lockdep_assert_held(&sdkp->zones_wp_offset_lock);
-
-	sdkp->zones_wp_offset[idx] = sd_zbc_get_zone_wp_offset(zone);
-
-	return 0;
-}
-
-/*
- * An attempt to append a zone triggered an invalid write pointer error.
- * Reread the write pointer of the zone(s) in which the append failed.
- */
-static void sd_zbc_update_wp_offset_workfn(struct work_struct *work)
-{
-	struct scsi_disk *sdkp;
-	unsigned long flags;
-	sector_t zno;
-	int ret;
-
-	sdkp = container_of(work, struct scsi_disk, zone_wp_offset_work);
-
-	spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
-	for (zno = 0; zno < sdkp->zone_info.nr_zones; zno++) {
-		if (sdkp->zones_wp_offset[zno] != SD_ZBC_UPDATING_WP_OFST)
-			continue;
-
-		spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);
-		ret = sd_zbc_do_report_zones(sdkp, sdkp->zone_wp_update_buf,
-					     SD_BUF_SIZE,
-					     zno * sdkp->zone_info.zone_blocks, true);
-		spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
-		if (!ret)
-			sd_zbc_parse_report(sdkp, sdkp->zone_wp_update_buf + 64,
-					    zno, sd_zbc_update_wp_offset_cb,
-					    sdkp);
-	}
-	spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);
-
-	scsi_device_put(sdkp->device);
-}
-
-/**
- * sd_zbc_prepare_zone_append() - Prepare an emulated ZONE_APPEND command.
- * @cmd: the command to setup
- * @lba: the LBA to patch
- * @nr_blocks: the number of LBAs to be written
- *
- * Called from sd_setup_read_write_cmnd() for REQ_OP_ZONE_APPEND.
- * @sd_zbc_prepare_zone_append() handles the necessary zone wrote locking and
- * patching of the lba for an emulated ZONE_APPEND command.
- *
- * In case the cached write pointer offset is %SD_ZBC_INVALID_WP_OFST it will
- * schedule a REPORT ZONES command and return BLK_STS_IOERR.
- */
-blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba,
-					unsigned int nr_blocks)
-{
-	struct request *rq = scsi_cmd_to_rq(cmd);
-	struct scsi_disk *sdkp = scsi_disk(rq->q->disk);
-	unsigned int wp_offset, zno = blk_rq_zone_no(rq);
-	unsigned long flags;
-	blk_status_t ret;
-
-	ret = sd_zbc_cmnd_checks(cmd);
-	if (ret != BLK_STS_OK)
-		return ret;
-
-	if (!blk_rq_zone_is_seq(rq))
-		return BLK_STS_IOERR;
-
-	/* Unlock of the write lock will happen in sd_zbc_complete() */
-	if (!blk_req_zone_write_trylock(rq))
-		return BLK_STS_ZONE_RESOURCE;
-
-	spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
-	wp_offset = sdkp->zones_wp_offset[zno];
-	switch (wp_offset) {
-	case SD_ZBC_INVALID_WP_OFST:
-		/*
-		 * We are about to schedule work to update a zone write pointer
-		 * offset, which will cause the zone append command to be
-		 * requeued. So make sure that the scsi device does not go away
-		 * while the work is being processed.
-		 */
-		if (scsi_device_get(sdkp->device)) {
-			ret = BLK_STS_IOERR;
-			break;
-		}
-		sdkp->zones_wp_offset[zno] = SD_ZBC_UPDATING_WP_OFST;
-		schedule_work(&sdkp->zone_wp_offset_work);
-		fallthrough;
-	case SD_ZBC_UPDATING_WP_OFST:
-		ret = BLK_STS_DEV_RESOURCE;
-		break;
-	default:
-		wp_offset = sectors_to_logical(sdkp->device, wp_offset);
-		if (wp_offset + nr_blocks > sdkp->zone_info.zone_blocks) {
-			ret = BLK_STS_IOERR;
-			break;
-		}
-
-		trace_scsi_prepare_zone_append(cmd, *lba, wp_offset);
-		*lba += wp_offset;
-	}
-	spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);
-	if (ret)
-		blk_req_zone_write_unlock(rq);
-	return ret;
-}
-
 /**
  * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations
  *			can be RESET WRITE POINTER, OPEN, CLOSE or FINISH.
@@ -504,96 +354,6 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
 	return BLK_STS_OK;
 }
 
-static bool sd_zbc_need_zone_wp_update(struct request *rq)
-{
-	switch (req_op(rq)) {
-	case REQ_OP_ZONE_APPEND:
-	case REQ_OP_ZONE_FINISH:
-	case REQ_OP_ZONE_RESET:
-	case REQ_OP_ZONE_RESET_ALL:
-		return true;
-	case REQ_OP_WRITE:
-	case REQ_OP_WRITE_ZEROES:
-		return blk_rq_zone_is_seq(rq);
-	default:
-		return false;
-	}
-}
-
-/**
- * sd_zbc_zone_wp_update - Update cached zone write pointer upon cmd completion
- * @cmd: Completed command
- * @good_bytes: Command reply bytes
- *
- * Called from sd_zbc_complete() to handle the update of the cached zone write
- * pointer value in case an update is needed.
- */
-static unsigned int sd_zbc_zone_wp_update(struct scsi_cmnd *cmd,
-					  unsigned int good_bytes)
-{
-	int result = cmd->result;
-	struct request *rq = scsi_cmd_to_rq(cmd);
-	struct scsi_disk *sdkp = scsi_disk(rq->q->disk);
-	unsigned int zno = blk_rq_zone_no(rq);
-	enum req_op op = req_op(rq);
-	unsigned long flags;
-
-	/*
-	 * If we got an error for a command that needs updating the write
-	 * pointer offset cache, we must mark the zone wp offset entry as
-	 * invalid to force an update from disk the next time a zone append
-	 * command is issued.
-	 */
-	spin_lock_irqsave(&sdkp->zones_wp_offset_lock, flags);
-
-	if (result && op != REQ_OP_ZONE_RESET_ALL) {
-		if (op == REQ_OP_ZONE_APPEND) {
-			/* Force complete completion (no retry) */
-			good_bytes = 0;
-			scsi_set_resid(cmd, blk_rq_bytes(rq));
-		}
-
-		/*
-		 * Force an update of the zone write pointer offset on
-		 * the next zone append access.
-		 */
-		if (sdkp->zones_wp_offset[zno] != SD_ZBC_UPDATING_WP_OFST)
-			sdkp->zones_wp_offset[zno] = SD_ZBC_INVALID_WP_OFST;
-		goto unlock_wp_offset;
-	}
-
-	switch (op) {
-	case REQ_OP_ZONE_APPEND:
-		trace_scsi_zone_wp_update(cmd, rq->__sector,
-				  sdkp->zones_wp_offset[zno], good_bytes);
-		rq->__sector += sdkp->zones_wp_offset[zno];
-		fallthrough;
-	case REQ_OP_WRITE_ZEROES:
-	case REQ_OP_WRITE:
-		if (sdkp->zones_wp_offset[zno] < sd_zbc_zone_sectors(sdkp))
-			sdkp->zones_wp_offset[zno] +=
-						good_bytes >> SECTOR_SHIFT;
-		break;
-	case REQ_OP_ZONE_RESET:
-		sdkp->zones_wp_offset[zno] = 0;
-		break;
-	case REQ_OP_ZONE_FINISH:
-		sdkp->zones_wp_offset[zno] = sd_zbc_zone_sectors(sdkp);
-		break;
-	case REQ_OP_ZONE_RESET_ALL:
-		memset(sdkp->zones_wp_offset, 0,
-		       sdkp->zone_info.nr_zones * sizeof(unsigned int));
-		break;
-	default:
-		break;
-	}
-
-unlock_wp_offset:
-	spin_unlock_irqrestore(&sdkp->zones_wp_offset_lock, flags);
-
-	return good_bytes;
-}
-
 /**
  * sd_zbc_complete - ZBC command post processing.
  * @cmd: Completed command
@@ -619,11 +379,7 @@ unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		 * so be quiet about the error.
 		 */
 		rq->rq_flags |= RQF_QUIET;
-	} else if (sd_zbc_need_zone_wp_update(rq))
-		good_bytes = sd_zbc_zone_wp_update(cmd, good_bytes);
-
-	if (req_op(rq) == REQ_OP_ZONE_APPEND)
-		blk_req_zone_write_unlock(rq);
+	}
 
 	return good_bytes;
 }
@@ -780,46 +536,6 @@ static void sd_zbc_print_zones(struct scsi_disk *sdkp)
 			  sdkp->zone_info.zone_blocks);
 }
 
-static int sd_zbc_init_disk(struct scsi_disk *sdkp)
-{
-	sdkp->zones_wp_offset = NULL;
-	spin_lock_init(&sdkp->zones_wp_offset_lock);
-	sdkp->rev_wp_offset = NULL;
-	mutex_init(&sdkp->rev_mutex);
-	INIT_WORK(&sdkp->zone_wp_offset_work, sd_zbc_update_wp_offset_workfn);
-	sdkp->zone_wp_update_buf = kzalloc(SD_BUF_SIZE, GFP_KERNEL);
-	if (!sdkp->zone_wp_update_buf)
-		return -ENOMEM;
-
-	return 0;
-}
-
-void sd_zbc_free_zone_info(struct scsi_disk *sdkp)
-{
-	if (!sdkp->zone_wp_update_buf)
-		return;
-
-	/* Serialize against revalidate zones */
-	mutex_lock(&sdkp->rev_mutex);
-
-	kvfree(sdkp->zones_wp_offset);
-	sdkp->zones_wp_offset = NULL;
-	kfree(sdkp->zone_wp_update_buf);
-	sdkp->zone_wp_update_buf = NULL;
-
-	sdkp->early_zone_info = (struct zoned_disk_info){ };
-	sdkp->zone_info = (struct zoned_disk_info){ };
-
-	mutex_unlock(&sdkp->rev_mutex);
-}
-
-static void sd_zbc_revalidate_zones_cb(struct gendisk *disk)
-{
-	struct scsi_disk *sdkp = scsi_disk(disk);
-
-	swap(sdkp->zones_wp_offset, sdkp->rev_wp_offset);
-}
-
 /*
  * Call blk_revalidate_disk_zones() if any of the zoned disk properties have
  * changed that make it necessary to call that function. Called by
@@ -831,18 +547,8 @@ int sd_zbc_revalidate_zones(struct scsi_disk *sdkp)
 	struct request_queue *q = disk->queue;
 	u32 zone_blocks = sdkp->early_zone_info.zone_blocks;
 	unsigned int nr_zones = sdkp->early_zone_info.nr_zones;
-	int ret = 0;
 	unsigned int flags;
-
-	/*
-	 * For all zoned disks, initialize zone append emulation data if not
-	 * already done.
-	 */
-	if (sd_is_zoned(sdkp) && !sdkp->zone_wp_update_buf) {
-		ret = sd_zbc_init_disk(sdkp);
-		if (ret)
-			return ret;
-	}
+	int ret;
 
 	/*
 	 * There is nothing to do for regular disks, including host-aware disks
@@ -851,50 +557,32 @@ int sd_zbc_revalidate_zones(struct scsi_disk *sdkp)
 	if (!blk_queue_is_zoned(q))
 		return 0;
 
-	/*
-	 * Make sure revalidate zones are serialized to ensure exclusive
-	 * updates of the scsi disk data.
-	 */
-	mutex_lock(&sdkp->rev_mutex);
-
 	if (sdkp->zone_info.zone_blocks == zone_blocks &&
 	    sdkp->zone_info.nr_zones == nr_zones &&
 	    disk->nr_zones == nr_zones)
-		goto unlock;
+		return 0;
 
-	flags = memalloc_noio_save();
 	sdkp->zone_info.zone_blocks = zone_blocks;
 	sdkp->zone_info.nr_zones = nr_zones;
-	sdkp->rev_wp_offset = kvcalloc(nr_zones, sizeof(u32), GFP_KERNEL);
-	if (!sdkp->rev_wp_offset) {
-		ret = -ENOMEM;
-		memalloc_noio_restore(flags);
-		goto unlock;
-	}
 
 	blk_queue_chunk_sectors(q,
 			logical_to_sectors(sdkp->device, zone_blocks));
-	blk_queue_max_zone_append_sectors(q,
-			q->limits.max_segments << PAGE_SECTORS_SHIFT);
 
-	ret = blk_revalidate_disk_zones(disk, sd_zbc_revalidate_zones_cb);
+	/* Enable block layer zone append emulation */
+	blk_queue_max_zone_append_sectors(q, 0);
 
+	flags = memalloc_noio_save();
+	ret = blk_revalidate_disk_zones(disk, NULL);
 	memalloc_noio_restore(flags);
-	kvfree(sdkp->rev_wp_offset);
-	sdkp->rev_wp_offset = NULL;
-
 	if (ret) {
 		sdkp->zone_info = (struct zoned_disk_info){ };
 		sdkp->capacity = 0;
-		goto unlock;
+		return ret;
 	}
 
 	sd_zbc_print_zones(sdkp);
 
-unlock:
-	mutex_unlock(&sdkp->rev_mutex);
-
-	return ret;
+	return 0;
 }
 
 /**
@@ -917,10 +605,8 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE])
 	if (!sd_is_zoned(sdkp)) {
 		/*
 		 * Device managed or normal SCSI disk, no special handling
-		 * required. Nevertheless, free the disk zone information in
-		 * case the device type changed.
+		 * required.
 		 */
-		sd_zbc_free_zone_info(sdkp);
 		return 0;
 	}
 
@@ -941,7 +627,6 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE])
 
 	/* The drive satisfies the kernel restrictions: set it up */
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
-	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
 	if (sdkp->zones_max_open == U32_MAX)
 		disk_set_max_open_zones(disk, 0);
 	else
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (14 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 15/30] scsi: sd: " Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:50   ` Christoph Hellwig
  2024-03-29 21:28   ` Bart Van Assche
  2024-03-28  0:43 ` [PATCH v3 17/30] null_blk: " Damien Le Moal
                   ` (14 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

With zone write plugging enabled at the block layer level, any zoned
device can only ever see at most a single write operation per zone.
There is thus no need to request a block scheduler with strict per-zone
sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
feature. Removing this allows using a zoned ublk device with any
scheduler, including "none".

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/ublk_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index bea3d5cf8a83..ab6af84e327c 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -249,8 +249,7 @@ static int ublk_dev_param_zoned_validate(const struct ublk_device *ub)
 static void ublk_dev_param_zoned_apply(struct ublk_device *ub)
 {
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, ub->ub_disk->queue);
-	blk_queue_required_elevator_features(ub->ub_disk->queue,
-					     ELEVATOR_F_ZBD_SEQ_WRITE);
+
 	ub->ub_disk->nr_zones = ublk_get_nr_zones(ub);
 }
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 17/30] null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (15 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:51   ` Christoph Hellwig
                     ` (2 more replies)
  2024-03-28  0:43 ` [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
                   ` (13 subsequent siblings)
  30 siblings, 3 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

With zone write plugging enabled at the block layer level, any zoned
device sees at most a single write operation per zone at any time.
There is thus no need to request a block scheduler with strict per-zone
sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
feature. Removing this allows using a zoned null_blk device with any
scheduler, including "none".

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/null_blk/zoned.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 1689e2584104..8e217f8fadcd 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -165,7 +165,6 @@ int null_register_zoned_dev(struct nullb *nullb)
 	struct request_queue *q = nullb->q;
 
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
-	blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
 	nullb->disk->nr_zones = bdev_nr_zones(nullb->disk->part0);
 	return blk_revalidate_disk_zones(nullb->disk, NULL);
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (16 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 17/30] null_blk: " Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:51   ` Christoph Hellwig
                     ` (2 more replies)
  2024-03-28  0:43 ` [PATCH v3 19/30] null_blk: Introduce fua attribute Damien Le Moal
                   ` (12 subsequent siblings)
  30 siblings, 3 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Add the zone_append_max_sectors configfs attribute and module parameter
to allow configuring the maximum size, in 512B sectors, of zone append
operations. This attribute is meaningful only for zoned null block
devices.

If not specified, the default is unchanged and the zoned device max
append sectors limit is set to the device max sectors limit.
If a non-zero value is used for this attribute (which is the default),
native support for zone append operations is enabled.
Setting a value of 0 disables native zone append operations support;
the block layer zone append emulation is used instead.
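
The mapping from the attribute to the queue limit mirrors the
null_init_zoned_dev() hunk below: the configured value is aligned down
to the device block size and capped to the zone capacity, so that a
value of 0 yields a zero max_zone_append_sectors limit, which enables
the block layer emulation:

	zone_append_max_bytes =
		ALIGN_DOWN(dev->zone_append_max_sectors << SECTOR_SHIFT,
			   dev->blocksize);
	dev->zone_append_max_sectors =
		min(zone_append_max_bytes >> SECTOR_SHIFT,
		    zone_capacity_sects);
	...
	lim->max_zone_append_sectors = dev->zone_append_max_sectors;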

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/null_blk/main.c     | 12 ++++++++++--
 drivers/block/null_blk/null_blk.h |  1 +
 drivers/block/null_blk/zoned.c    | 22 +++++++++++++++++++---
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 71c39bcd872c..cd6519360dbc 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -253,6 +253,11 @@ static unsigned int g_zone_max_active;
 module_param_named(zone_max_active, g_zone_max_active, uint, 0444);
 MODULE_PARM_DESC(zone_max_active, "Maximum number of active zones when block device is zoned. Default: 0 (no limit)");
 
+static int g_zone_append_max_sectors = INT_MAX;
+module_param_named(zone_append_max_sectors, g_zone_append_max_sectors, int, 0444);
+MODULE_PARM_DESC(zone_append_max_sectors,
+		 "Maximum size of a zone append command (in 512B sectors). Specify 0 for zone append emulation");
+
 static struct nullb_device *null_alloc_dev(void);
 static void null_free_dev(struct nullb_device *dev);
 static void null_del_dev(struct nullb *nullb);
@@ -436,6 +441,7 @@ NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
 NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
+NULLB_DEVICE_ATTR(zone_append_max_sectors, uint, NULL);
 NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
 NULLB_DEVICE_ATTR(no_sched, bool, NULL);
 NULLB_DEVICE_ATTR(shared_tags, bool, NULL);
@@ -580,6 +586,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_zone_nr_conv,
 	&nullb_device_attr_zone_max_open,
 	&nullb_device_attr_zone_max_active,
+	&nullb_device_attr_zone_append_max_sectors,
 	&nullb_device_attr_zone_readonly,
 	&nullb_device_attr_zone_offline,
 	&nullb_device_attr_virt_boundary,
@@ -671,7 +678,7 @@ static ssize_t memb_group_features_show(struct config_item *item, char *page)
 			"shared_tags,size,submit_queues,use_per_node_hctx,"
 			"virt_boundary,zoned,zone_capacity,zone_max_active,"
 			"zone_max_open,zone_nr_conv,zone_offline,zone_readonly,"
-			"zone_size\n");
+			"zone_size,zone_append_max_sectors\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -751,6 +758,7 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->zone_nr_conv = g_zone_nr_conv;
 	dev->zone_max_open = g_zone_max_open;
 	dev->zone_max_active = g_zone_max_active;
+	dev->zone_append_max_sectors = g_zone_append_max_sectors;
 	dev->virt_boundary = g_virt_boundary;
 	dev->no_sched = g_no_sched;
 	dev->shared_tags = g_shared_tags;
@@ -1953,7 +1961,7 @@ static int null_add_dev(struct nullb_device *dev)
 
 	rv = add_disk(nullb->disk);
 	if (rv)
-		goto out_ida_free;
+		goto out_cleanup_zone;
 
 	mutex_lock(&lock);
 	list_add_tail(&nullb->list, &nullb_list);
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 477b97746823..a9c5df650ddb 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -82,6 +82,7 @@ struct nullb_device {
 	unsigned int zone_nr_conv; /* number of conventional zones */
 	unsigned int zone_max_open; /* max number of open zones */
 	unsigned int zone_max_active; /* max number of active zones */
+	unsigned int zone_append_max_sectors; /* Max sectors per zone append command */
 	unsigned int submit_queues; /* number of submission queues */
 	unsigned int prev_submit_queues; /* number of submission queues before change */
 	unsigned int poll_queues; /* number of IOPOLL submission queues */
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 8e217f8fadcd..159746b0661c 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -62,6 +62,7 @@ int null_init_zoned_dev(struct nullb_device *dev,
 			struct queue_limits *lim)
 {
 	sector_t dev_capacity_sects, zone_capacity_sects;
+	sector_t zone_append_max_bytes;
 	struct nullb_zone *zone;
 	sector_t sector = 0;
 	unsigned int i;
@@ -103,6 +104,12 @@ int null_init_zoned_dev(struct nullb_device *dev,
 			dev->zone_nr_conv);
 	}
 
+	zone_append_max_bytes =
+		ALIGN_DOWN(dev->zone_append_max_sectors << SECTOR_SHIFT,
+			   dev->blocksize);
+	dev->zone_append_max_sectors =
+		min(zone_append_max_bytes >> SECTOR_SHIFT, zone_capacity_sects);
+
 	/* Max active zones has to be < nbr of seq zones in order to be enforceable */
 	if (dev->zone_max_active >= dev->nr_zones - dev->zone_nr_conv) {
 		dev->zone_max_active = 0;
@@ -154,7 +161,7 @@ int null_init_zoned_dev(struct nullb_device *dev,
 
 	lim->zoned = true;
 	lim->chunk_sectors = dev->zone_size_sects;
-	lim->max_zone_append_sectors = dev->zone_size_sects;
+	lim->max_zone_append_sectors = dev->zone_append_max_sectors;
 	lim->max_open_zones = dev->zone_max_open;
 	lim->max_active_zones = dev->zone_max_active;
 	return 0;
@@ -163,10 +170,16 @@ int null_init_zoned_dev(struct nullb_device *dev,
 int null_register_zoned_dev(struct nullb *nullb)
 {
 	struct request_queue *q = nullb->q;
+	struct gendisk *disk = nullb->disk;
 
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
-	nullb->disk->nr_zones = bdev_nr_zones(nullb->disk->part0);
-	return blk_revalidate_disk_zones(nullb->disk, NULL);
+	disk->nr_zones = bdev_nr_zones(disk->part0);
+
+	pr_info("%s: using %s zone append\n",
+		disk->disk_name,
+		queue_emulates_zone_append(q) ? "emulated" : "native");
+
+	return blk_revalidate_disk_zones(disk, NULL);
 }
 
 void null_free_zoned_dev(struct nullb_device *dev)
@@ -365,6 +378,9 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 
 	trace_nullb_zone_op(cmd, zno, zone->cond);
 
+	if (WARN_ON_ONCE(append && !dev->zone_append_max_sectors))
+		return BLK_STS_IOERR;
+
 	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) {
 		if (append)
 			return BLK_STS_IOERR;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 19/30] null_blk: Introduce fua attribute
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (17 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-03-28  4:52   ` Christoph Hellwig
                     ` (2 more replies)
  2024-03-28  0:43 ` [PATCH v3 20/30] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
                   ` (11 subsequent siblings)
  30 siblings, 3 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Add the fua configfs attribute and module parameter to allow
configuring whether the device supports FUA or not. This attribute has
an effect on the null_blk device only if memory backing is enabled
together with a write cache (cache_size option).

This new attribute allows configuring a null_blk device with a write
cache but without FUA support, which is convenient for testing the
block layer flush machinery.
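
As a reminder of the semantics involved (not new code, a summary of the
existing API): blk_queue_write_cache() takes the write cache and FUA
flags separately, and a device advertising a write cache without FUA
forces the block layer flush machinery to emulate REQ_FUA writes with a
post-flush, which is exactly what this attribute makes testable:

	/* Write cache enabled, FUA support controlled by the new
	 * attribute (see the null_add_dev() hunk below). */
	blk_queue_write_cache(nullb->q, true, dev->fua);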

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/null_blk/main.c     | 12 ++++++++++--
 drivers/block/null_blk/null_blk.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index cd6519360dbc..47e1bbb5b24e 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -225,6 +225,10 @@ static unsigned long g_cache_size;
 module_param_named(cache_size, g_cache_size, ulong, 0444);
 MODULE_PARM_DESC(mbps, "Cache size in MiB for memory-backed device. Default: 0 (none)");
 
+static bool g_fua = true;
+module_param_named(fua, g_fua, bool, 0444);
+MODULE_PARM_DESC(fua, "Enable/disable FUA support when cache_size is used. Default: true");
+
 static unsigned int g_mbps;
 module_param_named(mbps, g_mbps, uint, 0444);
 MODULE_PARM_DESC(mbps, "Limit maximum bandwidth (in MiB/s). Default: 0 (no limit)");
@@ -446,6 +450,7 @@ NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
 NULLB_DEVICE_ATTR(no_sched, bool, NULL);
 NULLB_DEVICE_ATTR(shared_tags, bool, NULL);
 NULLB_DEVICE_ATTR(shared_tag_bitmap, bool, NULL);
+NULLB_DEVICE_ATTR(fua, bool, NULL);
 
 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
 {
@@ -593,6 +598,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_no_sched,
 	&nullb_device_attr_shared_tags,
 	&nullb_device_attr_shared_tag_bitmap,
+	&nullb_device_attr_fua,
 	NULL,
 };
 
@@ -671,7 +677,7 @@ nullb_group_drop_item(struct config_group *group, struct config_item *item)
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
 	return snprintf(page, PAGE_SIZE,
-			"badblocks,blocking,blocksize,cache_size,"
+			"badblocks,blocking,blocksize,cache_size,fua,"
 			"completion_nsec,discard,home_node,hw_queue_depth,"
 			"irqmode,max_sectors,mbps,memory_backed,no_sched,"
 			"poll_queues,power,queue_mode,shared_tag_bitmap,"
@@ -763,6 +769,8 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->no_sched = g_no_sched;
 	dev->shared_tags = g_shared_tags;
 	dev->shared_tag_bitmap = g_shared_tag_bitmap;
+	dev->fua = g_fua;
+
 	return dev;
 }
 
@@ -1920,7 +1928,7 @@ static int null_add_dev(struct nullb_device *dev)
 
 	if (dev->cache_size > 0) {
 		set_bit(NULLB_DEV_FL_CACHE, &nullb->dev->flags);
-		blk_queue_write_cache(nullb->q, true, true);
+		blk_queue_write_cache(nullb->q, true, dev->fua);
 	}
 
 	nullb->q->queuedata = nullb;
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index a9c5df650ddb..3234e6c85eed 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -105,6 +105,7 @@ struct nullb_device {
 	bool no_sched; /* no IO scheduler for the device */
 	bool shared_tags; /* share tag set between devices for blk-mq */
 	bool shared_tag_bitmap; /* use hostwide shared tags */
+	bool fua; /* Support FUA */
 };
 
 struct nullb {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 20/30] nvmet: zns: Do not reference the gendisk conv_zones_bitmap
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (18 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 19/30] null_blk: Introduce fua attribute Damien Le Moal
@ 2024-03-28  0:43 ` Damien Le Moal
  2024-04-02  6:45   ` Chaitanya Kulkarni
  2024-03-28  0:44 ` [PATCH v3 21/30] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
                   ` (10 subsequent siblings)
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:43 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

The gendisk conventional zone bitmap is going away. To check for the
presence of conventional zones on a zoned target device, always use
report zones instead.
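
The validate_conv_zones_cb() callback used below is outside this hunk;
given how it is passed to blkdev_report_zones(), its shape is
essentially the following (a sketch of the existing callback in
drivers/nvme/target/zns.c):

	static int validate_conv_zones_cb(struct blk_zone *z,
					  unsigned int i, void *data)
	{
		/* Reject the device if any conventional zone is found. */
		if (z->type == BLK_ZONE_TYPE_CONVENTIONAL)
			return -EOPNOTSUPP;
		return 0;
	}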

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/target/zns.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index 3148d9f1bde6..0021d06041c1 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -52,14 +52,10 @@ bool nvmet_bdev_zns_enable(struct nvmet_ns *ns)
 	if (get_capacity(bd_disk) & (bdev_zone_sectors(ns->bdev) - 1))
 		return false;
 	/*
-	 * ZNS does not define a conventional zone type. If the underlying
-	 * device has a bitmap set indicating the existence of conventional
-	 * zones, reject the device. Otherwise, use report zones to detect if
-	 * the device has conventional zones.
+	 * ZNS does not define a conventional zone type. Use report zones
+	 * to detect if the device has conventional zones and reject it if
+	 * it does.
 	 */
-	if (ns->bdev->bd_disk->conv_zones_bitmap)
-		return false;
-
 	ret = blkdev_report_zones(ns->bdev, 0, bdev_nr_zones(ns->bdev),
 				  validate_conv_zones_cb, NULL);
 	if (ret < 0)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 21/30] block: Remove BLK_STS_ZONE_RESOURCE
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (19 preceding siblings ...)
  2024-03-28  0:43 ` [PATCH v3 20/30] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-29 21:37   ` Bart Van Assche
  2024-03-28  0:44 ` [PATCH v3 22/30] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
                   ` (9 subsequent siblings)
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

The scsi disk driver zone append emulation was the only user of
BLK_STS_ZONE_RESOURCE. With that emulation code removed,
BLK_STS_ZONE_RESOURCE is now unused. Remove this macro definition and
simplify blk_mq_dispatch_rq_list() where this status code was handled.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c            | 26 --------------------------
 drivers/scsi/scsi_lib.c   |  1 -
 include/linux/blk_types.h | 20 ++++----------------
 3 files changed, 4 insertions(+), 43 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5299a7ed8fec..770d636707dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1923,19 +1923,6 @@ static void blk_mq_handle_dev_resource(struct request *rq,
 	__blk_mq_requeue_request(rq);
 }
 
-static void blk_mq_handle_zone_resource(struct request *rq,
-					struct list_head *zone_list)
-{
-	/*
-	 * If we end up here it is because we cannot dispatch a request to a
-	 * specific zone due to LLD level zone-write locking or other zone
-	 * related resource not being available. In this case, set the request
-	 * aside in zone_list for retrying it later.
-	 */
-	list_add(&rq->queuelist, zone_list);
-	__blk_mq_requeue_request(rq);
-}
-
 enum prep_dispatch {
 	PREP_DISPATCH_OK,
 	PREP_DISPATCH_NO_TAG,
@@ -2021,7 +2008,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 	struct request *rq;
 	int queued;
 	blk_status_t ret = BLK_STS_OK;
-	LIST_HEAD(zone_list);
 	bool needs_resource = false;
 
 	if (list_empty(list))
@@ -2063,23 +2049,11 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 		case BLK_STS_DEV_RESOURCE:
 			blk_mq_handle_dev_resource(rq, list);
 			goto out;
-		case BLK_STS_ZONE_RESOURCE:
-			/*
-			 * Move the request to zone_list and keep going through
-			 * the dispatch list to find more requests the drive can
-			 * accept.
-			 */
-			blk_mq_handle_zone_resource(rq, &zone_list);
-			needs_resource = true;
-			break;
 		default:
 			blk_mq_end_request(rq, ret);
 		}
 	} while (!list_empty(list));
 out:
-	if (!list_empty(&zone_list))
-		list_splice_tail_init(&zone_list, list);
-
 	/* If we didn't flush the entire list, we could have told the driver
 	 * there was more coming, but that turned out to be a lie.
 	 */
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2e28e2360c85..9ca96116bd33 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1870,7 +1870,6 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-	case BLK_STS_ZONE_RESOURCE:
 		if (scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 29b3170431e7..ffe0c112b128 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -136,18 +136,6 @@ typedef u16 blk_short_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
-/*
- * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
- * related resources are unavailable, but the driver can guarantee the queue
- * will be rerun in the future once the resources become available again.
- *
- * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
- * a zone specific resource and IO to a different zone on the same device could
- * still be served. Examples of that are zones that are write-locked, but a read
- * to the same zone could be served.
- */
-#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
-
 /*
  * BLK_STS_ZONE_OPEN_RESOURCE is returned from the driver in the completion
  * path if the device returns a status indicating that too many zone resources
@@ -155,7 +143,7 @@ typedef u16 blk_short_t;
  * after the number of open zones decreases below the device's limits, which is
  * reported in the request_queue's max_open_zones.
  */
-#define BLK_STS_ZONE_OPEN_RESOURCE	((__force blk_status_t)15)
+#define BLK_STS_ZONE_OPEN_RESOURCE	((__force blk_status_t)14)
 
 /*
  * BLK_STS_ZONE_ACTIVE_RESOURCE is returned from the driver in the completion
@@ -164,20 +152,20 @@ typedef u16 blk_short_t;
  * after the number of active zones decreases below the device's limits, which
  * is reported in the request_queue's max_active_zones.
  */
-#define BLK_STS_ZONE_ACTIVE_RESOURCE	((__force blk_status_t)16)
+#define BLK_STS_ZONE_ACTIVE_RESOURCE	((__force blk_status_t)15)
 
 /*
  * BLK_STS_OFFLINE is returned from the driver when the target device is offline
  * or is being taken offline. This could help differentiate the case where a
  * device is intentionally being shut down from a real I/O error.
  */
-#define BLK_STS_OFFLINE		((__force blk_status_t)17)
+#define BLK_STS_OFFLINE		((__force blk_status_t)16)
 
 /*
  * BLK_STS_DURATION_LIMIT is returned from the driver when the target device
  * aborted the command because it exceeded one of its Command Duration Limits.
  */
-#define BLK_STS_DURATION_LIMIT	((__force blk_status_t)18)
+#define BLK_STS_DURATION_LIMIT	((__force blk_status_t)17)
 
 /**
  * blk_path_error - returns true if error may be path related
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 22/30] block: Simplify blk_revalidate_disk_zones() interface
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (20 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 21/30] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-29 21:41   ` Bart Van Assche
  2024-03-28  0:44 ` [PATCH v3 23/30] block: mq-deadline: Remove support for zone write locking Damien Le Moal
                   ` (8 subsequent siblings)
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

The only user of the second argument of blk_revalidate_disk_zones()
was the SCSI disk driver (sd). Now that this driver no longer requires
the update_driver_data argument, remove it to simplify the interface of
blk_revalidate_disk_zones(). Also update the function kernel-doc
comment to be more accurate (i.e. there is no gendisk ->revalidate
method).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-zoned.c              | 16 +++++-----------
 drivers/block/null_blk/zoned.c |  2 +-
 drivers/block/ublk_drv.c       |  2 +-
 drivers/block/virtio_blk.c     |  2 +-
 drivers/md/dm-zone.c           |  2 +-
 drivers/nvme/host/core.c       |  2 +-
 drivers/scsi/sd_zbc.c          |  2 +-
 include/linux/blkdev.h         |  3 +--
 8 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index c41ac1519818..d0549b85f281 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1795,21 +1795,17 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 /**
  * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
  * @disk:	Target disk
- * @update_driver_data:	Callback to update driver data on the frozen disk
  *
- * Helper function for low-level device drivers to check and (re) allocate and
- * initialize a disk request queue zone bitmaps. This functions should normally
- * be called within the disk ->revalidate method for blk-mq based drivers.
+ * Helper function for low-level device drivers to check, (re) allocate and
+ * initialize resources used for managing zoned disks. This function should
+ * normally be called by blk-mq based drivers when a zoned gendisk is probed
+ * and when the zone configuration of the gendisk changes (e.g. after a format).
  * Before calling this function, the device driver must already have set the
  * device zone size (chunk_sector limit) and the max zone append limit.
  * BIO based drivers can also use this function as long as the device queue
  * can be safely frozen.
- * If the @update_driver_data callback function is not NULL, the callback is
- * executed with the device request queue frozen after all zones have been
- * checked.
  */
-int blk_revalidate_disk_zones(struct gendisk *disk,
-			      void (*update_driver_data)(struct gendisk *disk))
+int blk_revalidate_disk_zones(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 	sector_t zone_sectors = q->limits.chunk_sectors;
@@ -1881,8 +1877,6 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 		disk->zone_capacity = args.zone_capacity;
 		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
-		if (update_driver_data)
-			update_driver_data(disk);
 		ret = 0;
 	} else {
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 159746b0661c..34f4d273df38 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -179,7 +179,7 @@ int null_register_zoned_dev(struct nullb *nullb)
 		disk->disk_name,
 		queue_emulates_zone_append(q) ? "emulated" : "native");
 
-	return blk_revalidate_disk_zones(disk, NULL);
+	return blk_revalidate_disk_zones(disk);
 }
 
 void null_free_zoned_dev(struct nullb_device *dev)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index ab6af84e327c..851c78913de2 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -221,7 +221,7 @@ static int ublk_get_nr_zones(const struct ublk_device *ub)
 
 static int ublk_revalidate_disk_zones(struct ublk_device *ub)
 {
-	return blk_revalidate_disk_zones(ub->ub_disk, NULL);
+	return blk_revalidate_disk_zones(ub->ub_disk);
 }
 
 static int ublk_dev_param_zoned_validate(const struct ublk_device *ub)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 42dea7601d87..c1af0a7d56c8 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -1543,7 +1543,7 @@ static int virtblk_probe(struct virtio_device *vdev)
 	 */
 	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && lim.zoned) {
 		blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, vblk->disk->queue);
-		err = blk_revalidate_disk_zones(vblk->disk, NULL);
+		err = blk_revalidate_disk_zones(vblk->disk);
 		if (err)
 			goto out_cleanup_disk;
 	}
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 174fda0a301c..99d27fba01d3 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -169,7 +169,7 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
 	 * our table for dm_blk_report_zones() to use directly.
 	 */
 	md->zone_revalidate_map = t;
-	ret = blk_revalidate_disk_zones(disk, NULL);
+	ret = blk_revalidate_disk_zones(disk);
 	md->zone_revalidate_map = NULL;
 
 	if (ret) {
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 943d72bdd794..c9955ecd1790 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2150,7 +2150,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	blk_mq_unfreeze_queue(ns->disk->queue);
 
 	if (blk_queue_is_zoned(ns->queue)) {
-		ret = blk_revalidate_disk_zones(ns->disk, NULL);
+		ret = blk_revalidate_disk_zones(ns->disk);
 		if (ret && !nvme_first_scan(ns->disk))
 			goto out;
 	}
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d0ead9858954..806036e48abe 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -572,7 +572,7 @@ int sd_zbc_revalidate_zones(struct scsi_disk *sdkp)
 	blk_queue_max_zone_append_sectors(q, 0);
 
 	flags = memalloc_noio_save();
-	ret = blk_revalidate_disk_zones(disk, NULL);
+	ret = blk_revalidate_disk_zones(disk);
 	memalloc_noio_restore(flags);
 	if (ret) {
 		sdkp->zone_info = (struct zoned_disk_info){ };
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 45def924f7c1..d93005ca59e8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -337,8 +337,7 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
 int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
 		sector_t sectors, sector_t nr_sectors);
-int blk_revalidate_disk_zones(struct gendisk *disk,
-		void (*update_driver_data)(struct gendisk *disk));
+int blk_revalidate_disk_zones(struct gendisk *disk);
 
 /*
  * Independent access ranges: struct blk_independent_access_range describes
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 23/30] block: mq-deadline: Remove support for zone write locking
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (21 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 22/30] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-28  4:52   ` Christoph Hellwig
  2024-03-29 21:43   ` Bart Van Assche
  2024-03-28  0:44 ` [PATCH v3 24/30] block: Remove elevator required features Damien Le Moal
                   ` (7 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

With the block layer generic plugging of write operations for zoned
block devices, mq-deadline, or any other scheduler, sees at most one
write operation per zone at any time. There is thus no sequentiality
requirement for these writes and no need to tightly control the
dispatching of write requests using zone write locking.

Remove all the code that implements this control in the mq-deadline
scheduler and remove advertising support for the
ELEVATOR_F_ZBD_SEQ_WRITE elevator feature.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 176 ++------------------------------------------
 1 file changed, 6 insertions(+), 170 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 02a916ba62ee..dce8d746b5bd 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -102,7 +102,6 @@ struct deadline_data {
 	int prio_aging_expire;
 
 	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -157,8 +156,7 @@ deadline_latter_request(struct request *rq)
 }
 
 /*
- * Return the first request for which blk_rq_pos() >= @pos. For zoned devices,
- * return the first request after the start of the zone containing @pos.
+ * Return the first request for which blk_rq_pos() >= @pos.
  */
 static inline struct request *deadline_from_pos(struct dd_per_prio *per_prio,
 				enum dd_data_dir data_dir, sector_t pos)
@@ -170,14 +168,6 @@ static inline struct request *deadline_from_pos(struct dd_per_prio *per_prio,
 		return NULL;
 
 	rq = rb_entry_rq(node);
-	/*
-	 * A zoned write may have been requeued with a starting position that
-	 * is below that of the most recently dispatched request. Hence, for
-	 * zoned writes, start searching from the start of a zone.
-	 */
-	if (blk_rq_is_seq_zoned_write(rq))
-		pos = round_down(pos, rq->q->limits.chunk_sectors);
-
 	while (node) {
 		rq = rb_entry_rq(node);
 		if (blk_rq_pos(rq) >= pos) {
@@ -308,36 +298,6 @@ static inline bool deadline_check_fifo(struct dd_per_prio *per_prio,
 	return time_is_before_eq_jiffies((unsigned long)rq->fifo_time);
 }
 
-/*
- * Check if rq has a sequential request preceding it.
- */
-static bool deadline_is_seq_write(struct deadline_data *dd, struct request *rq)
-{
-	struct request *prev = deadline_earlier_request(rq);
-
-	if (!prev)
-		return false;
-
-	return blk_rq_pos(prev) + blk_rq_sectors(prev) == blk_rq_pos(rq);
-}
-
-/*
- * Skip all write requests that are sequential from @rq, even if we cross
- * a zone boundary.
- */
-static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
-						struct request *rq)
-{
-	sector_t pos = blk_rq_pos(rq);
-
-	do {
-		pos += blk_rq_sectors(rq);
-		rq = deadline_latter_request(rq);
-	} while (rq && blk_rq_pos(rq) == pos);
-
-	return rq;
-}
-
 /*
  * For the specified data direction, return the next request to
  * dispatch using arrival ordered lists.
@@ -346,40 +306,10 @@ static struct request *
 deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		      enum dd_data_dir data_dir)
 {
-	struct request *rq, *rb_rq, *next;
-	unsigned long flags;
-
 	if (list_empty(&per_prio->fifo_list[data_dir]))
 		return NULL;
 
-	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
-		return rq;
-
-	/*
-	 * Look for a write request that can be dispatched, that is one with
-	 * an unlocked target zone. For some HDDs, breaking a sequential
-	 * write stream can lead to lower throughput, so make sure to preserve
-	 * sequential write streams, even if that stream crosses into the next
-	 * zones and these zones are unlocked.
-	 */
-	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry_safe(rq, next, &per_prio->fifo_list[DD_WRITE],
-				 queuelist) {
-		/* Check whether a prior request exists for the same zone. */
-		rb_rq = deadline_from_pos(per_prio, data_dir, blk_rq_pos(rq));
-		if (rb_rq && blk_rq_pos(rb_rq) < blk_rq_pos(rq))
-			rq = rb_rq;
-		if (blk_req_can_dispatch_to_zone(rq) &&
-		    (blk_queue_nonrot(rq->q) ||
-		     !deadline_is_seq_write(dd, rq)))
-			goto out;
-	}
-	rq = NULL;
-out:
-	spin_unlock_irqrestore(&dd->zone_lock, flags);
-
-	return rq;
+	return rq_entry_fifo(per_prio->fifo_list[data_dir].next);
 }
 
 /*
@@ -390,36 +320,8 @@ static struct request *
 deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		      enum dd_data_dir data_dir)
 {
-	struct request *rq;
-	unsigned long flags;
-
-	rq = deadline_from_pos(per_prio, data_dir,
-			       per_prio->latest_pos[data_dir]);
-	if (!rq)
-		return NULL;
-
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
-		return rq;
-
-	/*
-	 * Look for a write request that can be dispatched, that is one with
-	 * an unlocked target zone. For some HDDs, breaking a sequential
-	 * write stream can lead to lower throughput, so make sure to preserve
-	 * sequential write streams, even if that stream crosses into the next
-	 * zones and these zones are unlocked.
-	 */
-	spin_lock_irqsave(&dd->zone_lock, flags);
-	while (rq) {
-		if (blk_req_can_dispatch_to_zone(rq))
-			break;
-		if (blk_queue_nonrot(rq->q))
-			rq = deadline_latter_request(rq);
-		else
-			rq = deadline_skip_seq_writes(dd, rq);
-	}
-	spin_unlock_irqrestore(&dd->zone_lock, flags);
-
-	return rq;
+	return deadline_from_pos(per_prio, data_dir,
+				 per_prio->latest_pos[data_dir]);
 }
 
 /*
@@ -525,10 +427,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 		rq = next_rq;
 	}
 
-	/*
-	 * For a zoned block device, if we only have writes queued and none of
-	 * them can be dispatched, rq will be NULL.
-	 */
 	if (!rq)
 		return NULL;
 
@@ -549,10 +447,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd->per_prio[prio].latest_pos[data_dir] = blk_rq_pos(rq);
 	dd->per_prio[prio].stats.dispatched++;
-	/*
-	 * If the request needs its target zone locked, do it.
-	 */
-	blk_req_zone_write_lock(rq);
 	rq->rq_flags |= RQF_STARTED;
 	return rq;
 }
@@ -722,7 +616,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
 	spin_lock_init(&dd->lock);
-	spin_lock_init(&dd->zone_lock);
 
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
@@ -804,12 +697,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	lockdep_assert_held(&dd->lock);
 
-	/*
-	 * This may be a requeue of a write request that has locked its
-	 * target zone. If it is the case, this releases the zone lock.
-	 */
-	blk_req_zone_write_unlock(rq);
-
 	prio = ioprio_class_to_prio[ioprio_class];
 	per_prio = &dd->per_prio[prio];
 	if (!rq->elv.priv[0]) {
@@ -841,18 +728,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		 */
 		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
 		insert_before = &per_prio->fifo_list[data_dir];
-#ifdef CONFIG_BLK_DEV_ZONED
-		/*
-		 * Insert zoned writes such that requests are sorted by
-		 * position per zone.
-		 */
-		if (blk_rq_is_seq_zoned_write(rq)) {
-			struct request *rq2 = deadline_latter_request(rq);
-
-			if (rq2 && blk_rq_zone_no(rq2) == blk_rq_zone_no(rq))
-				insert_before = &rq2->queuelist;
-		}
-#endif
 		list_add_tail(&rq->queuelist, insert_before);
 	}
 }
@@ -887,33 +762,8 @@ static void dd_prepare_request(struct request *rq)
 	rq->elv.priv[0] = NULL;
 }
 
-static bool dd_has_write_work(struct blk_mq_hw_ctx *hctx)
-{
-	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
-	enum dd_prio p;
-
-	for (p = 0; p <= DD_PRIO_MAX; p++)
-		if (!list_empty_careful(&dd->per_prio[p].fifo_list[DD_WRITE]))
-			return true;
-
-	return false;
-}
-
 /*
  * Callback from inside blk_mq_free_request().
- *
- * For zoned block devices, write unlock the target zone of
- * completed write requests. Do this while holding the zone lock
- * spinlock so that the zone is never unlocked while deadline_fifo_request()
- * or deadline_next_request() are executing. This function is called for
- * all requests, whether or not these requests complete successfully.
- *
- * For a zoned block device, __dd_dispatch_request() may have stopped
- * dispatching requests if all the queued requests are write requests directed
- * at zones that are already locked due to on-going write requests. To ensure
- * write request dispatch progress in this case, mark the queue as needing a
- * restart to ensure that the queue is run again after completion of the
- * request and zones being unlocked.
  */
 static void dd_finish_request(struct request *rq)
 {
@@ -928,21 +778,8 @@ static void dd_finish_request(struct request *rq)
 	 * called dd_insert_requests(). Skip requests that bypassed I/O
 	 * scheduling. See also blk_mq_request_bypass_insert().
 	 */
-	if (!rq->elv.priv[0])
-		return;
-
-	atomic_inc(&per_prio->stats.completed);
-
-	if (blk_queue_is_zoned(q)) {
-		unsigned long flags;
-
-		spin_lock_irqsave(&dd->zone_lock, flags);
-		blk_req_zone_write_unlock(rq);
-		spin_unlock_irqrestore(&dd->zone_lock, flags);
-
-		if (dd_has_write_work(rq->mq_hctx))
-			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
-	}
+	if (rq->elv.priv[0])
+		atomic_inc(&per_prio->stats.completed);
 }
 
 static bool dd_has_work_for_prio(struct dd_per_prio *per_prio)
@@ -1266,7 +1103,6 @@ static struct elevator_type mq_deadline = {
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "mq-deadline",
 	.elevator_alias = "deadline",
-	.elevator_features = ELEVATOR_F_ZBD_SEQ_WRITE,
 	.elevator_owner = THIS_MODULE,
 };
 MODULE_ALIAS("mq-deadline-iosched");
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 24/30] block: Remove elevator required features
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (22 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 23/30] block: mq-deadline: Remove support for zone write locking Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-29 21:44   ` Bart Van Assche
  2024-03-28  0:44 ` [PATCH v3 25/30] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
                   ` (6 subsequent siblings)
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

The only elevator feature ever implemented is ELEVATOR_F_ZBD_SEQ_WRITE
for signaling that a scheduler implements zone write locking to tightly
control the dispatching order of write operations to zoned block
devices. With the removal of zone write locking support in mq-deadline
and the reliance of all block device drivers on the block layer zone
write plugging to control ordering of write operations to zones, the
elevator feature ELEVATOR_F_ZBD_SEQ_WRITE is completely unused.
Remove it, and also remove the now unused code for filtering the
possible schedulers for a block device based on required features.
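
Concretely, with ELEVATOR_F_ZBD_SEQ_WRITE defined as (1U << 0) and no
other feature bit ever added, the removed filter (see the elevator.c
hunk below) could only distinguish two cases: a queue requiring no
feature, for which any scheduler matches, and a zoned device queue
requiring ELEVATOR_F_ZBD_SEQ_WRITE, for which only mq-deadline, the
sole scheduler advertising that feature, matched:

	/* The removed check, which is now always true: */
	return (q->required_elevator_features & e->elevator_features) ==
		q->required_elevator_features;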

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-settings.c   | 16 ---------------
 block/elevator.c       | 46 +++++-------------------------------------
 block/elevator.h       |  1 -
 include/linux/blkdev.h | 10 ---------
 4 files changed, 5 insertions(+), 68 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 82c61d2e4bb8..31d882ac72d4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -1053,22 +1053,6 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
-/**
- * blk_queue_required_elevator_features - Set a queue required elevator features
- * @q:		the request queue for the target device
- * @features:	Required elevator features OR'ed together
- *
- * Tell the block layer that for the device controlled through @q, only the
- * only elevators that can be used are those that implement at least the set of
- * features specified by @features.
- */
-void blk_queue_required_elevator_features(struct request_queue *q,
-					  unsigned int features)
-{
-	q->required_elevator_features = features;
-}
-EXPORT_SYMBOL_GPL(blk_queue_required_elevator_features);
-
 /**
  * blk_queue_can_use_dma_map_merging - configure queue for merging segments.
  * @q:		the request queue for the device
diff --git a/block/elevator.c b/block/elevator.c
index 5ff093cb3cf8..f64ebd726e58 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -83,13 +83,6 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio)
 }
 EXPORT_SYMBOL(elv_bio_merge_ok);
 
-static inline bool elv_support_features(struct request_queue *q,
-		const struct elevator_type *e)
-{
-	return (q->required_elevator_features & e->elevator_features) ==
-		q->required_elevator_features;
-}
-
 /**
  * elevator_match - Check whether @e's name or alias matches @name
  * @e: Scheduler to test
@@ -120,7 +113,7 @@ static struct elevator_type *elevator_find_get(struct request_queue *q,
 
 	spin_lock(&elv_list_lock);
 	e = __elevator_find(name);
-	if (e && (!elv_support_features(q, e) || !elevator_tryget(e)))
+	if (e && !elevator_tryget(e))
 		e = NULL;
 	spin_unlock(&elv_list_lock);
 	return e;
@@ -580,34 +573,8 @@ static struct elevator_type *elevator_get_default(struct request_queue *q)
 }
 
 /*
- * Get the first elevator providing the features required by the request queue.
- * Default to "none" if no matching elevator is found.
- */
-static struct elevator_type *elevator_get_by_features(struct request_queue *q)
-{
-	struct elevator_type *e, *found = NULL;
-
-	spin_lock(&elv_list_lock);
-
-	list_for_each_entry(e, &elv_list, list) {
-		if (elv_support_features(q, e)) {
-			found = e;
-			break;
-		}
-	}
-
-	if (found && !elevator_tryget(found))
-		found = NULL;
-
-	spin_unlock(&elv_list_lock);
-	return found;
-}
-
-/*
- * For a device queue that has no required features, use the default elevator
- * settings. Otherwise, use the first elevator available matching the required
- * features. If no suitable elevator is find or if the chosen elevator
- * initialization fails, fall back to the "none" elevator (no elevator).
+ * Use the default elevator settings. If the chosen elevator initialization
+ * fails, fall back to the "none" elevator (no elevator).
  */
 void elevator_init_mq(struct request_queue *q)
 {
@@ -622,10 +589,7 @@ void elevator_init_mq(struct request_queue *q)
 	if (unlikely(q->elevator))
 		return;
 
-	if (!q->required_elevator_features)
-		e = elevator_get_default(q);
-	else
-		e = elevator_get_by_features(q);
+	e = elevator_get_default(q);
 	if (!e)
 		return;
 
@@ -781,7 +745,7 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
 	list_for_each_entry(e, &elv_list, list) {
 		if (e == cur)
 			len += sprintf(name+len, "[%s] ", e->elevator_name);
-		else if (elv_support_features(q, e))
+		else
 			len += sprintf(name+len, "%s ", e->elevator_name);
 	}
 	spin_unlock(&elv_list_lock);
diff --git a/block/elevator.h b/block/elevator.h
index 7ca3d7b6ed82..e9a050a96e53 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -74,7 +74,6 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	const char *elevator_name;
 	const char *elevator_alias;
-	const unsigned int elevator_features;
 	struct module *elevator_owner;
 #ifdef CONFIG_BLK_DEBUG_FS
 	const struct blk_mq_debugfs_attr *queue_debugfs_attrs;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d93005ca59e8..8798435d22c4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -454,8 +454,6 @@ struct request_queue {
 
 	atomic_t		nr_active_requests_shared_tags;
 
-	unsigned int		required_elevator_features;
-
 	struct blk_mq_tags	*sched_shared_tags;
 
 	struct list_head	icq_list;
@@ -960,14 +958,6 @@ disk_alloc_independent_access_ranges(struct gendisk *disk, int nr_ia_ranges);
 void disk_set_independent_access_ranges(struct gendisk *disk,
 				struct blk_independent_access_ranges *iars);
 
-/*
- * Elevator features for blk_queue_required_elevator_features:
- */
-/* Supports zoned block devices sequential write constraint */
-#define ELEVATOR_F_ZBD_SEQ_WRITE	(1U << 0)
-
-extern void blk_queue_required_elevator_features(struct request_queue *q,
-						 unsigned int features);
 extern bool blk_queue_can_use_dma_map_merging(struct request_queue *q,
 					      struct device *dev);
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 25/30] block: Do not check zone type in blk_check_zone_append()
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (23 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 24/30] block: Remove elevator required features Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-29 21:45   ` Bart Van Assche
  2024-03-28  0:44 ` [PATCH v3 26/30] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
                   ` (5 subsequent siblings)
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Zone append operations are only allowed to target sequential write
required zones. blk_check_zone_append() uses bio_zone_is_seq() to check
this. However, this check is not necessary because:
1) For NVMe ZNS namespace devices, only sequential write required zones
   exist, making the zone type check useless.
2) For null_blk, the driver will fail the request anyway, thus notifying
   the user that a conventional zone was targeted.
3) For all other zoned devices, zone append is now emulated using zone
   write plugging, which checks that a zone append operation does not
   target a conventional zone.

In preparation for the removal of zone write locking and its
conventional zone bitmap (used by bio_zone_is_seq()), remove the
bio_zone_is_seq() call from blk_check_zone_append().
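
The remaining check is thus only the start-of-zone alignment test,
which, given that the block layer requires power-of-2 zone sizes,
reduces to a simple mask (a sketch assuming the current definition of
this helper in include/linux/blkdev.h):

	static inline bool bdev_is_zone_start(struct block_device *bdev,
					      sector_t sector)
	{
		return (sector & (bdev_zone_sectors(bdev) - 1)) == 0;
	}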

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bf28149e104..e1a5344c2257 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -589,8 +589,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector) ||
-	    !bio_zone_is_seq(bio))
+	if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector))
 		return BLK_STS_IOERR;
 
 	/*
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 26/30] block: Move zone related debugfs attribute to blk-zoned.c
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (24 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 25/30] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-28  4:52   ` Christoph Hellwig
  2024-03-29 19:00   ` Bart Van Assche
  2024-03-28  0:44 ` [PATCH v3 27/30] block: Replace zone_wlock debugfs entry with zone_wplugs entry Damien Le Moal
                   ` (4 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

block/blk-mq-debugfs-zoned.c contains a single debugfs attribute
function. Defining this outside of block/blk-zoned.c does not really
help in any way, so move this zone related debugfs attribute to
block/blk-zoned.c and delete block/blk-mq-debugfs-zoned.c.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/Kconfig                |  4 ----
 block/Makefile               |  1 -
 block/blk-mq-debugfs-zoned.c | 22 ----------------------
 block/blk-mq-debugfs.h       |  2 +-
 block/blk-zoned.c            | 20 ++++++++++++++++++++
 5 files changed, 21 insertions(+), 28 deletions(-)
 delete mode 100644 block/blk-mq-debugfs-zoned.c

diff --git a/block/Kconfig b/block/Kconfig
index 1de4682d48cc..9f647149fbee 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -198,10 +198,6 @@ config BLK_DEBUG_FS
 	Unless you are building a kernel for a tiny system, you should
 	say Y here.
 
-config BLK_DEBUG_FS_ZONED
-       bool
-       default BLK_DEBUG_FS && BLK_DEV_ZONED
-
 config BLK_SED_OPAL
 	bool "Logic for interfacing with Opal enabled SEDs"
 	depends on KEYS
diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..168150b9c510 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -33,7 +33,6 @@ obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
-obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
 obj-$(CONFIG_BLK_PM)		+= blk-pm.o
 obj-$(CONFIG_BLK_INLINE_ENCRYPTION)	+= blk-crypto.o blk-crypto-profile.o \
diff --git a/block/blk-mq-debugfs-zoned.c b/block/blk-mq-debugfs-zoned.c
deleted file mode 100644
index a77b099c34b7..000000000000
--- a/block/blk-mq-debugfs-zoned.c
+++ /dev/null
@@ -1,22 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (C) 2017 Western Digital Corporation or its affiliates.
- */
-
-#include <linux/blkdev.h>
-#include "blk-mq-debugfs.h"
-
-int queue_zone_wlock_show(void *data, struct seq_file *m)
-{
-	struct request_queue *q = data;
-	unsigned int i;
-
-	if (!q->disk->seq_zones_wlock)
-		return 0;
-
-	for (i = 0; i < q->disk->nr_zones; i++)
-		if (test_bit(i, q->disk->seq_zones_wlock))
-			seq_printf(m, "%u\n", i);
-
-	return 0;
-}
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 9c7d4b6117d4..3ebe2c29b624 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -83,7 +83,7 @@ static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 }
 #endif
 
-#ifdef CONFIG_BLK_DEBUG_FS_ZONED
+#if defined(CONFIG_BLK_DEV_ZONED) && defined(CONFIG_BLK_DEBUG_FS)
 int queue_zone_wlock_show(void *data, struct seq_file *m);
 #else
 static inline int queue_zone_wlock_show(void *data, struct seq_file *m)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index d0549b85f281..93aa994824d1 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -22,6 +22,7 @@
 
 #include "blk.h"
 #include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
 
 #define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
 static const char *const zone_cond_name[] = {
@@ -1890,3 +1891,22 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
+
+#ifdef CONFIG_BLK_DEBUG_FS
+
+int queue_zone_wlock_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	unsigned int i;
+
+	if (!q->disk->seq_zones_wlock)
+		return 0;
+
+	for (i = 0; i < q->disk->nr_zones; i++)
+		if (test_bit(i, q->disk->seq_zones_wlock))
+			seq_printf(m, "%u\n", i);
+
+	return 0;
+}
+
+#endif
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 27/30] block: Replace zone_wlock debugfs entry with zone_wplugs entry
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (25 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 26/30] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-28  4:53   ` Christoph Hellwig
  2024-03-29 18:54   ` Bart Van Assche
  2024-03-28  0:44 ` [PATCH v3 28/30] block: Remove zone write locking Damien Le Moal
                   ` (3 subsequent siblings)
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

In preparation for completely removing zone write locking, replace the
"zone_wlock" mq-debugfs entry that was listing write-locked zones with
a "zone_wplugs" entry which lists the zones that currently have a
write plug allocated.

The write plug information provided is: the zone number, the zone write
plug flags, the zone write plug reference count, the zone write plug
write pointer offset and the number of BIOs currently waiting for
execution in the zone write plug BIO list.
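
A hypothetical line of output, matching the seq_printf() format used in
the hunk below ("%u 0x%x %u %u %u"), could look like:

	12 0x2 2 1024 3

that is, zone 12, flags 0x2, a reference count of 2, a write pointer
offset of 1024 sectors, and 3 plugged BIOs waiting for execution.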

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq-debugfs.c |  2 +-
 block/blk-mq-debugfs.h |  4 ++--
 block/blk-zoned.c      | 31 ++++++++++++++++++++++++-------
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 94668e72ab09..ca1f2b9422d5 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -160,7 +160,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 	{ "requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops },
 	{ "pm_only", 0600, queue_pm_only_show, NULL },
 	{ "state", 0600, queue_state_show, queue_state_write },
-	{ "zone_wlock", 0400, queue_zone_wlock_show, NULL },
+	{ "zone_wplugs", 0400, queue_zone_wplugs_show, NULL },
 	{ },
 };
 
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 3ebe2c29b624..c80e453e3014 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -84,9 +84,9 @@ static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 #endif
 
 #if defined(CONFIG_BLK_DEV_ZONED) && defined(CONFIG_BLK_DEBUG_FS)
-int queue_zone_wlock_show(void *data, struct seq_file *m);
+int queue_zone_wplugs_show(void *data, struct seq_file *m);
 #else
-static inline int queue_zone_wlock_show(void *data, struct seq_file *m)
+static inline int queue_zone_wplugs_show(void *data, struct seq_file *m)
 {
 	return 0;
 }
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 93aa994824d1..d540688c30f9 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1894,17 +1894,34 @@ EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
 
 #ifdef CONFIG_BLK_DEBUG_FS
 
-int queue_zone_wlock_show(void *data, struct seq_file *m)
+int queue_zone_wplugs_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;
-	unsigned int i;
+	struct gendisk *disk = q->disk;
+	struct blk_zone_wplug *zwplug;
+	unsigned int zwp_wp_offset, zwp_flags;
+	unsigned int zwp_zone_no, zwp_ref;
+	unsigned int zwp_bio_list_size, i;
+	unsigned long flags;
 
-	if (!q->disk->seq_zones_wlock)
-		return 0;
+	rcu_read_lock();
+	for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++) {
+		hlist_for_each_entry_rcu(zwplug,
+					 &disk->zone_wplugs_hash[i], node) {
+			spin_lock_irqsave(&zwplug->lock, flags);
+			zwp_zone_no = zwplug->zone_no;
+			zwp_flags = zwplug->flags;
+			zwp_ref = atomic_read(&zwplug->ref);
+			zwp_wp_offset = zwplug->wp_offset;
+			zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
+			spin_unlock_irqrestore(&zwplug->lock, flags);
 
-	for (i = 0; i < q->disk->nr_zones; i++)
-		if (test_bit(i, q->disk->seq_zones_wlock))
-			seq_printf(m, "%u\n", i);
+			seq_printf(m, "%u 0x%x %u %u %u\n",
+				   zwp_zone_no, zwp_flags, zwp_ref,
+				   zwp_wp_offset, zwp_bio_list_size);
+		}
+	}
+	rcu_read_unlock();
 
 	return 0;
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 28/30] block: Remove zone write locking
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (26 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 27/30] block: Replace zone_wlock debugfs entry with zone_wplugs entry Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-29 18:57   ` Bart Van Assche
  2024-03-28  0:44 ` [PATCH v3 29/30] block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED Damien Le Moal
                   ` (2 subsequent siblings)
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Zone write locking is now unused and replaced with zone write plugging.
Remove all code that was implementing zone write locking, that is, the
various helper functions controlling request zone write locking and
the seq_zones_wlock bitmap attached to the gendisk.

The "zone_wlock" mq-debugfs entry that was listing zones that are
write-locked is replaced with the zone_wplugs entry which lists
the zones that currently have a write plug allocated. The information
provided is: the zone number, the zone write plug flags, the zone write
plug write pointer offset and the number of BIOs currently waiting for
execution in the zone write plug BIO list.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-mq-debugfs.c    |  1 -
 block/blk-zoned.c         | 66 ++-----------------------------
 include/linux/blk-mq.h    | 83 ---------------------------------------
 include/linux/blk_types.h |  1 -
 include/linux/blkdev.h    | 35 ++---------------
 5 files changed, 7 insertions(+), 179 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ca1f2b9422d5..770c0c2b72fa 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -256,7 +256,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(HASHED),
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
-	RQF_NAME(ZONE_WRITE_LOCKED),
 	RQF_NAME(TIMED_OUT),
 	RQF_NAME(RESV),
 };
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index d540688c30f9..df67b8b22f99 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -119,52 +119,6 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
 }
 EXPORT_SYMBOL_GPL(blk_zone_cond_str);
 
-/*
- * Return true if a request is a write requests that needs zone write locking.
- */
-bool blk_req_needs_zone_write_lock(struct request *rq)
-{
-	if (!rq->q->disk->seq_zones_wlock)
-		return false;
-
-	return blk_rq_is_seq_zoned_write(rq);
-}
-EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);
-
-bool blk_req_zone_write_trylock(struct request *rq)
-{
-	unsigned int zno = blk_rq_zone_no(rq);
-
-	if (test_and_set_bit(zno, rq->q->disk->seq_zones_wlock))
-		return false;
-
-	WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
-	rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
-
-	return true;
-}
-EXPORT_SYMBOL_GPL(blk_req_zone_write_trylock);
-
-void __blk_req_zone_write_lock(struct request *rq)
-{
-	if (WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq),
-					  rq->q->disk->seq_zones_wlock)))
-		return;
-
-	WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED);
-	rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
-}
-EXPORT_SYMBOL_GPL(__blk_req_zone_write_lock);
-
-void __blk_req_zone_write_unlock(struct request *rq)
-{
-	rq->rq_flags &= ~RQF_ZONE_WRITE_LOCKED;
-	if (rq->q->disk->seq_zones_wlock)
-		WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq),
-						 rq->q->disk->seq_zones_wlock));
-}
-EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
-
 /**
  * bdev_nr_zones - Get number of zones
  * @bdev:	Target device
@@ -1600,9 +1554,6 @@ void disk_free_zone_resources(struct gendisk *disk)
 
 	kfree(disk->conv_zones_bitmap);
 	disk->conv_zones_bitmap = NULL;
-	kfree(disk->seq_zones_wlock);
-	disk->seq_zones_wlock = NULL;
-
 	disk->zone_capacity = 0;
 	disk->nr_zones = 0;
 }
@@ -1670,7 +1621,6 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
 struct blk_revalidate_zone_args {
 	struct gendisk	*disk;
 	unsigned long	*conv_zones_bitmap;
-	unsigned long	*seq_zones_wlock;
 	unsigned int	nr_zones;
 	unsigned int	zone_capacity;
 	sector_t	sector;
@@ -1746,13 +1696,6 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		set_bit(idx, args->conv_zones_bitmap);
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
-		if (!args->seq_zones_wlock) {
-			args->seq_zones_wlock =
-				blk_alloc_zone_bitmap(q->node, args->nr_zones);
-			if (!args->seq_zones_wlock)
-				return -ENOMEM;
-		}
-
 		/*
 		 * Remember the capacity of the first sequential zone and check
 		 * if it is constant for all zones.
@@ -1794,7 +1737,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 }
 
 /**
- * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps
+ * blk_revalidate_disk_zones - (re)allocate and initialize zone write plugs
  * @disk:	Target disk
  *
  * Helper function for low-level device drivers to check, (re) allocate and
@@ -1868,15 +1811,13 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	}
 
 	/*
-	 * Install the new bitmaps and update nr_zones only once the queue is
-	 * stopped and all I/Os are completed (i.e. a scheduler is not
-	 * referencing the bitmaps).
+	 * Set the new disk zone parameters only once the queue is frozen and
+	 * all I/Os are completed.
 	 */
 	blk_mq_freeze_queue(q);
 	if (ret > 0) {
 		disk->nr_zones = args.nr_zones;
 		disk->zone_capacity = args.zone_capacity;
-		swap(disk->seq_zones_wlock, args.seq_zones_wlock);
 		swap(disk->conv_zones_bitmap, args.conv_zones_bitmap);
 		ret = 0;
 	} else {
@@ -1885,7 +1826,6 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	}
 	blk_mq_unfreeze_queue(q);
 
-	kfree(args.seq_zones_wlock);
 	kfree(args.conv_zones_bitmap);
 
 	return ret;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 60090c8366fb..89ba6b16fe8b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -54,8 +54,6 @@ typedef __u32 __bitwise req_flags_t;
 /* Look at ->special_vec for the actual data payload instead of the
    bio chain. */
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
-/* The per-zone write lock is held for this request */
-#define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* The request completion needs to be signaled to zone write pluging. */
 #define RQF_ZONE_WRITE_PLUGGING	((__force req_flags_t)(1 << 20))
 /* ->timeout has been called, don't expire again */
@@ -1152,85 +1150,4 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 void blk_dump_rq_flags(struct request *, char *);
 
-#ifdef CONFIG_BLK_DEV_ZONED
-static inline unsigned int blk_rq_zone_no(struct request *rq)
-{
-	return disk_zone_no(rq->q->disk, blk_rq_pos(rq));
-}
-
-static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
-{
-	return disk_zone_is_seq(rq->q->disk, blk_rq_pos(rq));
-}
-
-/**
- * blk_rq_is_seq_zoned_write() - Check if @rq requires write serialization.
- * @rq: Request to examine.
- *
- * Note: REQ_OP_ZONE_APPEND requests do not require serialization.
- */
-static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
-{
-	return op_needs_zoned_write_locking(req_op(rq)) &&
-		blk_rq_zone_is_seq(rq);
-}
-
-bool blk_req_needs_zone_write_lock(struct request *rq);
-bool blk_req_zone_write_trylock(struct request *rq);
-void __blk_req_zone_write_lock(struct request *rq);
-void __blk_req_zone_write_unlock(struct request *rq);
-
-static inline void blk_req_zone_write_lock(struct request *rq)
-{
-	if (blk_req_needs_zone_write_lock(rq))
-		__blk_req_zone_write_lock(rq);
-}
-
-static inline void blk_req_zone_write_unlock(struct request *rq)
-{
-	if (rq->rq_flags & RQF_ZONE_WRITE_LOCKED)
-		__blk_req_zone_write_unlock(rq);
-}
-
-static inline bool blk_req_zone_is_write_locked(struct request *rq)
-{
-	return rq->q->disk->seq_zones_wlock &&
-		test_bit(blk_rq_zone_no(rq), rq->q->disk->seq_zones_wlock);
-}
-
-static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
-{
-	if (!blk_req_needs_zone_write_lock(rq))
-		return true;
-	return !blk_req_zone_is_write_locked(rq);
-}
-#else /* CONFIG_BLK_DEV_ZONED */
-static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
-{
-	return false;
-}
-
-static inline bool blk_req_needs_zone_write_lock(struct request *rq)
-{
-	return false;
-}
-
-static inline void blk_req_zone_write_lock(struct request *rq)
-{
-}
-
-static inline void blk_req_zone_write_unlock(struct request *rq)
-{
-}
-static inline bool blk_req_zone_is_write_locked(struct request *rq)
-{
-	return false;
-}
-
-static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
-{
-	return true;
-}
-#endif /* CONFIG_BLK_DEV_ZONED */
-
 #endif /* BLK_MQ_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ffe0c112b128..5751292fee6a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -297,7 +297,6 @@ enum {
 	BIO_QOS_THROTTLED,	/* bio went through rq_qos throttle path */
 	BIO_QOS_MERGED,		/* but went through rq_qos merge path */
 	BIO_REMAPPED,
-	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
 	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
 	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
 	BIO_FLAG_LAST
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8798435d22c4..d2b8d7761269 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -177,23 +177,14 @@ struct gendisk {
 
 #ifdef CONFIG_BLK_DEV_ZONED
 	/*
-	 * Zoned block device information for request dispatch control.
-	 * nr_zones is the total number of zones of the device. This is always
-	 * 0 for regular block devices. conv_zones_bitmap is a bitmap of nr_zones
-	 * bits which indicates if a zone is conventional (bit set) or
-	 * sequential (bit clear). seq_zones_wlock is a bitmap of nr_zones
-	 * bits which indicates if a zone is write locked, that is, if a write
-	 * request targeting the zone was dispatched.
-	 *
-	 * Reads of this information must be protected with blk_queue_enter() /
-	 * blk_queue_exit(). Modifying this information is only allowed while
-	 * no requests are being processed. See also blk_mq_freeze_queue() and
-	 * blk_mq_unfreeze_queue().
+	 * Zoned block device information. Reads of this information must be
+	 * protected with blk_queue_enter() / blk_queue_exit(). Modifying this
+	 * information is only allowed while no requests are being processed.
+	 * See also blk_mq_freeze_queue() and blk_mq_unfreeze_queue().
 	 */
 	unsigned int		nr_zones;
 	unsigned int		zone_capacity;
 	unsigned long		*conv_zones_bitmap;
-	unsigned long		*seq_zones_wlock;
 	unsigned int		zone_wplugs_max_nr;
 	unsigned int            zone_wplugs_hash_bits;
 	spinlock_t              zone_wplugs_lock;
@@ -636,15 +627,6 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 	return sector >> ilog2(disk->queue->limits.chunk_sectors);
 }
 
-static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
-{
-	if (!blk_queue_is_zoned(disk->queue))
-		return false;
-	if (!disk->conv_zones_bitmap)
-		return true;
-	return !test_bit(disk_zone_no(disk, sector), disk->conv_zones_bitmap);
-}
-
 static inline void disk_set_max_open_zones(struct gendisk *disk,
 		unsigned int max_open_zones)
 {
@@ -678,10 +660,6 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
 {
 	return 0;
 }
-static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
-{
-	return false;
-}
 static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
 	return 0;
@@ -871,11 +849,6 @@ static inline bool bio_straddles_zones(struct bio *bio)
 		disk_zone_no(bio->bi_bdev->bd_disk, bio_end_sector(bio) - 1);
 }
 
-static inline unsigned int bio_zone_is_seq(struct bio *bio)
-{
-	return disk_zone_is_seq(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector);
-}
-
 /*
  * Return how much of the chunk is left to be used for I/O at a given offset.
  */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 29/30] block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (27 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 28/30] block: Remove zone write locking Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-28  4:53   ` Christoph Hellwig
  2024-03-28  0:44 ` [PATCH v3 30/30] block: Do not special-case plugging of zone write operations Damien Le Moal
  2024-03-28 23:05 ` (subset) [PATCH v3 00/30] Zone write plugging Jens Axboe
  30 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Now that zoned block device write ordering control no longer depends
on mq-deadline and zone write locking, there is no need to force
select the mq-deadline scheduler when CONFIG_BLK_DEV_ZONED is enabled.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/Kconfig b/block/Kconfig
index 9f647149fbee..d47398ae9824 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -100,7 +100,6 @@ config BLK_DEV_WRITE_MOUNTED
 
 config BLK_DEV_ZONED
 	bool "Zoned block device support"
-	select MQ_IOSCHED_DEADLINE
 	help
 	Block layer zoned block device support. This option enables
 	support for ZAC/ZBC/ZNS host-managed and host-aware zoned block
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* [PATCH v3 30/30] block: Do not special-case plugging of zone write operations
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (28 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 29/30] block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED Damien Le Moal
@ 2024-03-28  0:44 ` Damien Le Moal
  2024-03-28  4:54   ` Christoph Hellwig
  2024-03-29 18:58   ` Bart Van Assche
  2024-03-28 23:05 ` (subset) [PATCH v3 00/30] Zone write plugging Jens Axboe
  30 siblings, 2 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  0:44 UTC (permalink / raw)
  To: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

With the block layer zone write plugging being automatically done for
any write operation to a zone of a zoned block device, a regular request
plug handled through current->plug can only ever see at most a
single write request per zone. In that case, any potential reordering
of the plugged requests will be harmless. We can thus remove the special
casing for write operations to zones and have these requests plugged as
well. This allows removing the function blk_mq_plug() and instead directly
using current->plug where needed.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 block/blk-core.c       |  6 ------
 block/blk-merge.c      |  3 +--
 block/blk-mq.c         |  7 +------
 block/blk-mq.h         | 31 -------------------------------
 include/linux/blkdev.h | 12 ------------
 5 files changed, 2 insertions(+), 57 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e1a5344c2257..47400a4fe851 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -907,12 +907,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags)
 	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		return 0;
 
-	/*
-	 * As the requests that require a zone lock are not plugged in the
-	 * first place, directly accessing the plug instead of using
-	 * blk_mq_plug() should not have any consequences during flushing for
-	 * zoned devices.
-	 */
 	blk_flush_plug(current->plug, false);
 
 	/*
diff --git a/block/blk-merge.c b/block/blk-merge.c
index b96466d2ba94..1a9a424212ee 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -1112,10 +1112,9 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs)
 {
-	struct blk_plug *plug;
+	struct blk_plug *plug = current->plug;
 	struct request *rq;
 
-	plug = blk_mq_plug(bio);
 	if (!plug || rq_list_empty(plug->mq_list))
 		return false;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 770d636707dc..823ce64610e0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1332,11 +1332,6 @@ void blk_execute_rq_nowait(struct request *rq, bool at_head)
 
 	blk_account_io_start(rq);
 
-	/*
-	 * As plugging can be enabled for passthrough requests on a zoned
-	 * device, directly accessing the plug instead of using blk_mq_plug()
-	 * should not have any consequences.
-	 */
 	if (current->plug && !at_head) {
 		blk_add_rq_to_plug(current->plug, rq);
 		return;
@@ -2924,7 +2919,7 @@ static void blk_mq_use_cached_rq(struct request *rq, struct blk_plug *plug,
 void blk_mq_submit_bio(struct bio *bio)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	struct blk_plug *plug = blk_mq_plug(bio);
+	struct blk_plug *plug = current->plug;
 	const int is_sync = op_is_sync(bio->bi_opf);
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int nr_segs = 1;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..260beea8e332 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -365,37 +365,6 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 		qmap->mq_map[cpu] = 0;
 }
 
-/*
- * blk_mq_plug() - Get caller context plug
- * @bio : the bio being submitted by the caller context
- *
- * Plugging, by design, may delay the insertion of BIOs into the elevator in
- * order to increase BIO merging opportunities. This however can cause BIO
- * insertion order to change from the order in which submit_bio() is being
- * executed in the case of multiple contexts concurrently issuing BIOs to a
- * device, even if these context are synchronized to tightly control BIO issuing
- * order. While this is not a problem with regular block devices, this ordering
- * change can cause write BIO failures with zoned block devices as these
- * require sequential write patterns to zones. Prevent this from happening by
- * ignoring the plug state of a BIO issuing context if it is for a zoned block
- * device and the BIO to plug is a write operation.
- *
- * Return current->plug if the bio can be plugged and NULL otherwise
- */
-static inline struct blk_plug *blk_mq_plug( struct bio *bio)
-{
-	/* Zoned block device write operation case: do not plug the BIO */
-	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-	    bdev_op_is_zoned_write(bio->bi_bdev, bio_op(bio)))
-		return NULL;
-
-	/*
-	 * For regular block devices or read operations, use the context plug
-	 * which may be NULL if blk_start_plug() was not executed.
-	 */
-	return current->plug;
-}
-
 /* Free all requests on the list */
 static inline void blk_mq_free_requests(struct list_head *list)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d2b8d7761269..022d78c5136f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1301,18 +1301,6 @@ static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
 	return disk_zone_no(bdev->bd_disk, sec);
 }
 
-/* Whether write serialization is required for @op on zoned devices. */
-static inline bool op_needs_zoned_write_locking(enum req_op op)
-{
-	return op == REQ_OP_WRITE || op == REQ_OP_WRITE_ZEROES;
-}
-
-static inline bool bdev_op_is_zoned_write(struct block_device *bdev,
-					  enum req_op op)
-{
-	return bdev_is_zoned(bdev) && op_needs_zoned_write_locking(op);
-}
-
 static inline sector_t bdev_zone_sectors(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio()
  2024-03-28  0:43 ` [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio() Damien Le Moal
@ 2024-03-28  4:10   ` Christoph Hellwig
  2024-03-28 18:14   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:10 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 03/30] block: Remove req_bio_endio()
  2024-03-28  0:43 ` [PATCH v3 03/30] block: Remove req_bio_endio() Damien Le Moal
@ 2024-03-28  4:13   ` Christoph Hellwig
  2024-03-28 21:28   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:13 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On Thu, Mar 28, 2024 at 09:43:42AM +0900, Damien Le Moal wrote:
> Moving req_bio_endio() code into its only caller, blk_update_request(),
> allows reducing accesses to and tests of bio and request fields. Also,
> given that partial completions of zone append operations are not
> possible and that zone append operations cannot be merged, the update
> of the BIO sector using the request sector for these operations can be
> moved directly before the call to bio_endio().

Note that we should actually be able to support merging zone appends.

And this patch actually moves the assignment to the right place to be
able to deal with it; we'll just need to track each bio's offset from
the start of the request so that we can assign the right bi_sector.
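
Something along these lines, say, with each bio recording its offset
from the start of the request (bio_req_offset() below is a made-up
helper for illustration, not an existing API):

	if (req_op(req) == REQ_OP_ZONE_APPEND)
		bio->bi_iter.bi_sector = req->__sector + bio_req_offset(bio);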

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio()
  2024-03-28  0:43 ` [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio() Damien Le Moal
@ 2024-03-28  4:14   ` Christoph Hellwig
  2024-03-28  5:20     ` Damien Le Moal
  2024-03-28 21:31   ` Bart Van Assche
  1 sibling, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:14 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

> -		if (req_op(req) == REQ_OP_ZONE_APPEND)
> -			bio->bi_iter.bi_sector = req->__sector;
> -
> -		if (!is_flush)
> +		if (!is_flush) {
> +			blk_zone_update_request_bio(req, bio);
>  			bio_endio(bio);
> +		}

As noted by Bart last time around, the blk_zone_update_request_bio
really should stay out of the !is_flush check, as otherwise we'd
break zone appends going through the flush state machine.

Otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  0:43 ` [PATCH v3 09/30] block: Pre-allocate zone write plugs Damien Le Moal
@ 2024-03-28  4:30   ` Christoph Hellwig
  2024-03-28  5:28     ` Damien Le Moal
  2024-03-28 22:25     ` Bart Van Assche
  2024-03-28 22:29   ` Bart Van Assche
  1 sibling, 2 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:30 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

I think this should go into the previous patch, splitting it
out just causes confusion.

> +static void disk_free_zone_wplug(struct blk_zone_wplug *zwplug)
> +{
> +	struct gendisk *disk = zwplug->disk;
> +	unsigned long flags;
> +
> +	if (zwplug->flags & BLK_ZONE_WPLUG_NEEDS_FREE) {
> +		kfree(zwplug);
> +		return;
> +	}
> +
> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> +	list_add_tail(&zwplug->link, &disk->zone_wplugs_free_list);
> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +}
> +
>  static bool disk_insert_zone_wplug(struct gendisk *disk,
>  				   struct blk_zone_wplug *zwplug)
>  {
> @@ -630,18 +665,24 @@ static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
>  	return zwplug;
>  }
>  
> +static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
> +{
> +	struct blk_zone_wplug *zwplug =
> +		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
> +
> +	disk_free_zone_wplug(zwplug);
> +}

Please verify my idea carefully, but I think we can do without the
RCU grace period and thus the rcu_head in struct blk_zone_wplug:

When the zwplug is removed from the hash, we set the
BLK_ZONE_WPLUG_UNHASHED flag under disk->zone_wplugs_lock.  Once
callers see that flag, any lookup that modifies the structure
will fail/wait.  If we then just clear BLK_ZONE_WPLUG_UNHASHED after
the final put in disk_put_zone_wplug when we know the bio list is
empty and no other state is kept (if there might be flags left
we should clear them before), it is perfectly fine for the
zwplug to get reused for another zone at this point.
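
Roughly, on the final put, something like this (a sketch of the idea
only, not necessarily what the series should do verbatim):

	static void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
	{
		if (atomic_dec_and_test(&zwplug->ref)) {
			WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
			/*
			 * Clear all state, including the UNHASHED flag,
			 * so the plug can be reused for another zone.
			 */
			zwplug->flags = 0;
			disk_free_zone_wplug(zwplug);
		}
	}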

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 08/30] block: Introduce zone write plugging
  2024-03-28  0:43 ` [PATCH v3 08/30] block: Introduce zone write plugging Damien Le Moal
@ 2024-03-28  4:48   ` Christoph Hellwig
  2024-03-28 22:20   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:48 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

> +	spin_lock_irqsave(&zwplug->lock, flags);
> +	while (zwplug->wp_offset < zone_capacity &&
> +	       (bio = bio_list_peek(&zwplug->bio_list))) {

Nit: I find this mix of a comparison and a combined assignment / NULL
check just a bit too dense.  I'd move one of them out of the loop
condition and use a separate break to make things a little easier to
digest.

Otherwise this looks good modulo the freelist discussion in the
next patch.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 10/30] block: Fake max open zones limit when there is no limit
  2024-03-28  0:43 ` [PATCH v3 10/30] block: Fake max open zones limit when there is no limit Damien Le Moal
@ 2024-03-28  4:49   ` Christoph Hellwig
  2024-03-29 20:37   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:49 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On Thu, Mar 28, 2024 at 09:43:49AM +0900, Damien Le Moal wrote:
> For a zoned block device that has no limit on the number of open zones
> and no limit on the number of active zones, the zone write plug free
> list is initialized with 128 zone write plugs. For such case, set the
> device max_open_zones queue limit to this value to indicate to the user
> the potential performance penalty that may happen when writing
> simultaneously to more zones than the free list size.

Setting max_open_zones needs to go through the queue limits API and
be done on a frozen queue (probably by merging it into the other
assignments done later in zone revalidation that are done with
the queue frozen).

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 11/30] block: Allow zero value of max_zone_append_sectors queue limit
  2024-03-28  0:43 ` [PATCH v3 11/30] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
@ 2024-03-28  4:49   ` Christoph Hellwig
  2024-03-29 20:50   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:49 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 12/30] block: Implement zone append emulation
  2024-03-28  0:43 ` [PATCH v3 12/30] block: Implement zone append emulation Damien Le Moal
@ 2024-03-28  4:50   ` Christoph Hellwig
  2024-03-29 21:22   ` Bart Van Assche
  2024-03-29 21:26   ` Bart Van Assche
  2 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:50 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 15/30] scsi: sd: Use the block layer zone append emulation
  2024-03-28  0:43 ` [PATCH v3 15/30] scsi: sd: " Damien Le Moal
@ 2024-03-28  4:50   ` Christoph Hellwig
  2024-03-28 10:49   ` Johannes Thumshirn
  2024-03-29 21:27   ` Bart Van Assche
  2 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:50 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-03-28  0:43 ` [PATCH v3 16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
@ 2024-03-28  4:50   ` Christoph Hellwig
  2024-03-29 21:28   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:50 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 17/30] null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-03-28  0:43 ` [PATCH v3 17/30] null_blk: " Damien Le Moal
@ 2024-03-28  4:51   ` Christoph Hellwig
  2024-03-29 21:29   ` Bart Van Assche
  2024-04-02  6:43   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:51 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute
  2024-03-28  0:43 ` [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
@ 2024-03-28  4:51   ` Christoph Hellwig
  2024-03-29 21:35   ` Bart Van Assche
  2024-04-02  6:44   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:51 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 19/30] null_blk: Introduce fua attribute
  2024-03-28  0:43 ` [PATCH v3 19/30] null_blk: Introduce fua attribute Damien Le Moal
@ 2024-03-28  4:52   ` Christoph Hellwig
  2024-03-29 21:36   ` Bart Van Assche
  2024-04-02  6:42   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:52 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 23/30] block: mq-deadline: Remove support for zone write locking
  2024-03-28  0:44 ` [PATCH v3 23/30] block: mq-deadline: Remove support for zone write locking Damien Le Moal
@ 2024-03-28  4:52   ` Christoph Hellwig
  2024-03-29 21:43   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:52 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 26/30] block: Move zone related debugfs attribute to blk-zoned.c
  2024-03-28  0:44 ` [PATCH v3 26/30] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
@ 2024-03-28  4:52   ` Christoph Hellwig
  2024-03-29 19:00   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:52 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 27/30] block: Replace zone_wlock debugfs entry with zone_wplugs entry
  2024-03-28  0:44 ` [PATCH v3 27/30] block: Replace zone_wlock debugfs entry with zone_wplugs entry Damien Le Moal
@ 2024-03-28  4:53   ` Christoph Hellwig
  2024-03-29 18:54   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:53 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 29/30] block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED
  2024-03-28  0:44 ` [PATCH v3 29/30] block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED Damien Le Moal
@ 2024-03-28  4:53   ` Christoph Hellwig
  0 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:53 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 30/30] block: Do not special-case plugging of zone write operations
  2024-03-28  0:44 ` [PATCH v3 30/30] block: Do not special-case plugging of zone write operations Damien Le Moal
@ 2024-03-28  4:54   ` Christoph Hellwig
  2024-03-28  6:43     ` Damien Le Moal
  2024-03-29 18:58   ` Bart Van Assche
  1 sibling, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  4:54 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On Thu, Mar 28, 2024 at 09:44:09AM +0900, Damien Le Moal wrote:
> With the block layer zone write plugging being automatically done for
> any write operation to a zone of a zoned block device, a regular request
> plug handled through current->plug can only ever see at most a
> single write request per zone. In that case, any potential reordering
> of the plugged requests will be harmless. We can thus remove the special
> casing for write operations to zones and have these requests plugged as
> well. This allows removing the function blk_mq_plug() and instead directly
> using current->plug where needed.

This looks good in general:

Reviewed-by: Christoph Hellwig <hch@lst.de>

But IIRC we recently had a report that plugs reorder I/Os, which would
be grave for the extent layout if we haven't fixed that yet, so we
should probably look into it first.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio()
  2024-03-28  4:14   ` Christoph Hellwig
@ 2024-03-28  5:20     ` Damien Le Moal
  2024-03-28  5:42       ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  5:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 13:14, Christoph Hellwig wrote:
>> -		if (req_op(req) == REQ_OP_ZONE_APPEND)
>> -			bio->bi_iter.bi_sector = req->__sector;
>> -
>> -		if (!is_flush)
>> +		if (!is_flush) {
>> +			blk_zone_update_request_bio(req, bio);
>>  			bio_endio(bio);
>> +		}
> 
> As noted by Bart last time around, the blk_zone_update_request_bio
> really should stay out of the !is_flush check, as otherwise we'd
> break zone appends going through the flush state machine.

I do not think that is correct. Because is_flush indicates that RQF_FLUSH_SEQ is
set, that is, we are in the middle of a flush sequence. And flush sequence
progression is handled at the request level, not at the BIO level. Once the sequence
finishes, then and only then should the original BIO endio be done, meaning that
we will then take this path and actually do blk_zone_update_request_bio() and
bio_endio(). So I still think this is correct.

> 
> Otherwise this looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  4:30   ` Christoph Hellwig
@ 2024-03-28  5:28     ` Damien Le Moal
  2024-03-28  5:46       ` Christoph Hellwig
  2024-03-28 22:25     ` Bart Van Assche
  1 sibling, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  5:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 13:30, Christoph Hellwig wrote:
> I think this should go into the previous patch, splitting it
> out just causes confusion.
> 
>> +static void disk_free_zone_wplug(struct blk_zone_wplug *zwplug)
>> +{
>> +	struct gendisk *disk = zwplug->disk;
>> +	unsigned long flags;
>> +
>> +	if (zwplug->flags & BLK_ZONE_WPLUG_NEEDS_FREE) {
>> +		kfree(zwplug);
>> +		return;
>> +	}
>> +
>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>> +	list_add_tail(&zwplug->link, &disk->zone_wplugs_free_list);
>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +}
>> +
>>  static bool disk_insert_zone_wplug(struct gendisk *disk,
>>  				   struct blk_zone_wplug *zwplug)
>>  {
>> @@ -630,18 +665,24 @@ static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
>>  	return zwplug;
>>  }
>>  
>> +static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
>> +{
>> +	struct blk_zone_wplug *zwplug =
>> +		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
>> +
>> +	disk_free_zone_wplug(zwplug);
>> +}
> 
> Please verify my idea carefully, but I think we can do without the
> RCU grace period and thus the rcu_head in struct blk_zone_wplug:
> 
> When the zwplug is removed from the hash, we set the
> BLK_ZONE_WPLUG_UNHASHED flag under disk->zone_wplugs_lock.  Once
> caller see that flag any lookup that modifies the structure
> will fail/wait.  If we then just clear BLK_ZONE_WPLUG_UNHASHED after
> the final put in disk_put_zone_wplug when we know the bio list is
> empty and no other state is kept (if there might be flags left
> we should clear them before), it is perfectly fine for the
> zwplug to get reused for another zone at this point.

That was my thinking initially as well, which is why I did not have the grace
period. However, getting a reference on a plug is not done under
disk->zone_wplugs_lock and is thus racy, albeit with a super tiny time window:
the hash table lookup may "see" a plug that has already been removed and has a
refcount dropped to 0 already. The use of atomic_inc_not_zero() prevents us from
trying to keep using that stale plug, but we *are* referencing it. So without
the grace period, I think there is a risk (again, super tiny window) that we
start reusing the plug, or kfree it while atomic_inc_not_zero() is executing...
Am I overthinking this?
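
For reference, a minimal sketch of the lookup path in question,
simplified from this series (hash_idx() abbreviates the actual hash
index computation):

	static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
							  sector_t sector)
	{
		unsigned int zno = disk_zone_no(disk, sector);
		struct blk_zone_wplug *zwplug;

		rcu_read_lock();
		hlist_for_each_entry_rcu(zwplug,
				&disk->zone_wplugs_hash[hash_idx(disk, zno)],
				node) {
			if (zwplug->zone_no == zno &&
			    atomic_inc_not_zero(&zwplug->ref)) {
				/* Got a valid reference: safe to use. */
				rcu_read_unlock();
				return zwplug;
			}
		}
		rcu_read_unlock();

		return NULL;
	}

Without a grace period before kfree(), the atomic_inc_not_zero() above
can touch freed memory.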

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio()
  2024-03-28  5:20     ` Damien Le Moal
@ 2024-03-28  5:42       ` Christoph Hellwig
  2024-03-28  5:54         ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  5:42 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch

On Thu, Mar 28, 2024 at 02:20:17PM +0900, Damien Le Moal wrote:
> I do not think that is correct. Because is_flush indicates that RQF_FLUSH_SEQ is
> set, that is, we are in the middle of a flush sequence. And flush sequence
> progression is handled at the request level, not at the BIO level. Once the sequence
> finishes, then and only then should the original BIO endio be done, meaning that
> we will then take this path and actually do blk_zone_update_request_bio() and
> bio_endio(). So I still think this is correct.

Well.

blk_flush_restore_request with the previous patch now restores rq->__sector,
and the blk_mq_end_request call following it will propagate it to the
original bio.  But blk_flush_restore_request grabs the sector from
rq->bio->bi_iter.bi_sector, and we need to actually get it there first,
which is done by the data I/O completion that has RQF_FLUSH_SEQ set.
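
That is, the intent is something like this (a sketch, not the exact
patch):

	/*
	 * Update the BIO sector even for RQF_FLUSH_SEQ completions so
	 * that blk_flush_restore_request() can later pick up the
	 * written sector from rq->bio->bi_iter.bi_sector.
	 */
	blk_zone_update_request_bio(req, bio);
	if (!is_flush)
		bio_endio(bio);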

I think we really need a good test case for zone append and FUA,
i.e. we need the append op for zonefs, which should exercise the
fua code if O_SYNC/O_DSYNC is set.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  5:28     ` Damien Le Moal
@ 2024-03-28  5:46       ` Christoph Hellwig
  2024-03-28  6:02         ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  5:46 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch

On Thu, Mar 28, 2024 at 02:28:40PM +0900, Damien Le Moal wrote:
> That was my thinking initially as well, which is why I did not have the
> grace period. However, getting a reference on a plug is not done under
> disk->zone_wplugs_lock and is thus racy, albeit with a super tiny time
> window: the hash table lookup may "see" a plug that has already been
> removed and has a refcount dropped to 0 already. The use of
> atomic_inc_not_zero() prevents us from trying to keep using that stale
> plug, but we *are* referencing it. So without the grace period, I think
> there is a risk (again, super tiny window) that we start reusing the
> plug, or kfree it while atomic_inc_not_zero() is executing...
> Am I overthinking this?

Well.  All the lookups fail (or should fail) when BLK_ZONE_WPLUG_UNHASHED
is set, probably even before trying to grab a reference.  So all
the lookups for a zone that is being torn down will fail.  Now once
the actual final reference is dropped, we'll need to clear
BLK_ZONE_WPLUG_UNHASHED and lookup can happen again.  We'd have a race
window there, but I guess we can plug it by checking for the right
zone number?  If we check it while the plug already got reused, that
will still fail the lookup.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio()
  2024-03-28  5:42       ` Christoph Hellwig
@ 2024-03-28  5:54         ` Damien Le Moal
  0 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  5:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 14:42, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 02:20:17PM +0900, Damien Le Moal wrote:
>> I do not think that is correct. Because is_flush indicates that RQF_FLUSH_SEQ is
>> set, that is, we are in the middle of a flush sequence. And flush sequence
>> progression is handled at the request level, not at the BIO level. Once the sequence
>> finishes, then and only then should the original BIO endio be done, meaning that
>> we will then take this path and actually do blk_zone_update_request_bio() and
>> bio_endio(). So I still think this is correct.
> 
> Well.
> 
> lk_flush_restore_request with the previous patch now restores rq->__sector,
> and the blk_mq_end_request call following it will propagate it to the
> original bio.  But blk_flush_restore_request grabs the sector from
> rq->bio->bi_iter.bi_sector, and we need to actually get it there first,
> which is done by the data I/O completion that has RQF_FLUSH_SEQ set.

Ah. Yes. There is no issue with the current code for regular writes, but we
would get the original sector and not the written sector in the case of zone
append. Will make the change.

> I think we really need a good test case for zone append and FUA,
> i.e. we need the append op for zonefs, which should exercise the
> fua code if O_SYNC/O_DSYNC is set.

Yep. There is currently no issuer of zone append + FUA. But once I get to add
that for zonefs and block dev files, we indeed will have a good testbed.

> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  5:46       ` Christoph Hellwig
@ 2024-03-28  6:02         ` Damien Le Moal
  2024-03-28  6:03           ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  6:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 14:46, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 02:28:40PM +0900, Damien Le Moal wrote:
>> That was my thinking initially as well, which is why I did not have the
>> grace period. However, getting a reference on a plug is not done under
>> disk->zone_wplugs_lock and is thus racy, albeit with a super tiny time
>> window: the hash table lookup may "see" a plug that has already been
>> removed and has a refcount dropped to 0 already. The use of
>> atomic_inc_not_zero() prevents us from trying to keep using that stale
>> plug, but we *are* referencing it. So without the grace period, I think
>> there is a risk (again, super tiny window) that we start reusing the
>> plug, or kfree it while atomic_inc_not_zero() is executing...
>> Am I overthinking this?
> 
> Well.  All the lookups fail (or should fail) when BLK_ZONE_WPLUG_UNHASHED
> is set, probably even before trying to grab a reference.  So all
> the lookups for a zone that is being torn down will fail.  Now once
> the actual final reference is dropped, we'll need to clear
> BLK_ZONE_WPLUG_UNHASHED and lookup can happen again.  We'd have a race
> window there, but I guess we can plug it by checking for the right
> zone number?  If we check it while the plug already got reused, that
> will still fail the lookup.

But that is the problem: "checking the zone number again" means referencing the
plug struct again from the lookup context while the last ref drop context is
freeing the plug. That race can be lost by the lookup context and lead to
referencing freed memory. So your solution would be OK for pre-allocated plugs
only. For kmalloc()-ed plugs, we still need the RCU grace period for free. So we
can only optimize for the pre-allocated plugs...

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  6:02         ` Damien Le Moal
@ 2024-03-28  6:03           ` Christoph Hellwig
  2024-03-28  6:18             ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  6:03 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch

On Thu, Mar 28, 2024 at 03:02:54PM +0900, Damien Le Moal wrote:
> But that is the problem: "checking the zone number again" means referencing the
> plug struct again from the lookup context while the last ref drop context is
> freeing the plug. That race can be lost by the lookup context and lead to
> referencing freed memory. So your solution would be OK for pre-allocated plugs
> only.

Not if it is done in the RCU critical section.

> For kmalloc()-ed plugs, we still need the RCU grace period for free. So we
> can only optimize for the pre-allocated plugs...

Yes, but it can use kfree_rcu which doesn't need the rcu_head in the
zwplug.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  6:03           ` Christoph Hellwig
@ 2024-03-28  6:18             ` Damien Le Moal
  2024-03-28  6:22               ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  6:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 15:03, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 03:02:54PM +0900, Damien Le Moal wrote:
>> But that is the problem: "checking the zone number again" means referencing the
>> plug struct again from the lookup context while the last ref drop context is
>> freeing the plug. That race can be lost by the lookup context and lead to
>> referencing freed memory. So your solution would be OK for pre-allocated plugs
>> only.
> 
> Not if it is done in the RCU critical section.
> 
>> For kmalloc()-ed plugs, we still need the RCU grace period for free. So we
>> can only optimize for the pre-allocated plugs...
> 
> Yes, but it can use kfree_rcu which doesn't need the rcu_head in the
> zwplug.

Unfortunately, it does. kfree_rcu() is a 2 argument macro: address and rcu head
to use... The only thing we could drop from the plug struct is the gendisk pointer.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  6:18             ` Damien Le Moal
@ 2024-03-28  6:22               ` Christoph Hellwig
  2024-03-28  6:33                 ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  6:22 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch

On Thu, Mar 28, 2024 at 03:18:46PM +0900, Damien Le Moal wrote:
> > Yes, but it can use kfree_rcu which doesn't need the rcu_head in the
> > zwplug.
> 
> Unfortunately, it does. kfree_rcu() is a 2 argument macro: address and rcu head
> to use... The only thing we could drop from the plug struct is the gendisk pointer.

It used to have a one-argument version.  Oh, that recently got renamed
to kfree_rcu_mightsleep.  Which seems like a somewhat odd name, but
it's still there and is what I meant.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  6:22               ` Christoph Hellwig
@ 2024-03-28  6:33                 ` Damien Le Moal
  2024-03-28  6:38                   ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  6:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 15:22, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 03:18:46PM +0900, Damien Le Moal wrote:
>>> Yes, but it can use kfree_rcu which doesn't need the rcu_head in the
>>> zwplug.
>>
>> Unfortunately, it does. kfree_rcu() is a 2 argument macro: address and rcu head
>> to use... The only thing we could drop from the plug struct is the gendisk pointer.
> 
> It used to have a one-argument version.  Oh, that recently got renamed
> to kfree_rcu_mightsleep.  Which seems like a somewhat odd name, but
> it's still there and is what I meant.

Ha. OK. I did not see that one. But that means that the plug kfree() can then
block the caller. Given that the last ref drop may happen from BIO completion
context (when the last write making a zone full completes), I do not
think we can use this function...

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  6:33                 ` Damien Le Moal
@ 2024-03-28  6:38                   ` Christoph Hellwig
  2024-03-28  6:51                     ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  6:38 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch

On Thu, Mar 28, 2024 at 03:33:13PM +0900, Damien Le Moal wrote:
> Ha. OK. I did not see that one. But that means that the plug kfree() can then
> block the caller. Given that the last ref drop may happen from BIO completion
> context (when the last write making a zone full completes), I do not
> think we can use this function...

Ah, damn.  So yes, we probably still need the rcu head.  We can kill
the gendisk pointer, though.  Or just stick with the existing version
and don't bother with the micro-optimization, at which point the
mempool might actually be the simpler implementation?

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 30/30] block: Do not special-case plugging of zone write operations
  2024-03-28  4:54   ` Christoph Hellwig
@ 2024-03-28  6:43     ` Damien Le Moal
  2024-03-28  6:51       ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  6:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 13:54, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 09:44:09AM +0900, Damien Le Moal wrote:
>> With the block layer zone write plugging being automatically done for
>> any write operation to a zone of a zoned block device, a regular request
>> plug handled through current->plug can only ever see at most a
>> single write request per zone. In that case, any potential reordering
>> of the plugged requests will be harmless. We can thus remove the special
>> casing for write operations to zones and have these requests plugged as
>> well. This allows removing the function blk_mq_plug() and instead directly
>> using current->plug where needed.
> 
> This looks good in general:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> But IIRC we recently had a report that plugs reorder I/Os, which would
> be grave for the extent layout if we haven't fixed that yet, so we
> should probably look into it first.

That is indeed not great, but irrelevant for zone writes as the regular BIO plug
comes after the zone write plugging. So the regular BIO plug can only see at most
one write request per zone. Even if that order changes, that will not result in
unaligned write errors like before. But the reordering may still be bad for
performance, especially on HDD, so yes, we should definitely look into this.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  6:38                   ` Christoph Hellwig
@ 2024-03-28  6:51                     ` Damien Le Moal
  2024-03-28  6:52                       ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  6:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 15:38, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 03:33:13PM +0900, Damien Le Moal wrote:
>> Ha. OK. I did not see that one. But that means that the plug kfree() can then
>> block the caller. Given that the last ref drop may happen from BIO completion
>> context (when the last write to a zone, making it full, completes), I do not
>> think we can use this function...
> 
> Ah, damn.  So yes, we probably still need the rcu head.  We can kill
> the gendisk pointer, though.  Or just stick with the existing version
> and don't bother with the micro-optimization, at which point the
> mempool might actually be the simpler implementation?

I am all for not micro-optimizing the free path right now.
I am not so sure about the mempool being simpler... And I do see some
improvements in perf for SMR HDDs with the free list. It could be noise, but
it feels a little more solid perf-wise. I have not seen any benefit for faster
devices with the free list though...

If you prefer the mempool, I can go back to using it though, not a big deal.

Another micro-optimization worth looking at later would be to try out the new
low-latency workqueues for the plug BIO work.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 30/30] block: Do not special-case plugging of zone write operations
  2024-03-28  6:43     ` Damien Le Moal
@ 2024-03-28  6:51       ` Christoph Hellwig
  2024-03-28  6:54         ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  6:51 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch

On Thu, Mar 28, 2024 at 03:43:02PM +0900, Damien Le Moal wrote:
> That is indeed not great, but irrelevant for zone writes as the regular BIO plug
> is after the zone write plugging. So the regular BIO plug can only see at most
> one write request per zone. Even if that order changes, that will not result in
> unaligned write errors like before. But the reordering may still be bad for
> performance, especially on HDD, so yes, we should definitely look into this.

Irrelevant is not how I would frame it.  Yes, it will not affect
correctness.  But it will affect performance not just for the write
itself, but also in the long run as it affects the on-disk extent
layout.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  6:51                     ` Damien Le Moal
@ 2024-03-28  6:52                       ` Christoph Hellwig
  2024-03-28  6:53                         ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2024-03-28  6:52 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch

On Thu, Mar 28, 2024 at 03:51:09PM +0900, Damien Le Moal wrote:
> I am all for not micro-optimizing the free path right now.
> I am not so sure about the mempool being simpler... And I do see some
> improvements in perf for SMR HDDs with the free list. It could be noise, but
> it feels a little more solid perf-wise. I have not seen any benefit for faster
> devices with the free list though...
> 
> If you prefer the mempool, I can go back to using it though, not a big deal.

A capped free list + dynamic allocation beyond it is exactly what the
mempool is, so reimplementing seems a bit silly.
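
As a sketch of that point, the mempool API gives those semantics directly
(illustrative sizing and type, not the actual patch code):

	#include <linux/mempool.h>

	/* Keep min_nr plugs reserved; allocate beyond that with kmalloc(). */
	mempool_t *pool = mempool_create_kmalloc_pool(nr_reserved_zwplugs,
					sizeof(struct blk_zone_wplug));

	struct blk_zone_wplug *zwplug = mempool_alloc(pool, GFP_NOIO);
	/* ... use the plug ... */
	mempool_free(zwplug, pool);
	mempool_destroy(pool);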

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  6:52                       ` Christoph Hellwig
@ 2024-03-28  6:53                         ` Damien Le Moal
  0 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  6:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 15:52, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 03:51:09PM +0900, Damien Le Moal wrote:
>> I am all for not micro-optimizing the free path right now.
>> I am not so sure about the mempool being simpler... And I do see some
>> improvements in perf for SMR HDDs with the free list. It could be noise, but
>> it feels a little more solid perf-wise. I have not seen any benefit for faster
>> devices with the free list though...
>>
>> If you prefer the mempool, I can go back to using it though, not a big deal.
> 
> A capped free list + dynamic allocation beyond it is exactly what the
> mempool is, so reimplementing seems a bit silly.

OK. Putting it back then.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 30/30] block: Do not special-case plugging of zone write operations
  2024-03-28  6:51       ` Christoph Hellwig
@ 2024-03-28  6:54         ` Damien Le Moal
  0 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28  6:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/28/24 15:51, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 03:43:02PM +0900, Damien Le Moal wrote:
>> That is indeed not great, but irrelevant for zone writes as the regular BIO plug
>> is after the zone write plugging. So the regular BIO plug can only see at most
>> one write request per zone. Even if that order changes, that will not result in
>> unaligned write errors like before. But the reordering may still be bad for
>> performance, especially on HDD, so yes, we should definitely look into this.
> 
> Irrelevant is not how I would frame it.  Yes, it will not affect
> correctness.  But it will affect performance not just for the write
> itself, but also in the long run as it affects the on-disk extent
> layout.

Agreed.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 15/30] scsi: sd: Use the block layer zone append emulation
  2024-03-28  0:43 ` [PATCH v3 15/30] scsi: sd: " Damien Le Moal
  2024-03-28  4:50   ` Christoph Hellwig
@ 2024-03-28 10:49   ` Johannes Thumshirn
  2024-03-29 21:27   ` Bart Van Assche
  2 siblings, 0 replies; 109+ messages in thread
From: Johannes Thumshirn @ 2024-03-28 10:49 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

So long and thanks for all the fish ;( :D

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio()
  2024-03-28  0:43 ` [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio() Damien Le Moal
  2024-03-28  4:10   ` Christoph Hellwig
@ 2024-03-28 18:14   ` Bart Van Assche
  2024-03-28 22:43     ` Damien Le Moal
  1 sibling, 1 reply; 109+ messages in thread
From: Bart Van Assche @ 2024-03-28 18:14 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 17:43, Damien Le Moal wrote:
> This reverts commit 748dc0b65ec2b4b7b3dbd7befcc4a54fdcac7988.
> 
> Partial zone append completions cannot be supported as there are no
> guarantees that the fragmented data will be written sequentially in the
> same manner as with a full command. Commit 748dc0b65ec2 ("block: fix
> partial zone append completion handling in req_bio_endio()") changed
> req_bio_endio() to always advance a partially failed BIO by its full
> length, but this can lead to incorrect accounting. So revert this
> change and let low-level device drivers handle this case by always
> completely failing zone append operations. With this revert, users will
> still see an IO error for a partially completed zone append BIO.
> 
> Fixes: 748dc0b65ec2 ("block: fix partial zone append completion handling in req_bio_endio()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-mq.c | 9 ++-------
>   1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 555ada922cf0..32afb87efbd0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -770,16 +770,11 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
>   		/*
>   		 * Partial zone append completions cannot be supported as the
>   		 * BIO fragments may end up not being written sequentially.
> -		 * For such case, force the completed nbytes to be equal to
> -		 * the BIO size so that bio_advance() sets the BIO remaining
> -		 * size to 0 and we end up calling bio_endio() before returning.
>   		 */
> -		if (bio->bi_iter.bi_size != nbytes) {
> +		if (bio->bi_iter.bi_size != nbytes)
>   			bio->bi_status = BLK_STS_IOERR;
> -			nbytes = bio->bi_iter.bi_size;
> -		} else {
> +		else
>   			bio->bi_iter.bi_sector = rq->__sector;
> -		}
>   	}
>   
>   	bio_advance(bio, nbytes);

Hi Damien,

This patch looks good to me but shouldn't it be separated from this
patch series? I think that will help this patch to get merged sooner.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 03/30] block: Remove req_bio_endio()
  2024-03-28  0:43 ` [PATCH v3 03/30] block: Remove req_bio_endio() Damien Le Moal
  2024-03-28  4:13   ` Christoph Hellwig
@ 2024-03-28 21:28   ` Bart Van Assche
  2024-03-28 22:42     ` Damien Le Moal
  1 sibling, 1 reply; 109+ messages in thread
From: Bart Van Assche @ 2024-03-28 21:28 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> Moving req_bio_endio() code into its only caller, blk_update_request(),
> allows reducing accesses to and tests of bio and request fields. Also,
> given that partial completions of zone append operations are not
> possible and that zone append operations cannot be merged, the update
> of the BIO sector using the request sector for these operations can be
> moved directly before the call to bio_endio().

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

> -	if (unlikely(error && !blk_rq_is_passthrough(req) &&
> -		     !(req->rq_flags & RQF_QUIET)) &&
> -		     !test_bit(GD_DEAD, &req->q->disk->state)) {
> +	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
> +	    !test_bit(GD_DEAD, &req->q->disk->state)) {

A question that is independent of this patch series: is it a bug or is
it a feature that the GD_DEAD bit test is not marked as "unlikely"?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio()
  2024-03-28  0:43 ` [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio() Damien Le Moal
  2024-03-28  4:14   ` Christoph Hellwig
@ 2024-03-28 21:31   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-28 21:31 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> On completion of a zone append request, the request sector indicates the
> location of the written data. This value must be returned to the user
> through the BIO iter sector. This is done in 2 places: in
> blk_complete_request() and in blk_update_request(). Introduce the inline
> helper function blk_zone_update_request_bio() to avoid duplicating
> this BIO update for zone append requests, and to compile out this
> helper call when CONFIG_BLK_DEV_ZONED is not enabled.
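
For readers following along, the described helper boils down to roughly the
following (a sketch, not the patch's exact code; the CONFIG_BLK_DEV_ZONED
stub is omitted):

	static inline void blk_zone_update_request_bio(struct request *rq,
						       struct bio *bio)
	{
		/*
		 * On zone append completion, the request sector holds the
		 * written location; report it through the BIO iter sector.
		 */
		if (req_op(rq) == REQ_OP_ZONE_APPEND)
			bio->bi_iter.bi_sector = rq->__sector;
	}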

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 05/30] block: Introduce bio_straddles_zones() and bio_offset_from_zone_start()
  2024-03-28  0:43 ` [PATCH v3 05/30] block: Introduce bio_straddles_zones() and bio_offset_from_zone_start() Damien Le Moal
@ 2024-03-28 21:32   ` Bart Van Assche
  0 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-28 21:32 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> Implement the inline helper functions bio_straddles_zones() and
> bio_offset_from_zone_start() to respectively test if a BIO crosses a
> zone boundary (the start sector and last sector belong to different
> zones) and to obtain the offset of a BIO from the start sector of its
> target zone.
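
From that description, a minimal sketch of the first helper (illustrative,
not the patch's exact code):

	static inline bool bio_straddles_zones(struct bio *bio)
	{
		struct gendisk *disk = bio->bi_bdev->bd_disk;

		/* True if the first and last sectors are in different zones. */
		return disk_zone_no(disk, bio->bi_iter.bi_sector) !=
		       disk_zone_no(disk, bio_end_sector(bio) - 1);
	}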

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 07/30] block: Remember zone capacity when revalidating zones
  2024-03-28  0:43 ` [PATCH v3 07/30] block: Remember zone capacity when revalidating zones Damien Le Moal
@ 2024-03-28 21:38   ` Bart Van Assche
  2024-03-28 22:40     ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Bart Van Assche @ 2024-03-28 21:38 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> +		/*
> +		 * Remember the capacity of the first sequential zone and check
> +		 * if it is constant for all zones.
> +		 */
> +		if (!args->zone_capacity)
> +			args->zone_capacity = zone->capacity;
> +		if (zone->capacity != args->zone_capacity) {
> +			pr_warn("%s: Invalid variable zone capacity\n",
> +				disk->disk_name);
> +			return -ENODEV;
> +		}

SMR disks may have a smaller last zone. Does the above code handle such
SMR disks correctly?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 08/30] block: Introduce zone write plugging
  2024-03-28  0:43 ` [PATCH v3 08/30] block: Introduce zone write plugging Damien Le Moal
  2024-03-28  4:48   ` Christoph Hellwig
@ 2024-03-28 22:20   ` Bart Van Assche
  2024-03-28 22:38     ` Damien Le Moal
  1 sibling, 1 reply; 109+ messages in thread
From: Bart Van Assche @ 2024-03-28 22:20 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> +	/*
> +	 * Initialize the zoen write plug with an extra reference so that
> +	 * it is not freed when the zone write plug becomes idle without
> +	 * the zone being full.
> +	 */

zoen -> zone

> +static void disk_zone_wplugs_work(struct work_struct *work)
> +{
> +	struct gendisk *disk =
> +		container_of(work, struct gendisk, zone_wplugs_work);
> +	struct blk_zone_wplug *zwplug;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> +
> +	while (!list_empty(&disk->zone_wplugs_err_list)) {
> +		zwplug = list_first_entry(&disk->zone_wplugs_err_list,
> +					  struct blk_zone_wplug, link);
> +		list_del_init(&zwplug->link);
> +		blk_get_zone_wplug(zwplug);
> +		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +
> +		disk_zone_wplug_handle_error(disk, zwplug);
> +		disk_put_zone_wplug(zwplug);
> +
> +		spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> +	}
> +
> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +}

What is the maximum number of iterations the above while loop can 
perform? I'm wondering whether a cond_resched() call should be added.

> +
> +	/* Wait for the zone write plugs to be RCU-freed. */
> +	rcu_barrier();
> +}

It is not clear to me why the above rcu_barrier() call is necessary. I'm
not aware of any other kernel code where kfree_rcu() is followed by an
rcu_barrier() call.

> +static int disk_alloc_zone_resources(struct gendisk *disk,
> +				     unsigned int max_nr_zwplugs)
> +{
> +	unsigned int i;
> +
> +	disk->zone_wplugs_hash_bits =
> +		min(ilog2(max_nr_zwplugs) + 1, BLK_ZONE_MAX_WPLUG_HASH_BITS);

If max_nr_zwplugs is a power of two, the above formula will result in a
hash table with a size that is twice the size of max_nr_zwplugs.
Shouldn't ilog2(max_nr_zwplugs) + 1 be changed into
ilog2(roundup_pow_of_two(max_nr_zwplugs))?

Otherwise this patch looks fine to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  4:30   ` Christoph Hellwig
  2024-03-28  5:28     ` Damien Le Moal
@ 2024-03-28 22:25     ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-28 22:25 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch

On 3/27/24 9:30 PM, Christoph Hellwig wrote:
> Please verify my idea carefully, but I think we can do without the
> RCU grace period and thus the rcu_head in struct blk_zone_wplug:
> 
> When the zwplug is removed from the hash, we set the
> BLK_ZONE_WPLUG_UNHASHED flag under disk->zone_wplugs_lock.  Once
> caller see that flag any lookup that modifies the structure
> will fail/wait.  If we then just clear BLK_ZONE_WPLUG_UNHASHED after
> the final put in disk_put_zone_wplug when we know the bio list is
> empty and no other state is kept (if there might be flags left
> we should clear them before), it is perfectly fine for the
> zwplug to get reused for another zone at this point.

Hi Christoph,

I don't think this is allowed without grace period between kfree()
and reusing a zwplug because another thread might be iterating over
the hlist while only holding an RCU reader lock.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28  0:43 ` [PATCH v3 09/30] block: Pre-allocate zone write plugs Damien Le Moal
  2024-03-28  4:30   ` Christoph Hellwig
@ 2024-03-28 22:29   ` Bart Van Assche
  2024-03-28 22:33     ` Damien Le Moal
  1 sibling, 1 reply; 109+ messages in thread
From: Bart Van Assche @ 2024-03-28 22:29 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> Allocating zone write plugs using kmalloc() does not guarantee that
> enough write plugs can be allocated to simultaneously write up to
> the maximum number of active zones or maximum number of open zones of
> a zoned block device.
> 
> Avoid any issue with memory allocation by pre-allocating zone write
> plugs up to the disk maximum number of open zones or maximum number of
> active zones, whichever is larger. For zoned devices that do not have
> open or active zone limits, the default 128 is used as the number of
> write plugs to pre-allocate.
> 
> Pre-allocated zone write plugs are managed using a free list. If a
> change to the device zone limits is detected, the disk free list is
> grown if needed when blk_revalidate_disk_zones() is executed.

Is there a way to retry bio submission if allocating a zone write plug
fails? Would that make it possible to drop this patch?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 09/30] block: Pre-allocate zone write plugs
  2024-03-28 22:29   ` Bart Van Assche
@ 2024-03-28 22:33     ` Damien Le Moal
  0 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28 22:33 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/29/24 07:29, Bart Van Assche wrote:
> On 3/27/24 5:43 PM, Damien Le Moal wrote:
>> Allocating zone write plugs using kmalloc() does not guarantee that
>> enough write plugs can be allocated to simultaneously write up to
>> the maximum number of active zones or maximum number of open zones of
>> a zoned block device.
>>
>> Avoid any issue with memory allocation by pre-allocating zone write
>> plugs up to the disk maximum number of open zones or maximum number of
>> active zones, whichever is larger. For zoned devices that do not have
>> open or active zone limits, the default 128 is used as the number of
>> write plugs to pre-allocate.
>>
>> Pre-allocated zone write plugs are managed using a free list. If a
>> change to the device zone limits is detected, the disk free list is
>> grown if needed when blk_revalidate_disk_zones() is executed.
> 
> Is there a way to retry bio submission if allocating a zone write plug
> fails? Would that make it possible to drop this patch?

This patch is merged into the main zone write plugging patch in v4 (about to
post it) and the free list is replaced with a mempool.
Note that for BIOs that do not have REQ_NOWAIT, the allocation is done with
GFP_NOIO. If that fails, the OOM killer is probably already wreaking havoc on
the system...
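
A minimal sketch of that allocation path (illustrative names, not the final
v4 code):

	static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
							  struct bio *bio)
	{
		/* REQ_NOWAIT BIOs must not sleep; everything else may block. */
		gfp_t gfp = (bio->bi_opf & REQ_NOWAIT) ?
			GFP_NOWAIT : GFP_NOIO;

		return mempool_alloc(disk->zone_wplugs_pool, gfp);
	}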

> 
> Thanks,
> 
> Bart.
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 08/30] block: Introduce zone write plugging
  2024-03-28 22:20   ` Bart Van Assche
@ 2024-03-28 22:38     ` Damien Le Moal
  2024-03-29 18:20       ` Bart Van Assche
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28 22:38 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/29/24 07:20, Bart Van Assche wrote:
>> +static void disk_zone_wplugs_work(struct work_struct *work)
>> +{
>> +	struct gendisk *disk =
>> +		container_of(work, struct gendisk, zone_wplugs_work);
>> +	struct blk_zone_wplug *zwplug;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>> +
>> +	while (!list_empty(&disk->zone_wplugs_err_list)) {
>> +		zwplug = list_first_entry(&disk->zone_wplugs_err_list,
>> +					  struct blk_zone_wplug, link);
>> +		list_del_init(&zwplug->link);
>> +		blk_get_zone_wplug(zwplug);
>> +		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +
>> +		disk_zone_wplug_handle_error(disk, zwplug);
>> +		disk_put_zone_wplug(zwplug);
>> +
>> +		spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +}
> 
> What is the maximum number of iterations the above while loop can 
> perform? I'm wondering whether a cond_resched() call should be added.

The loop will go on as long as there are plugged BIOs and they can be merged,
that is, as long as the request does not exceed the queue limits. So this is all
limited naturally by the queue limits.

>> +	/* Wait for the zone write plugs to be RCU-freed. */
>> +	rcu_barrier();
>> +}
> 
> It is not clear to me why the above rcu_barrier() call is necessary? I'm
> not aware of any other kernel code where kfree_rcu() is followed by an
> rcu_barrier() call.

Right after that, the mempool (in v4, free list here) is destroyed. So the
rcu_barrier() is needed to ensure that the grace period is past and that all
plugs are back in the pool/freelist. Without this, I saw problems/crashes when
removing devices.
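
In code form, the teardown ordering that needs the barrier is roughly this
(a sketch assuming the v4 mempool):

	static void disk_free_zone_resources(struct gendisk *disk)
	{
		/*
		 * Plug frees are deferred through an RCU callback;
		 * rcu_barrier() waits for all pending callbacks so that no
		 * plug can be freed after the pool itself is destroyed.
		 */
		rcu_barrier();
		mempool_destroy(disk->zone_wplugs_pool);
	}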

>> +static int disk_alloc_zone_resources(struct gendisk *disk,
>> +				     unsigned int max_nr_zwplugs)
>> +{
>> +	unsigned int i;
>> +
>> +	disk->zone_wplugs_hash_bits =
>> +		min(ilog2(max_nr_zwplugs) + 1, BLK_ZONE_MAX_WPLUG_HASH_BITS);
> 
> If max_nr_zwplugs is a power of two, the above formula will result in a
> hash table with a size that is twice the size of max_nr_zwplugs.

Yes, that is on purpose, to avoid hash collisions as much as possible.

> Shouldn't ilog2(max_nr_zwplugs) + 1 be changed into
> ilog2(roundup_pow_of_two(max_nr_zwplugs))?

I think it should be:

ilog2(roundup_pow_of_two(max_nr_zwplugs)) + 1
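
A worked example of the difference (using the names from the patch):

	/*
	 * max_nr_zwplugs = 128:
	 *   ilog2(128) + 1                     = 8 bits -> 256 buckets
	 *   ilog2(roundup_pow_of_two(128))     = 7 bits -> 128 buckets
	 *   ilog2(roundup_pow_of_two(128)) + 1 = 8 bits -> 256 buckets
	 * Keeping the + 1 gives twice as many buckets as plugs, which is
	 * what reduces the collision rate.
	 */
	disk->zone_wplugs_hash_bits =
		min(ilog2(roundup_pow_of_two(max_nr_zwplugs)) + 1,
		    BLK_ZONE_MAX_WPLUG_HASH_BITS);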

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 07/30] block: Remember zone capacity when revalidating zones
  2024-03-28 21:38   ` Bart Van Assche
@ 2024-03-28 22:40     ` Damien Le Moal
  0 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28 22:40 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/29/24 06:38, Bart Van Assche wrote:
> On 3/27/24 5:43 PM, Damien Le Moal wrote:
>> +		/*
>> +		 * Remember the capacity of the first sequential zone and check
>> +		 * if it is constant for all zones.
>> +		 */
>> +		if (!args->zone_capacity)
>> +			args->zone_capacity = zone->capacity;
>> +		if (zone->capacity != args->zone_capacity) {
>> +			pr_warn("%s: Invalid variable zone capacity\n",
>> +				disk->disk_name);
>> +			return -ENODEV;
>> +		}
> 
> SMR disks may have a smaller last zone. Does the above code handle such
> SMR disks correctly?

SMR drives known to have a smaller last zone have a smaller conventional zone,
not a sequential zone. But good point, I will handle that in the check for
conventional zones.

> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 03/30] block: Remove req_bio_endio()
  2024-03-28 21:28   ` Bart Van Assche
@ 2024-03-28 22:42     ` Damien Le Moal
  0 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28 22:42 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/29/24 06:28, Bart Van Assche wrote:
> On 3/27/24 5:43 PM, Damien Le Moal wrote:
>> Moving req_bio_endio() code into its only caller, blk_update_request(),
>> allows reducing accesses to and tests of bio and request fields. Also,
>> given that partial completions of zone append operations are not
>> possible and that zone append operations cannot be merged, the update
>> of the BIO sector using the request sector for these operations can be
>> moved directly before the call to bio_endio().
> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> 
>> -	if (unlikely(error && !blk_rq_is_passthrough(req) &&
>> -		     !(req->rq_flags & RQF_QUIET)) &&
>> -		     !test_bit(GD_DEAD, &req->q->disk->state)) {
>> +	if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
>> +	    !test_bit(GD_DEAD, &req->q->disk->state)) {
> 
> A question that is independent of this patch series: is it a bug or is
> it a feature that the GD_DEAD bit test is not marked as "unlikely"?

likely/unlikely are optimizations... I guess that bit test could be under
unlikely() as well. Though if we are dealing with a removable media device, this
may not be appropriate, which may be why it is not under unlikely(). Not sure.

> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio()
  2024-03-28 18:14   ` Bart Van Assche
@ 2024-03-28 22:43     ` Damien Le Moal
  2024-03-28 23:03       ` Jens Axboe
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28 22:43 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/29/24 03:14, Bart Van Assche wrote:
> On 3/27/24 17:43, Damien Le Moal wrote:
>> This reverts commit 748dc0b65ec2b4b7b3dbd7befcc4a54fdcac7988.
>>
>> Partial zone append completions cannot be supported as there are no
>> guarantees that the fragmented data will be written sequentially in the
>> same manner as with a full command. Commit 748dc0b65ec2 ("block: fix
>> partial zone append completion handling in req_bio_endio()") changed
>> req_bio_endio() to always advance a partially failed BIO by its full
>> length, but this can lead to incorrect accounting. So revert this
>> change and let low-level device drivers handle this case by always
>> completely failing zone append operations. With this revert, users will
>> still see an IO error for a partially completed zone append BIO.
>>
>> Fixes: 748dc0b65ec2 ("block: fix partial zone append completion handling in req_bio_endio()")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>>   block/blk-mq.c | 9 ++-------
>>   1 file changed, 2 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 555ada922cf0..32afb87efbd0 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -770,16 +770,11 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
>>   		/*
>>   		 * Partial zone append completions cannot be supported as the
>>   		 * BIO fragments may end up not being written sequentially.
>> -		 * For such case, force the completed nbytes to be equal to
>> -		 * the BIO size so that bio_advance() sets the BIO remaining
>> -		 * size to 0 and we end up calling bio_endio() before returning.
>>   		 */
>> -		if (bio->bi_iter.bi_size != nbytes) {
>> +		if (bio->bi_iter.bi_size != nbytes)
>>   			bio->bi_status = BLK_STS_IOERR;
>> -			nbytes = bio->bi_iter.bi_size;
>> -		} else {
>> +		else
>>   			bio->bi_iter.bi_sector = rq->__sector;
>> -		}
>>   	}
>>   
>>   	bio_advance(bio, nbytes);
> 
> Hi Damien,
> 
> This patch looks good to me but shouldn't it be separated from this
> patch series? I think that will help this patch to get merged sooner.

Yes, it can go on its own. But patch 3 depends on it so I kept it in the series.

Jens,

How would you like to proceed with this one?

> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio()
  2024-03-28 22:43     ` Damien Le Moal
@ 2024-03-28 23:03       ` Jens Axboe
  0 siblings, 0 replies; 109+ messages in thread
From: Jens Axboe @ 2024-03-28 23:03 UTC (permalink / raw)
  To: Damien Le Moal, Bart Van Assche, linux-block, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/28/24 4:43 PM, Damien Le Moal wrote:
> On 3/29/24 03:14, Bart Van Assche wrote:
>> On 3/27/24 17:43, Damien Le Moal wrote:
>>> This reverts commit 748dc0b65ec2b4b7b3dbd7befcc4a54fdcac7988.
>>>
>>> Partial zone append completions cannot be supported as there are no
>>> guarantees that the fragmented data will be written sequentially in the
>>> same manner as with a full command. Commit 748dc0b65ec2 ("block: fix
>>> partial zone append completion handling in req_bio_endio()") changed
>>> req_bio_endio() to always advance a partially failed BIO by its full
>>> length, but this can lead to incorrect accounting. So revert this
>>> change and let low-level device drivers handle this case by always
>>> completely failing zone append operations. With this revert, users will
>>> still see an IO error for a partially completed zone append BIO.
>>>
>>> Fixes: 748dc0b65ec2 ("block: fix partial zone append completion handling in req_bio_endio()")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>> ---
>>>   block/blk-mq.c | 9 ++-------
>>>   1 file changed, 2 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 555ada922cf0..32afb87efbd0 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -770,16 +770,11 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
>>>   		/*
>>>   		 * Partial zone append completions cannot be supported as the
>>>   		 * BIO fragments may end up not being written sequentially.
>>> -		 * For such case, force the completed nbytes to be equal to
>>> -		 * the BIO size so that bio_advance() sets the BIO remaining
>>> -		 * size to 0 and we end up calling bio_endio() before returning.
>>>   		 */
>>> -		if (bio->bi_iter.bi_size != nbytes) {
>>> +		if (bio->bi_iter.bi_size != nbytes)
>>>   			bio->bi_status = BLK_STS_IOERR;
>>> -			nbytes = bio->bi_iter.bi_size;
>>> -		} else {
>>> +		else
>>>   			bio->bi_iter.bi_sector = rq->__sector;
>>> -		}
>>>   	}
>>>   
>>>   	bio_advance(bio, nbytes);
>>
>> Hi Damien,
>>
>> This patch looks good to me but shouldn't it be separated from this
>> patch series? I think that will help this patch to get merged sooner.
> 
> Yes, it can go on its own. But patch 3 depends on it so I kept it in the series.
> 
> Jens,
> 
> How would you like to proceed with this one?

I can just pick it up separately.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: (subset) [PATCH v3 00/30] Zone write plugging
  2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
                   ` (29 preceding siblings ...)
  2024-03-28  0:44 ` [PATCH v3 30/30] block: Do not special-case plugging of zone write operations Damien Le Moal
@ 2024-03-28 23:05 ` Jens Axboe
  2024-03-28 23:13   ` Damien Le Moal
  30 siblings, 1 reply; 109+ messages in thread
From: Jens Axboe @ 2024-03-28 23:05 UTC (permalink / raw)
  To: linux-block, linux-scsi, Martin K . Petersen, dm-devel,
	Mike Snitzer, linux-nvme, Keith Busch, Christoph Hellwig,
	Damien Le Moal


On Thu, 28 Mar 2024 09:43:39 +0900, Damien Le Moal wrote:
> The patch series introduces zone write plugging (ZWP) as the new
> mechanism to control the ordering of writes to zoned block devices.
> ZWP replaces zone write locking (ZWL) which is implemented only by
> mq-deadline today. ZWP also allows emulating zone append operations
> using regular writes for zoned devices that do not natively support this
> operation (e.g. SMR HDDs). This patch series removes the scsi disk
> driver and device mapper zone append emulation to use ZWP emulation.
> 
> [...]

Applied, thanks!

[01/30] block: Do not force full zone append completion in req_bio_endio()
        commit: 55251fbdf0146c252ceff146a1bb145546f3e034

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: (subset) [PATCH v3 00/30] Zone write plugging
  2024-03-28 23:05 ` (subset) [PATCH v3 00/30] Zone write plugging Jens Axboe
@ 2024-03-28 23:13   ` Damien Le Moal
  2024-03-28 23:27     ` Jens Axboe
  0 siblings, 1 reply; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28 23:13 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On 3/29/24 08:05, Jens Axboe wrote:
> 
> On Thu, 28 Mar 2024 09:43:39 +0900, Damien Le Moal wrote:
>> The patch series introduces zone write plugging (ZWP) as the new
>> mechanism to control the ordering of writes to zoned block devices.
>> ZWP replaces zone write locking (ZWL) which is implemented only by
>> mq-deadline today. ZWP also allows emulating zone append operations
>> using regular writes for zoned devices that do not natively support this
>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>> driver and device mapper zone append emulation to use ZWP emulation.
>>
>> [...]
> 
> Applied, thanks!
> 
> [01/30] block: Do not force full zone append completion in req_bio_endio()
>         commit: 55251fbdf0146c252ceff146a1bb145546f3e034
> 
> Best regards,

Thanks Jens. Will this also be in your block/for-next branch?
Otherwise, the series will have a conflict in patch 3.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: (subset) [PATCH v3 00/30] Zone write plugging
  2024-03-28 23:13   ` Damien Le Moal
@ 2024-03-28 23:27     ` Jens Axboe
  2024-03-28 23:33       ` Damien Le Moal
  0 siblings, 1 reply; 109+ messages in thread
From: Jens Axboe @ 2024-03-28 23:27 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On 3/28/24 5:13 PM, Damien Le Moal wrote:
> On 3/29/24 08:05, Jens Axboe wrote:
>>
>> On Thu, 28 Mar 2024 09:43:39 +0900, Damien Le Moal wrote:
>>> The patch series introduces zone write plugging (ZWP) as the new
>>> mechanism to control the ordering of writes to zoned block devices.
>>> ZWP replaces zone write locking (ZWL) which is implemented only by
>>> mq-deadline today. ZWP also allows emulating zone append operations
>>> using regular writes for zoned devices that do not natively support this
>>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>>> driver and device mapper zone append emulation to use ZWP emulation.
>>>
>>> [...]
>>
>> Applied, thanks!
>>
>> [01/30] block: Do not force full zone append completion in req_bio_endio()
>>         commit: 55251fbdf0146c252ceff146a1bb145546f3e034
>>
>> Best regards,
> 
> Thanks Jens. Will this also be in your block/for-next branch?
> Otherwise, the series will have a conflict in patch 3.

It'll go into 6.9, and I'll rebase the for-6.10/block branch once -rc2
is out. That should take care of the dependency.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: (subset) [PATCH v3 00/30] Zone write plugging
  2024-03-28 23:27     ` Jens Axboe
@ 2024-03-28 23:33       ` Damien Le Moal
  0 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-28 23:33 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On 3/29/24 08:27, Jens Axboe wrote:
> On 3/28/24 5:13 PM, Damien Le Moal wrote:
>> On 3/29/24 08:05, Jens Axboe wrote:
>>>
>>> On Thu, 28 Mar 2024 09:43:39 +0900, Damien Le Moal wrote:
>>>> The patch series introduces zone write plugging (ZWP) as the new
>>>> mechanism to control the ordering of writes to zoned block devices.
>>>> ZWP replaces zone write locking (ZWL) which is implemented only by
>>>> mq-deadline today. ZWP also allows emulating zone append operations
>>>> using regular writes for zoned devices that do not natively support this
>>>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>>>> driver and device mapper zone append emulation to use ZWP emulation.
>>>>
>>>> [...]
>>>
>>> Applied, thanks!
>>>
>>> [01/30] block: Do not force full zone append completion in req_bio_endio()
>>>         commit: 55251fbdf0146c252ceff146a1bb145546f3e034
>>>
>>> Best regards,
>>
>> Thanks Jens. Will this also be in your block/for-next branch?
>> Otherwise, the series will have a conflict in patch 3.
> 
> It'll go into 6.9, and I'll rebase the for-6.10/block branch once -rc2
> is out. That should take care of the dependency.

OK. Thanks. I will wait for next week to send out v4 then.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 08/30] block: Introduce zone write plugging
  2024-03-28 22:38     ` Damien Le Moal
@ 2024-03-29 18:20       ` Bart Van Assche
  0 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 18:20 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/28/24 3:38 PM, Damien Le Moal wrote:
> On 3/29/24 07:20, Bart Van Assche wrote:
>>> +	/* Wait for the zone write plugs to be RCU-freed. */
>>> +	rcu_barrier();
>>> +}
>>
>> It is not clear to me why the above rcu_barrier() call is necessary. I'm
>> not aware of any other kernel code where kfree_rcu() is followed by an
>> rcu_barrier() call.
> 
> Right after that, the mempool (in v4, free list here) is destroyed. So the
> rcu_barrier() is needed to ensure that the grace period is past and that all
> plugs are back in the pool/freelist. Without this, I saw problems/crashes when
> removing devices.

This patch would be easier to read if the rcu_barrier() call would be
moved out of disk_free_zone_wplugs() and into its caller.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 27/30] block: Replace zone_wlock debugfs entry with zone_wplugs entry
  2024-03-28  0:44 ` [PATCH v3 27/30] block: Replace zone_wlock debugfs entry with zone_wplugs entry Damien Le Moal
  2024-03-28  4:53   ` Christoph Hellwig
@ 2024-03-29 18:54   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 18:54 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:44 PM, Damien Le Moal wrote:
> In preparation to completely remove zone write locking, replace the
> "zone_wlock" mq-debugfs entry that was listing zones that are
> write-locked with the zone_wplugs entry which lists the zones that
> currently have a write plug allocated.
> 
> The write plug information provided is: the zone number, the zone write
> plug flags, the zone write plug write pointer offset and the number of
> BIOs currently waiting for execution in the zone write plug BIO list.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 28/30] block: Remove zone write locking
  2024-03-28  0:44 ` [PATCH v3 28/30] block: Remove zone write locking Damien Le Moal
@ 2024-03-29 18:57   ` Bart Van Assche
  0 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 18:57 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:44 PM, Damien Le Moal wrote:
> Zone write locking is now unused and replaced with zone write plugging.
> Remove all code that was implementing zone write locking, that is, the
> various helper functions controlling request zone write locking and
> the gendisk attached zone bitmaps.
> 
> The "zone_wlock" mq-debugfs entry that was listing zones that are
> write-locked is replaced with the zone_wplugs entry which lists
> the zones that currently have a write plug allocated. The information
> provided is: the zone number, the zone write plug flags, the zone write
> plug write pointer offset and the number of BIOs currently waiting for
> execution in the zone write plug BIO list.

Shouldn't the second paragraph of the description go into patch 27/30?

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 30/30] block: Do not special-case plugging of zone write operations
  2024-03-28  0:44 ` [PATCH v3 30/30] block: Do not special-case plugging of zone write operations Damien Le Moal
  2024-03-28  4:54   ` Christoph Hellwig
@ 2024-03-29 18:58   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 18:58 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:44 PM, Damien Le Moal wrote:
> With the block layer zone write plugging being automatically done for
> any write operation to a zone of a zoned block device, a regular request
> plugging handled through current->plug can only ever see at most a
> single write request per zone. In such case, any potential reordering
> of the plugged requests will be harmless. We can thus remove the special
> casing for write operations to zones and have these requests plugged as
> well. This allows removing the function blk_mq_plug and instead directly
> using current->plug where needed.
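
For context, the removed special case was roughly the following, so all
writes now simply go through current->plug (a sketch of the old helper):

	static inline struct blk_plug *blk_mq_plug(struct bio *bio)
	{
		/* Writes to zoned devices bypassed the per-task plug. */
		if (bdev_is_zoned(bio->bi_bdev) && op_is_write(bio_op(bio)))
			return NULL;

		return current->plug;
	}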

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 26/30] block: Move zone related debugfs attribute to blk-zoned.c
  2024-03-28  0:44 ` [PATCH v3 26/30] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
  2024-03-28  4:52   ` Christoph Hellwig
@ 2024-03-29 19:00   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 19:00 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:44 PM, Damien Le Moal wrote:
> block/blk-mq-debugfs-zone.c contains a single debugfs attribute
> function. Defining this outside of block/blk-zoned.c does not really
> help in any way, so move this zone related debugfs attribute to
> block/blk-zoned.c and delete block/blk-mq-debugfs-zone.c.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 10/30] block: Fake max open zones limit when there is no limit
  2024-03-28  0:43 ` [PATCH v3 10/30] block: Fake max open zones limit when there is no limit Damien Le Moal
  2024-03-28  4:49   ` Christoph Hellwig
@ 2024-03-29 20:37   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 20:37 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> +	 * zone write plugsso that the user is aware of the potential
                       ^^^^^^^

A space is missing.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 11/30] block: Allow zero value of max_zone_append_sectors queue limit
  2024-03-28  0:43 ` [PATCH v3 11/30] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
  2024-03-28  4:49   ` Christoph Hellwig
@ 2024-03-29 20:50   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 20:50 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> In preparation for adding a generic zone append emulation using zone
> write plugging, allow device drivers supporting zoned block devices to
> set the max_zone_append_sectors queue limit of a device to 0 to
> indicate the lack of native support for zone append operations and that
> the block layer should emulate these operations using regular write
> operations.
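
In other words, a driver without native zone append support would do roughly
the following and let the block layer take over (a sketch of the intended
driver-side usage):

	/* 0 now means "no native zone append support, please emulate". */
	blk_queue_max_zone_append_sectors(q, 0);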

Reviewed-by: Bart Van Assche <bvanassche@acm.org>



^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 12/30] block: Implement zone append emulation
  2024-03-28  0:43 ` [PATCH v3 12/30] block: Implement zone append emulation Damien Le Moal
  2024-03-28  4:50   ` Christoph Hellwig
@ 2024-03-29 21:22   ` Bart Van Assche
  2024-03-29 21:26   ` Bart Van Assche
  2 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:22 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> Given that zone write plugging manages all writes to zones of a zoned
> block device and track the write pointer position of all zones,
> emulating zone append operations using regular writes can be
> implemented generically, without relying on the underlying device driver
> to implement such emulation. This is needed for devices that do not
> natively support the zone append command, e.g. SMR hard-disks.
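
The heart of such emulation can be sketched as follows (illustrative names;
the plug is assumed to track the zone write pointer offset):

	static void blk_zone_wplug_prepare_zone_append(struct blk_zone_wplug *zwplug,
						       struct bio *bio)
	{
		/*
		 * Issue the append as a regular write at the write pointer.
		 * On completion, the written sector is reported back through
		 * the BIO iter sector, as a native zone append would do.
		 */
		bio->bi_opf &= ~REQ_OP_MASK;
		bio->bi_opf |= REQ_OP_WRITE | REQ_NOMERGE;
		bio->bi_iter.bi_sector += zwplug->wp_offset;
	}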

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 12/30] block: Implement zone append emulation
  2024-03-28  0:43 ` [PATCH v3 12/30] block: Implement zone append emulation Damien Le Moal
  2024-03-28  4:50   ` Christoph Hellwig
  2024-03-29 21:22   ` Bart Van Assche
@ 2024-03-29 21:26   ` Bart Van Assche
  2 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:26 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> Given that zone write plugging manages all writes to zones of a zoned
> block device and track the write pointer position of all zones,

track -> tracks


^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 15/30] scsi: sd: Use the block layer zone append emulation
  2024-03-28  0:43 ` [PATCH v3 15/30] scsi: sd: " Damien Le Moal
  2024-03-28  4:50   ` Christoph Hellwig
  2024-03-28 10:49   ` Johannes Thumshirn
@ 2024-03-29 21:27   ` Bart Van Assche
  2 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:27 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-03-28  0:43 ` [PATCH v3 16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
  2024-03-28  4:50   ` Christoph Hellwig
@ 2024-03-29 21:28   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:28 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 17/30] null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-03-28  0:43 ` [PATCH v3 17/30] null_blk: " Damien Le Moal
  2024-03-28  4:51   ` Christoph Hellwig
@ 2024-03-29 21:29   ` Bart Van Assche
  2024-04-02  6:43   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:29 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> With zone write plugging enabled at the block layer level, any zone

zone -> zoned

Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute
  2024-03-28  0:43 ` [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
  2024-03-28  4:51   ` Christoph Hellwig
@ 2024-03-29 21:35   ` Bart Van Assche
  2024-03-30  0:33     ` Damien Le Moal
  2024-04-02  6:44   ` Chaitanya Kulkarni
  2 siblings, 1 reply; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:35 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> @@ -1953,7 +1961,7 @@ static int null_add_dev(struct nullb_device *dev)
>   
>   	rv = add_disk(nullb->disk);
>   	if (rv)
> -		goto out_ida_free;
> +		goto out_cleanup_zone;
>   

The above change is unrelated to the introduction of
zone_append_max_sectors and hence should not be in this patch.
Additionally, the order of cleanup actions in the error path seems
wrong to me. null_init_zoned_dev() is called before blk_mq_alloc_disk().
Hence, the put_disk() call in the error path should occur before the
null_free_zoned_dev() call.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 19/30] null_blk: Introduce fua attribute
  2024-03-28  0:43 ` [PATCH v3 19/30] null_blk: Introduce fua attribute Damien Le Moal
  2024-03-28  4:52   ` Christoph Hellwig
@ 2024-03-29 21:36   ` Bart Van Assche
  2024-04-02  6:42   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:36 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 5:43 PM, Damien Le Moal wrote:
> Add the fua configfs attribute and module parameter to allow
> configuring if the device supports FUA or not. Using this attribute
> has an effect on the null_blk device only if memory backing is enabled
> together with a write cache (cache_size option).
> 
> This new attribute allows configuring a null_blk device with a write
> cache but without FUA support. This is convenient to test the block
> layer flush machinery.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 21/30] block: Remove BLK_STS_ZONE_RESOURCE
  2024-03-28  0:44 ` [PATCH v3 21/30] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
@ 2024-03-29 21:37   ` Bart Van Assche
  0 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:37 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 22/30] block: Simplify blk_revalidate_disk_zones() interface
  2024-03-28  0:44 ` [PATCH v3 22/30] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
@ 2024-03-29 21:41   ` Bart Van Assche
  0 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:41 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 23/30] block: mq-deadline: Remove support for zone write locking
  2024-03-28  0:44 ` [PATCH v3 23/30] block: mq-deadline: Remove support for zone write locking Damien Le Moal
  2024-03-28  4:52   ` Christoph Hellwig
@ 2024-03-29 21:43   ` Bart Van Assche
  1 sibling, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:43 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 24/30] block: Remove elevator required features
  2024-03-28  0:44 ` [PATCH v3 24/30] block: Remove elevator required features Damien Le Moal
@ 2024-03-29 21:44   ` Bart Van Assche
  0 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:44 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 109+ messages in thread

* Re: [PATCH v3 25/30] block: Do not check zone type in blk_check_zone_append()
  2024-03-28  0:44 ` [PATCH v3 25/30] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
@ 2024-03-29 21:45   ` Bart Van Assche
  0 siblings, 0 replies; 109+ messages in thread
From: Bart Van Assche @ 2024-03-29 21:45 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

* Re: [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute
  2024-03-29 21:35   ` Bart Van Assche
@ 2024-03-30  0:33     ` Damien Le Moal
  0 siblings, 0 replies; 109+ messages in thread
From: Damien Le Moal @ 2024-03-30  0:33 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/30/24 06:35, Bart Van Assche wrote:
> On 3/27/24 5:43 PM, Damien Le Moal wrote:
>> @@ -1953,7 +1961,7 @@ static int null_add_dev(struct nullb_device *dev)
>>   
>>   	rv = add_disk(nullb->disk);
>>   	if (rv)
>> -		goto out_ida_free;
>> +		goto out_cleanup_zone;
>>   
> 
> The above change is unrelated to the introduction of
> zone_append_max_sectors and hence should not be in this patch.

Good catch. That is a bug in this patch. I removed this change as it is incorrect.

> Additionally, the order of cleanup actions in the error path seems
> wrong to me. null_init_zoned_dev() is called before blk_mq_alloc_disk().
> Hence, the put_disk() call in the error path should occur before the
> null_free_zoned_dev() call.

That is a separate issue from this patch (and from the series), but I will
send a fix patch for it.
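
For the archives, the unwind rule in question, as a generic
illustration (plain userspace C with hypothetical helpers, not the
actual null_blk code): cleanup labels undo the setup steps in reverse
order, so the resource initialized first is released last.

#include <stdio.h>

static int setup_zones(void) { puts("init zones"); return 0; }
static int alloc_disk(void)  { puts("alloc disk"); return 0; }
static int add_disk(void)    { puts("add disk");   return -1; } /* fails */
static void free_zones(void) { puts("free zones"); }
static void put_disk(void)   { puts("put disk"); }

int main(void)
{
	int rv;

	rv = setup_zones();		/* step 1 */
	if (rv)
		return 1;
	rv = alloc_disk();		/* step 2 */
	if (rv)
		goto out_free_zones;
	rv = add_disk();		/* step 3, fails here */
	if (rv)
		goto out_put_disk;
	return 0;

out_put_disk:
	put_disk();	/* undo step 2 first... */
out_free_zones:
	free_zones();	/* ...then undo step 1 */
	return 1;
}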

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 19/30] null_blk: Introduce fua attribute
  2024-03-28  0:43 ` [PATCH v3 19/30] null_blk: Introduce fua attribute Damien Le Moal
  2024-03-28  4:52   ` Christoph Hellwig
  2024-03-29 21:36   ` Bart Van Assche
@ 2024-04-02  6:42   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 109+ messages in thread
From: Chaitanya Kulkarni @ 2024-04-02  6:42 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, Jens Axboe, linux-scsi, Martin K . Petersen,
	dm-devel, Mike Snitzer, linux-nvme, Keith Busch,
	Christoph Hellwig

On 3/27/24 17:43, Damien Le Moal wrote:
> Add the fua configfs attribute and module parameter to allow
> configuring whether the device supports FUA. Using this attribute
> has an effect on the null_blk device only if memory backing is enabled
> together with a write cache (cache_size option).
> 
> This new attribute allows configuring a null_blk device with a write
> cache but without FUA support. This is convenient to test the block
> layer flush machinery.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>


Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck

* Re: [PATCH v3 17/30] null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  2024-03-28  0:43 ` [PATCH v3 17/30] null_blk: " Damien Le Moal
  2024-03-28  4:51   ` Christoph Hellwig
  2024-03-29 21:29   ` Bart Van Assche
@ 2024-04-02  6:43   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 109+ messages in thread
From: Chaitanya Kulkarni @ 2024-04-02  6:43 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 17:43, Damien Le Moal wrote:
> With zone write plugging enabled at the block layer level, any zoned
> device can only ever see at most a single write operation per zone.
> There is thus no need to request a block scheduler with strict per-zone
> sequential write ordering control through the ELEVATOR_F_ZBD_SEQ_WRITE
> feature. Removing this allows using a zoned null_blk device with any
> scheduler, including "none".
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
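
As a practical aside, this means any scheduler can now be selected for
a zoned null_blk device. A trivial userspace sketch, assuming a nullb0
device (equivalent to echoing "none" into the queue scheduler
attribute):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/block/nullb0/queue/scheduler";
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, "none", 4) < 0) {
		perror(path);
		return 1;
	}
	close(fd);
	return 0;
}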


Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck

* Re: [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute
  2024-03-28  0:43 ` [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
  2024-03-28  4:51   ` Christoph Hellwig
  2024-03-29 21:35   ` Bart Van Assche
@ 2024-04-02  6:44   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 109+ messages in thread
From: Chaitanya Kulkarni @ 2024-04-02  6:44 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 17:43, Damien Le Moal wrote:
> Add the zone_append_max_sectors configfs attribute and module parameter
> to allow configuring the maximum number of 512B sectors of zone append
> operations. This attribute is meaningful only for zoned null block
> devices.
> 
> If not specified, the default is unchanged and the zoned device max
> append sectors limit is set to the device max sectors limit.
> If a non-zero value is used for this attribute (which is the default),
> native support for zone append operations is enabled. Setting a 0
> value disables native zone append operations support so that the
> block layer emulation is used instead.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
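
To restate those semantics in code form, a small hypothetical helper
(assumed names, not the actual null_blk implementation):

#include <stdbool.h>

struct zone_append_limits {
	unsigned int max_zone_append_sectors;	/* 512B sectors */
	bool native_zone_append;
};

/* param == 0 selects the block layer zone append emulation; any
 * non-zero value enables native zone append, capped by the device
 * max sectors limit (which is also the default when unset). */
struct zone_append_limits
apply_zone_append_param(unsigned int param, unsigned int max_sectors)
{
	struct zone_append_limits l;

	if (!param) {
		l.max_zone_append_sectors = 0;
		l.native_zone_append = false;
	} else {
		l.max_zone_append_sectors =
			param < max_sectors ? param : max_sectors;
		l.native_zone_append = true;
	}
	return l;
}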


Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck

* Re: [PATCH v3 20/30] nvmet: zns: Do not reference the gendisk conv_zones_bitmap
  2024-03-28  0:43 ` [PATCH v3 20/30] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
@ 2024-04-02  6:45   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 109+ messages in thread
From: Chaitanya Kulkarni @ 2024-04-02  6:45 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, linux-scsi,
	Martin K . Petersen, dm-devel, Mike Snitzer, linux-nvme,
	Keith Busch, Christoph Hellwig

On 3/27/24 17:43, Damien Le Moal wrote:
> The gendisk conventional zone bitmap is going away. So to check for the
> presence of conventional zones on a zoned target device, always use
> report zones.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
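
For context, checking for conventional zones with a full zone report
looks roughly like this (a kernel-style sketch against the in-kernel
report zones API; error handling trimmed):

#include <linux/blkdev.h>
#include <linux/blkzoned.h>

static int count_conv_zones_cb(struct blk_zone *zone, unsigned int idx,
			       void *data)
{
	unsigned int *nr_conv = data;

	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
		(*nr_conv)++;
	return 0;
}

static bool bdev_has_conv_zones(struct block_device *bdev)
{
	unsigned int nr_conv = 0;
	int ret;

	/* Report all zones and count the conventional ones. */
	ret = blkdev_report_zones(bdev, 0, BLK_ALL_ZONES,
				  count_conv_zones_cb, &nr_conv);
	return ret >= 0 && nr_conv > 0;
}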

Looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck

Thread overview: 109+ messages

2024-03-28  0:43 [PATCH v3 00/30] Zone write plugging Damien Le Moal
2024-03-28  0:43 ` [PATCH v3 01/30] block: Do not force full zone append completion in req_bio_endio() Damien Le Moal
2024-03-28  4:10   ` Christoph Hellwig
2024-03-28 18:14   ` Bart Van Assche
2024-03-28 22:43     ` Damien Le Moal
2024-03-28 23:03       ` Jens Axboe
2024-03-28  0:43 ` [PATCH v3 02/30] block: Restore sector of flush requests Damien Le Moal
2024-03-28  0:43 ` [PATCH v3 03/30] block: Remove req_bio_endio() Damien Le Moal
2024-03-28  4:13   ` Christoph Hellwig
2024-03-28 21:28   ` Bart Van Assche
2024-03-28 22:42     ` Damien Le Moal
2024-03-28  0:43 ` [PATCH v3 04/30] block: Introduce blk_zone_update_request_bio() Damien Le Moal
2024-03-28  4:14   ` Christoph Hellwig
2024-03-28  5:20     ` Damien Le Moal
2024-03-28  5:42       ` Christoph Hellwig
2024-03-28  5:54         ` Damien Le Moal
2024-03-28 21:31   ` Bart Van Assche
2024-03-28  0:43 ` [PATCH v3 05/30] block: Introduce bio_straddles_zones() and bio_offset_from_zone_start() Damien Le Moal
2024-03-28 21:32   ` Bart Van Assche
2024-03-28  0:43 ` [PATCH v3 06/30] block: Allow using bio_attempt_back_merge() internally Damien Le Moal
2024-03-28  0:43 ` [PATCH v3 07/30] block: Remember zone capacity when revalidating zones Damien Le Moal
2024-03-28 21:38   ` Bart Van Assche
2024-03-28 22:40     ` Damien Le Moal
2024-03-28  0:43 ` [PATCH v3 08/30] block: Introduce zone write plugging Damien Le Moal
2024-03-28  4:48   ` Christoph Hellwig
2024-03-28 22:20   ` Bart Van Assche
2024-03-28 22:38     ` Damien Le Moal
2024-03-29 18:20       ` Bart Van Assche
2024-03-28  0:43 ` [PATCH v3 09/30] block: Pre-allocate zone write plugs Damien Le Moal
2024-03-28  4:30   ` Christoph Hellwig
2024-03-28  5:28     ` Damien Le Moal
2024-03-28  5:46       ` Christoph Hellwig
2024-03-28  6:02         ` Damien Le Moal
2024-03-28  6:03           ` Christoph Hellwig
2024-03-28  6:18             ` Damien Le Moal
2024-03-28  6:22               ` Christoph Hellwig
2024-03-28  6:33                 ` Damien Le Moal
2024-03-28  6:38                   ` Christoph Hellwig
2024-03-28  6:51                     ` Damien Le Moal
2024-03-28  6:52                       ` Christoph Hellwig
2024-03-28  6:53                         ` Damien Le Moal
2024-03-28 22:25     ` Bart Van Assche
2024-03-28 22:29   ` Bart Van Assche
2024-03-28 22:33     ` Damien Le Moal
2024-03-28  0:43 ` [PATCH v3 10/30] block: Fake max open zones limit when there is no limit Damien Le Moal
2024-03-28  4:49   ` Christoph Hellwig
2024-03-29 20:37   ` Bart Van Assche
2024-03-28  0:43 ` [PATCH v3 11/30] block: Allow zero value of max_zone_append_sectors queue limit Damien Le Moal
2024-03-28  4:49   ` Christoph Hellwig
2024-03-29 20:50   ` Bart Van Assche
2024-03-28  0:43 ` [PATCH v3 12/30] block: Implement zone append emulation Damien Le Moal
2024-03-28  4:50   ` Christoph Hellwig
2024-03-29 21:22   ` Bart Van Assche
2024-03-29 21:26   ` Bart Van Assche
2024-03-28  0:43 ` [PATCH v3 13/30] block: Allow BIO-based drivers to use blk_revalidate_disk_zones() Damien Le Moal
2024-03-28  0:43 ` [PATCH v3 14/30] dm: Use the block layer zone append emulation Damien Le Moal
2024-03-28  0:43 ` [PATCH v3 15/30] scsi: sd: " Damien Le Moal
2024-03-28  4:50   ` Christoph Hellwig
2024-03-28 10:49   ` Johannes Thumshirn
2024-03-29 21:27   ` Bart Van Assche
2024-03-28  0:43 ` [PATCH v3 16/30] ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature Damien Le Moal
2024-03-28  4:50   ` Christoph Hellwig
2024-03-29 21:28   ` Bart Van Assche
2024-03-28  0:43 ` [PATCH v3 17/30] null_blk: " Damien Le Moal
2024-03-28  4:51   ` Christoph Hellwig
2024-03-29 21:29   ` Bart Van Assche
2024-04-02  6:43   ` Chaitanya Kulkarni
2024-03-28  0:43 ` [PATCH v3 18/30] null_blk: Introduce zone_append_max_sectors attribute Damien Le Moal
2024-03-28  4:51   ` Christoph Hellwig
2024-03-29 21:35   ` Bart Van Assche
2024-03-30  0:33     ` Damien Le Moal
2024-04-02  6:44   ` Chaitanya Kulkarni
2024-03-28  0:43 ` [PATCH v3 19/30] null_blk: Introduce fua attribute Damien Le Moal
2024-03-28  4:52   ` Christoph Hellwig
2024-03-29 21:36   ` Bart Van Assche
2024-04-02  6:42   ` Chaitanya Kulkarni
2024-03-28  0:43 ` [PATCH v3 20/30] nvmet: zns: Do not reference the gendisk conv_zones_bitmap Damien Le Moal
2024-04-02  6:45   ` Chaitanya Kulkarni
2024-03-28  0:44 ` [PATCH v3 21/30] block: Remove BLK_STS_ZONE_RESOURCE Damien Le Moal
2024-03-29 21:37   ` Bart Van Assche
2024-03-28  0:44 ` [PATCH v3 22/30] block: Simplify blk_revalidate_disk_zones() interface Damien Le Moal
2024-03-29 21:41   ` Bart Van Assche
2024-03-28  0:44 ` [PATCH v3 23/30] block: mq-deadline: Remove support for zone write locking Damien Le Moal
2024-03-28  4:52   ` Christoph Hellwig
2024-03-29 21:43   ` Bart Van Assche
2024-03-28  0:44 ` [PATCH v3 24/30] block: Remove elevator required features Damien Le Moal
2024-03-29 21:44   ` Bart Van Assche
2024-03-28  0:44 ` [PATCH v3 25/30] block: Do not check zone type in blk_check_zone_append() Damien Le Moal
2024-03-29 21:45   ` Bart Van Assche
2024-03-28  0:44 ` [PATCH v3 26/30] block: Move zone related debugfs attribute to blk-zoned.c Damien Le Moal
2024-03-28  4:52   ` Christoph Hellwig
2024-03-29 19:00   ` Bart Van Assche
2024-03-28  0:44 ` [PATCH v3 27/30] block: Replace zone_wlock debugfs entry with zone_wplugs entry Damien Le Moal
2024-03-28  4:53   ` Christoph Hellwig
2024-03-29 18:54   ` Bart Van Assche
2024-03-28  0:44 ` [PATCH v3 28/30] block: Remove zone write locking Damien Le Moal
2024-03-29 18:57   ` Bart Van Assche
2024-03-28  0:44 ` [PATCH v3 29/30] block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED Damien Le Moal
2024-03-28  4:53   ` Christoph Hellwig
2024-03-28  0:44 ` [PATCH v3 30/30] block: Do not special-case plugging of zone write operations Damien Le Moal
2024-03-28  4:54   ` Christoph Hellwig
2024-03-28  6:43     ` Damien Le Moal
2024-03-28  6:51       ` Christoph Hellwig
2024-03-28  6:54         ` Damien Le Moal
2024-03-29 18:58   ` Bart Van Assche
2024-03-28 23:05 ` (subset) [PATCH v3 00/30] Zone write plugging Jens Axboe
2024-03-28 23:13   ` Damien Le Moal
2024-03-28 23:27     ` Jens Axboe
2024-03-28 23:33       ` Damien Le Moal
