dm-devel.redhat.com archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/14] Zone write plugging fixes and cleanup
@ 2024-05-01 11:08 Damien Le Moal
  2024-05-01 11:08 ` [PATCH v3 01/14] dm: Check that a zoned table leads to a valid mapped device Damien Le Moal
                   ` (14 more replies)
  0 siblings, 15 replies; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:08 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Jens, Mike,

With more testing of zone write plugging on more device setups,
including weird/test setups (with scsi debug and null_blk), several
issues were identified. This patch series addresses them and cleanup the
code a little to try to make it more obvious.

The first patch is a DM modification to not expose zoned devices
composed solely of conventional zones as zoned block devices. The second
patch is a fix partly related to this to expose a correct max open zones
limit for devices with no limits (which DM devices are as the max
open/active limit is not propagated to the mapped device as there is no
easy way to do that).

Patches 3 to 9 are bug fixes. The most serious problem among theses was
detected with tests using scsi_debug zoned devices and is fixed in
patch 7.

Patches 10 to 14 improve and cleanup the code.

Changes from v2:
 - Fixed comment in patch 1
 - Addressed Christoph's suggestion in patch 2 (mempool_resize call).
 - Added review tags

Changes from v1:
 - Changed patch 4 as suggested by Christoph
 - Added additional comments in patch 7 to clarify the check for the
   zone write plug reference count in disk_should_remove_zone_wplug()
 - Added patch 14
 - Added review tags

Damien Le Moal (14):
  dm: Check that a zoned table leads to a valid mapped device
  block: Exclude conventional zones when faking max open limit
  block: Fix zone write plug initialization from blk_revalidate_zone_cb()
  block: Fix reference counting for zone write plugs in error state
  block: Hold a reference on zone write plugs to schedule submission
  block: Unhash a zone write plug only if needed
  block: Do not remove zone write plugs still in use
  block: Fix flush request sector restore
  block: Fix handling of non-empty flush write requests to zones
  block: Improve blk_zone_write_plug_bio_merged()
  block: Improve zone write request completion handling
  block: Simplify blk_zone_write_plug_bio_endio()
  block: Simplify zone write plug BIO abort
  block: Cleanup blk_revalidate_zone_cb()

 block/blk-flush.c     |   3 +-
 block/blk-mq.c        |  12 +-
 block/blk-zoned.c     | 403 +++++++++++++++++++++++++++---------------
 block/blk.h           |  12 +-
 drivers/md/dm-table.c |   3 +-
 drivers/md/dm-zone.c  |  57 ++++++
 6 files changed, 335 insertions(+), 155 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 01/14] dm: Check that a zoned table leads to a valid mapped device
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
@ 2024-05-01 11:08 ` Damien Le Moal
  2024-05-01 11:08 ` [PATCH v3 02/14] block: Exclude conventional zones when faking max open limit Damien Le Moal
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:08 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Using targets such as dm-linear, a mapped device can be created to
contain only conventional zones. Such device should not be treated as
zoned as it does not contain any mandatory sequential write required
zone. Since such device can be randomly written, we can modify
dm_set_zones_restrictions() to set the mapped device zoned queue limit
to false to expose it as a regular block device. The function
dm_check_zoned() does this after counting the number of conventional
zones of the mapped device and comparing it to the total number of zones
reported. The special dm_check_zoned_cb() report zones callback function
is used to count conventional zones.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm-table.c |  3 ++-
 drivers/md/dm-zone.c  | 57 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 41f1d731ae5a..2c6fbd87363f 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -2042,7 +2042,8 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 		r = dm_set_zones_restrictions(t, q);
 		if (r)
 			return r;
-		if (!static_key_enabled(&zoned_enabled.key))
+		if (blk_queue_is_zoned(q) &&
+		    !static_key_enabled(&zoned_enabled.key))
 			static_branch_enable(&zoned_enabled);
 	}
 
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index d17ae4486a6a..8e6bcb0d786a 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -145,6 +145,52 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
 	}
 }
 
+/*
+ * Count conventional zones of a mapped zoned device. If the device
+ * only has conventional zones, do not expose it as zoned.
+ */
+static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
+			     void *data)
+{
+	unsigned int *nr_conv_zones = data;
+
+	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
+		(*nr_conv_zones)++;
+
+	return 0;
+}
+
+static int dm_check_zoned(struct mapped_device *md, struct dm_table *t)
+{
+	struct gendisk *disk = md->disk;
+	unsigned int nr_conv_zones = 0;
+	int ret;
+
+	/* Count conventional zones */
+	md->zone_revalidate_map = t;
+	ret = dm_blk_report_zones(disk, 0, UINT_MAX,
+				  dm_check_zoned_cb, &nr_conv_zones);
+	md->zone_revalidate_map = NULL;
+	if (ret < 0) {
+		DMERR("Check zoned failed %d", ret);
+		return ret;
+	}
+
+	/*
+	 * If we only have conventional zones, expose the mapped device as
+	 * a regular device.
+	 */
+	if (nr_conv_zones >= ret) {
+		disk->queue->limits.max_open_zones = 0;
+		disk->queue->limits.max_active_zones = 0;
+		disk->queue->limits.zoned = false;
+		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+		disk->nr_zones = 0;
+	}
+
+	return 0;
+}
+
 /*
  * Revalidate the zones of a mapped device to initialize resource necessary
  * for zone append emulation. Note that we cannot simply use the block layer
@@ -208,6 +254,7 @@ static bool dm_table_supports_zone_append(struct dm_table *t)
 int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 {
 	struct mapped_device *md = t->md;
+	int ret;
 
 	/*
 	 * Check if zone append is natively supported, and if not, set the
@@ -224,6 +271,16 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 	if (!get_capacity(md->disk))
 		return 0;
 
+	/*
+	 * Check that the mapped device will indeed be zoned, that is, that it
+	 * has sequential write required zones.
+	 */
+	ret = dm_check_zoned(md, t);
+	if (ret)
+		return ret;
+	if (!blk_queue_is_zoned(q))
+		return 0;
+
 	if (!md->disk->nr_zones) {
 		DMINFO("%s using %s zone append",
 		       md->disk->disk_name,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 02/14] block: Exclude conventional zones when faking max open limit
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
  2024-05-01 11:08 ` [PATCH v3 01/14] dm: Check that a zoned table leads to a valid mapped device Damien Le Moal
@ 2024-05-01 11:08 ` Damien Le Moal
  2024-05-02  5:52   ` Hannes Reinecke
  2024-05-01 11:08 ` [PATCH v3 03/14] block: Fix zone write plug initialization from blk_revalidate_zone_cb() Damien Le Moal
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:08 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

For a device that has no limits for the maximum number of open and
active zones, we default to using the number of zones, limited to
BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE (128), for the maximum number of open
zones indicated to the user. However, for a device that has conventional
zones and less zones than BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, we should
not account conventional zones and set the limit to the number of
sequential write required zones. Furthermore, for cases where the limit
is equal to the number of sequential write required zones, we can
advertize a limit of 0 to indicate "no limits".

Fix this by moving the zone write plug mempool resizing from
disk_revalidate_zone_resources() to disk_update_zone_resources() where
we can safely compute the number of conventional zones and update the
limits.

Fixes: 843283e96e5a ("block: Fake max open zones limit when there is no limit")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index bad68277c0b2..731d1abb80f6 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1513,10 +1513,6 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
 	if (!disk->zone_wplugs_hash)
 		return disk_alloc_zone_resources(disk, pool_size);
 
-	/* Resize the zone write plug memory pool if needed. */
-	if (disk->zone_wplugs_pool->min_nr != pool_size)
-		return mempool_resize(disk->zone_wplugs_pool, pool_size);
-
 	return 0;
 }
 
@@ -1536,11 +1532,24 @@ static int disk_update_zone_resources(struct gendisk *disk,
 				      struct blk_revalidate_zone_args *args)
 {
 	struct request_queue *q = disk->queue;
+	unsigned int nr_seq_zones, nr_conv_zones = 0;
+	unsigned int pool_size;
 	struct queue_limits lim;
 
 	disk->nr_zones = args->nr_zones;
 	disk->zone_capacity = args->zone_capacity;
 	swap(disk->conv_zones_bitmap, args->conv_zones_bitmap);
+	if (disk->conv_zones_bitmap)
+		nr_conv_zones = bitmap_weight(disk->conv_zones_bitmap,
+					      disk->nr_zones);
+	if (nr_conv_zones >= disk->nr_zones) {
+		pr_warn("%s: Invalid number of conventional zones %u / %u\n",
+			disk->disk_name, nr_conv_zones, disk->nr_zones);
+		return -ENODEV;
+	}
+
+	if (!disk->zone_wplugs_pool)
+		return 0;
 
 	/*
 	 * If the device has no limit on the maximum number of open and active
@@ -1549,14 +1558,23 @@ static int disk_update_zone_resources(struct gendisk *disk,
 	 * dynamic zone write plug allocation when simultaneously writing to
 	 * more zones than the size of the mempool.
 	 */
-	if (disk->zone_wplugs_pool) {
-		lim = queue_limits_start_update(q);
-		if (!lim.max_open_zones && !lim.max_active_zones)
-			lim.max_open_zones = disk->zone_wplugs_pool->min_nr;
-		return queue_limits_commit_update(q, &lim);
+	lim = queue_limits_start_update(q);
+
+	nr_seq_zones = disk->nr_zones - nr_conv_zones;
+	pool_size = max(lim.max_open_zones, lim.max_active_zones);
+	if (!pool_size)
+		pool_size = min(BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, nr_seq_zones);
+
+	mempool_resize(disk->zone_wplugs_pool, pool_size);
+
+	if (!lim.max_open_zones && !lim.max_active_zones) {
+		if (pool_size < nr_seq_zones)
+			lim.max_open_zones = pool_size;
+		else
+			lim.max_open_zones = 0;
 	}
 
-	return 0;
+	return queue_limits_commit_update(q, &lim);
 }
 
 /*
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 03/14] block: Fix zone write plug initialization from blk_revalidate_zone_cb()
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
  2024-05-01 11:08 ` [PATCH v3 01/14] dm: Check that a zoned table leads to a valid mapped device Damien Le Moal
  2024-05-01 11:08 ` [PATCH v3 02/14] block: Exclude conventional zones when faking max open limit Damien Le Moal
@ 2024-05-01 11:08 ` Damien Le Moal
  2024-05-02  5:53   ` Hannes Reinecke
  2024-05-01 11:08 ` [PATCH v3 04/14] block: Fix reference counting for zone write plugs in error state Damien Le Moal
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:08 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

When revalidating the zones of a zoned block device,
blk_revalidate_zone_cb() must allocate a zone write plug for any
sequential write required zone that is not empty nor full. However, the
current code tests the latter case by comparing the zone write pointer
offset to the zone size instead of the zone capacity. Furthermore,
disk_get_and_lock_zone_wplug() is called with a sector argument equal to
the zone start instead of the current zone write pointer position.
This commit fixes both issues by calling disk_get_and_lock_zone_wplug()
for a zone that is not empty and with a write pointer offset lower than
the zone capacity and use the zone capacity sector as the sector
argument for disk_get_and_lock_zone_wplug().

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 731d1abb80f6..7824bd52c82c 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1664,10 +1664,11 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		 * empty nor full. So make sure we have a zone write plug for
 		 * such zone if the device has a zone write plug hash table.
 		 */
+		if (!disk->zone_wplugs_hash)
+			break;
 		wp_offset = blk_zone_wp_offset(zone);
-		if (disk->zone_wplugs_hash &&
-		    wp_offset && wp_offset < zone_sectors) {
-			zwplug = disk_get_and_lock_zone_wplug(disk, zone->start,
+		if (wp_offset && wp_offset < zone->capacity) {
+			zwplug = disk_get_and_lock_zone_wplug(disk, zone->wp,
 							      GFP_NOIO, &flags);
 			if (!zwplug)
 				return -ENOMEM;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 04/14] block: Fix reference counting for zone write plugs in error state
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (2 preceding siblings ...)
  2024-05-01 11:08 ` [PATCH v3 03/14] block: Fix zone write plug initialization from blk_revalidate_zone_cb() Damien Le Moal
@ 2024-05-01 11:08 ` Damien Le Moal
  2024-05-02  6:01   ` Hannes Reinecke
  2024-05-01 11:08 ` [PATCH v3 05/14] block: Hold a reference on zone write plugs to schedule submission Damien Le Moal
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:08 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

When zone is reset or finished, disk_zone_wplug_set_wp_offset() is
called to update the zone write plug write pointer offset and to clear
the zone error state (BLK_ZONE_WPLUG_ERROR flag) if it is set.
However, this processing is missing dropping the reference to the zone
write plug that was taken in disk_zone_wplug_set_error() when the error
flag was first set. Furthermore, the error state handling must release
the zone write plug lock to first execute a report zones command. When
the report zone races with a reset or finish operation that clears the
error, we can end up decrementing the zone write plug reference count
twice: once in disk_zone_wplug_set_wp_offset() for the reset/finish
operation and one more time in disk_zone_wplugs_work() once
disk_zone_wplug_handle_error() completes.

Fix this by introducing disk_zone_wplug_clear_error() as the symmetric
function of disk_zone_wplug_set_error(). disk_zone_wplug_clear_error()
decrements the zone write plug reference count obtained in
disk_zone_wplug_set_error() only if the error handling has not started
yet, that is, only if disk_zone_wplugs_work() has not yet taken the zone
write plug off the error list. This ensure that either
disk_zone_wplug_clear_error() or disk_zone_wplugs_work() drop the zone
write plug reference count.

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c | 75 +++++++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 26 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7824bd52c82c..23ad1de0da62 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -658,6 +658,54 @@ static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
 	bio_list_merge(&zwplug->bio_list, &bl);
 }
 
+static inline void disk_zone_wplug_set_error(struct gendisk *disk,
+					     struct blk_zone_wplug *zwplug)
+{
+	unsigned long flags;
+
+	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
+		return;
+
+	/*
+	 * At this point, we already have a reference on the zone write plug.
+	 * However, since we are going to add the plug to the disk zone write
+	 * plugs work list, increase its reference count. This reference will
+	 * be dropped in disk_zone_wplugs_work() once the error state is
+	 * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
+	 * finished.
+	 */
+	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
+	atomic_inc(&zwplug->ref);
+
+	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+	list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
+	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+}
+
+static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
+					       struct blk_zone_wplug *zwplug)
+{
+	unsigned long flags;
+
+	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
+		return;
+
+	/*
+	 * We are racing with the error handling work which drops the reference
+	 * on the zone write plug after handling the error state. So remove the
+	 * plug from the error list and drop its reference count only if the
+	 * error handling has not yet started, that is, if the zone write plug
+	 * is still listed.
+	 */
+	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+	if (!list_empty(&zwplug->link)) {
+		list_del_init(&zwplug->link);
+		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
+		disk_put_zone_wplug(zwplug);
+	}
+	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+}
+
 /*
  * Set a zone write plug write pointer offset to either 0 (zone reset case)
  * or to the zone size (zone finish case). This aborts all plugged BIOs, which
@@ -691,12 +739,7 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
 	 * in a good state. So clear the error flag and decrement the
 	 * error count if we were in error state.
 	 */
-	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
-		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
-		spin_lock(&disk->zone_wplugs_lock);
-		list_del_init(&zwplug->link);
-		spin_unlock(&disk->zone_wplugs_lock);
-	}
+	disk_zone_wplug_clear_error(disk, zwplug);
 
 	/*
 	 * The zone write plug now has no BIO plugged: remove it from the
@@ -885,26 +928,6 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 	spin_unlock_irqrestore(&zwplug->lock, flags);
 }
 
-static inline void disk_zone_wplug_set_error(struct gendisk *disk,
-					     struct blk_zone_wplug *zwplug)
-{
-	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR)) {
-		unsigned long flags;
-
-		/*
-		 * Increase the plug reference count. The reference will be
-		 * dropped in disk_zone_wplugs_work() once the error state
-		 * is handled.
-		 */
-		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
-		atomic_inc(&zwplug->ref);
-
-		spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
-		list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
-		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-	}
-}
-
 /*
  * Check and prepare a BIO for submission by incrementing the write pointer
  * offset of its zone write plug and changing zone append operations into
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 05/14] block: Hold a reference on zone write plugs to schedule submission
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (3 preceding siblings ...)
  2024-05-01 11:08 ` [PATCH v3 04/14] block: Fix reference counting for zone write plugs in error state Damien Le Moal
@ 2024-05-01 11:08 ` Damien Le Moal
  2024-05-02  6:04   ` Hannes Reinecke
  2024-05-01 11:08 ` [PATCH v3 06/14] block: Unhash a zone write plug only if needed Damien Le Moal
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:08 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Since a zone write plug BIO work is a field of struct blk_zone_wplug, we
must ensure that a zone write plug is never freed when its BIO
submission work is queued or running. Do this by holding a reference on
the zone write plug when the submission work is scheduled for execution
with queue_work() and releasing the reference at the end of the
execution of the work function blk_zone_wplug_bio_work().
The helper function disk_zone_wplug_schedule_bio_work() is introduced to
get a reference on a zone write plug and queue its work. This helper is
used in disk_zone_wplug_unplug_bio() and disk_zone_wplug_handle_error().

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 23ad1de0da62..78557f810f1d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1132,6 +1132,19 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
 }
 EXPORT_SYMBOL_GPL(blk_zone_plug_bio);
 
+static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
+					      struct blk_zone_wplug *zwplug)
+{
+	/*
+	 * Take a reference on the zone write plug and schedule the submission
+	 * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
+	 * reference we take here.
+	 */
+	WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
+	atomic_inc(&zwplug->ref);
+	queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
+}
+
 static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
 				       struct blk_zone_wplug *zwplug)
 {
@@ -1151,8 +1164,8 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
 
 	/* Schedule submission of the next plugged BIO if we have one. */
 	if (!bio_list_empty(&zwplug->bio_list)) {
+		disk_zone_wplug_schedule_bio_work(disk, zwplug);
 		spin_unlock_irqrestore(&zwplug->lock, flags);
-		queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
 		return;
 	}
 
@@ -1252,14 +1265,14 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 	if (!bio) {
 		zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
 		spin_unlock_irqrestore(&zwplug->lock, flags);
-		return;
+		goto put_zwplug;
 	}
 
 	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
 		/* Error recovery will decide what to do with the BIO. */
 		bio_list_add_head(&zwplug->bio_list, bio);
 		spin_unlock_irqrestore(&zwplug->lock, flags);
-		return;
+		goto put_zwplug;
 	}
 
 	spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1275,6 +1288,10 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
 	 */
 	if (bdev->bd_has_submit_bio)
 		blk_queue_exit(bdev->bd_disk->queue);
+
+put_zwplug:
+	/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
+	disk_put_zone_wplug(zwplug);
 }
 
 static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
@@ -1354,8 +1371,7 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
 
 	/* Restart BIO submission if we still have any BIO left. */
 	if (!bio_list_empty(&zwplug->bio_list)) {
-		WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
-		queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
+		disk_zone_wplug_schedule_bio_work(disk, zwplug);
 		goto unlock;
 	}
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 06/14] block: Unhash a zone write plug only if needed
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (4 preceding siblings ...)
  2024-05-01 11:08 ` [PATCH v3 05/14] block: Hold a reference on zone write plugs to schedule submission Damien Le Moal
@ 2024-05-01 11:08 ` Damien Le Moal
  2024-05-02  6:05   ` Hannes Reinecke
  2024-05-01 11:09 ` [PATCH v3 07/14] block: Do not remove zone write plugs still in use Damien Le Moal
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:08 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Fix disk_remove_zone_wplug() to ensure that a zone write plug already
removed from a disk hash table of zone write plugs is not removed
again. Do this by checking the BLK_ZONE_WPLUG_UNHASHED flag of the plug
and calling hlist_del_init_rcu() only if the flag is not set.

Furthermore, since BIO completions can happen at any time, that is,
decrementing of the zone write plug reference count can happen at any
time, make sure to use disk_put_zone_wplug() instead of atomic_dec() to
ensure that the zone write plug is freed when its last reference is
dropped. In order to do this, disk_remove_zone_wplug() is moved after
the definition of disk_put_zone_wplug(). disk_should_remove_zone_wplug()
is moved as well to keep it together with disk_remove_zone_wplug().

To be consistent with this change, add a check in disk_put_zone_wplug()
to ensure that a zone write plug being freed was already removed from
the disk hash table.

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c | 55 +++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 78557f810f1d..2f61ba56dad2 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -476,29 +476,6 @@ static bool disk_insert_zone_wplug(struct gendisk *disk,
 	return true;
 }
 
-static void disk_remove_zone_wplug(struct gendisk *disk,
-				   struct blk_zone_wplug *zwplug)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
-	zwplug->flags |= BLK_ZONE_WPLUG_UNHASHED;
-	atomic_dec(&zwplug->ref);
-	hlist_del_init_rcu(&zwplug->node);
-	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
-}
-
-static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
-						 struct blk_zone_wplug *zwplug)
-{
-	/* If the zone is still busy, the plug cannot be removed. */
-	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
-		return false;
-
-	/* We can remove zone write plugs for zones that are empty or full. */
-	return !zwplug->wp_offset || zwplug->wp_offset >= disk->zone_capacity;
-}
-
 static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
 						  sector_t sector)
 {
@@ -534,11 +511,43 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
 	if (atomic_dec_and_test(&zwplug->ref)) {
 		WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
 		WARN_ON_ONCE(!list_empty(&zwplug->link));
+		WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_UNHASHED));
 
 		call_rcu(&zwplug->rcu_head, disk_free_zone_wplug_rcu);
 	}
 }
 
+static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
+						 struct blk_zone_wplug *zwplug)
+{
+	/* If the zone is still busy, the plug cannot be removed. */
+	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+		return false;
+
+	/* We can remove zone write plugs for zones that are empty or full. */
+	return !zwplug->wp_offset || zwplug->wp_offset >= disk->zone_capacity;
+}
+
+static void disk_remove_zone_wplug(struct gendisk *disk,
+				   struct blk_zone_wplug *zwplug)
+{
+	unsigned long flags;
+
+	/* If the zone write plug was already removed, we have nothing to do. */
+	if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
+		return;
+
+	/*
+	 * Mark the zone write plug as unhashed and drop the extra reference we
+	 * took when the plug was inserted in the hash table.
+	 */
+	zwplug->flags |= BLK_ZONE_WPLUG_UNHASHED;
+	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+	hlist_del_init_rcu(&zwplug->node);
+	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+	disk_put_zone_wplug(zwplug);
+}
+
 static void blk_zone_wplug_bio_work(struct work_struct *work);
 
 /*
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 07/14] block: Do not remove zone write plugs still in use
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (5 preceding siblings ...)
  2024-05-01 11:08 ` [PATCH v3 06/14] block: Unhash a zone write plug only if needed Damien Le Moal
@ 2024-05-01 11:09 ` Damien Le Moal
  2024-05-02  5:38   ` Christoph Hellwig
  2024-05-02  6:09   ` Hannes Reinecke
  2024-05-01 11:09 ` [PATCH v3 08/14] block: Fix flush request sector restore Damien Le Moal
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:09 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Large write BIOs that span a zone boundary are split in
blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
write plugging. Such split BIO will be chained with one fragment
targeting one zone and the remainder of the BIO targeting the next
zone. The two BIOs can be executed in parallel, without a predetermine
order relative to eachother and their completion may be reversed: the
remainder first completing and the first fragment then completing. In
such case, bio_endio() will not immediately execute
blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the
split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for
the parent BIO will be executed only once the first fragment completes.

In the case of a device with small zones and very large BIOs, uch
completion pattern can lead to disk_should_remove_zone_wplug() to return
true for the zone of the parent BIO when the parent BIO request
completes and blk_zone_write_plug_complete_request() is executed. This
triggers the removal of the zone write plug from the hash table using
disk_remove_zone_wplug(). With the zone write plug of the parent BIO
missing, the call to disk_get_zone_wplug() in
blk_zone_write_plug_bio_endio() returns NULL and triggers a warning.

This patterns can be recreated fairly easily using a scsi_debug device
with small zone and btrfs. E.g.

modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \
	zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4
mkfs.btrfs -f -O zoned /dev/sda
mount -t btrfs /dev/sda /mnt
fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \
	--bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \
	--runtime=10
umount /dev/sda

Will result in the warning:

[   29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0
...
[   29.058682] Call Trace:
[   29.059095]  <TASK>
[   29.059473]  ? __warn+0x80/0x120
[   29.059983]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
[   29.060728]  ? report_bug+0x160/0x190
[   29.061283]  ? handle_bug+0x36/0x70
[   29.061830]  ? exc_invalid_op+0x17/0x60
[   29.062399]  ? asm_exc_invalid_op+0x1a/0x20
[   29.063025]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
[   29.063760]  bio_endio+0xb7/0x150
[   29.064280]  btrfs_clone_write_end_io+0x2b/0x60 [btrfs]
[   29.065049]  blk_update_request+0x17c/0x500
[   29.065666]  scsi_end_request+0x27/0x1a0 [scsi_mod]
[   29.066356]  scsi_io_completion+0x5b/0x690 [scsi_mod]
[   29.067077]  blk_complete_reqs+0x3a/0x50
[   29.067692]  __do_softirq+0xcf/0x2b3
[   29.068248]  ? sort_range+0x20/0x20
[   29.068791]  run_ksoftirqd+0x1c/0x30
[   29.069339]  smpboot_thread_fn+0xcc/0x1b0
[   29.069936]  kthread+0xcf/0x100
[   29.070438]  ? kthread_complete_and_exit+0x20/0x20
[   29.071314]  ret_from_fork+0x31/0x50
[   29.071873]  ? kthread_complete_and_exit+0x20/0x20
[   29.072563]  ret_from_fork_asm+0x11/0x20
[   29.073146]  </TASK>

either when fio executes or when unmount is executed.

Fix this by modifying disk_should_remove_zone_wplug() to check that the
reference count to a zone write plug is not larger than 2, that is, that
the only references left on the zone are the caller held reference
(blk_zone_write_plug_complete_request()) and the initial extra reference
for the zone write plug taken when it was initialized (and that is
dropped when the zone write plug is removed from the hash table).

To be consistent with this change, make sure to drop the request or BIO
held reference to the zone write plug before calling
disk_zone_wplug_unplug_bio(). All references are also dropped using
disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone
write plug is freed if it needs to be.

Comments are also improved to clarify zone write plugs reference
handling.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c | 39 +++++++++++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 2f61ba56dad2..1e5f362f0409 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -520,10 +520,28 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
 static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
 						 struct blk_zone_wplug *zwplug)
 {
-	/* If the zone is still busy, the plug cannot be removed. */
+	/* If the zone write plug was already removed, we are done. */
+	if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
+		return false;
+
+	/* If the zone write plug is still busy, it cannot be removed. */
 	if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
 		return false;
 
+	/*
+	 * Completions of BIOs with blk_zone_write_plug_bio_endio() may
+	 * happen after handling a request completion with
+	 * blk_zone_write_plug_complete_request() (e.g. with split BIOs
+	 * that are chained). In such case, disk_zone_wplug_unplug_bio()
+	 * should not attempt to remove the zone write plug until all BIO
+	 * completions are seen. Check by looking at the zone write plug
+	 * reference count, which is 2 when the plug is unused (one reference
+	 * taken when the plug was allocated and another reference taken by the
+	 * caller context).
+	 */
+	if (atomic_read(&zwplug->ref) > 2)
+		return false;
+
 	/* We can remove zone write plugs for zones that are empty or full. */
 	return !zwplug->wp_offset || zwplug->wp_offset >= disk->zone_capacity;
 }
@@ -893,8 +911,9 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 	struct bio *bio;
 
 	/*
-	 * Completion of this request needs to be handled with
-	 * blk_zone_write_plug_complete_request().
+	 * Indicate that completion of this request needs to be handled with
+	 * blk_zone_write_plug_complete_request(), which will drop the reference
+	 * on the zone write plug we took above on entry to this function.
 	 */
 	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
 
@@ -1223,6 +1242,9 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 		spin_unlock_irqrestore(&zwplug->lock, flags);
 	}
 
+	/* Drop the reference we took when the BIO was issued. */
+	disk_put_zone_wplug(zwplug);
+
 	/*
 	 * For BIO-based devices, blk_zone_write_plug_complete_request()
 	 * is not called. So we need to schedule execution of the next
@@ -1231,8 +1253,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 	if (bio->bi_bdev->bd_has_submit_bio)
 		disk_zone_wplug_unplug_bio(disk, zwplug);
 
-	/* Drop the reference we took when the BIO was issued. */
-	atomic_dec(&zwplug->ref);
+	/* Drop the reference we took when entering this function. */
 	disk_put_zone_wplug(zwplug);
 }
 
@@ -1246,13 +1267,15 @@ void blk_zone_write_plug_complete_request(struct request *req)
 
 	req->rq_flags &= ~RQF_ZONE_WRITE_PLUGGING;
 
-	disk_zone_wplug_unplug_bio(disk, zwplug);
-
 	/*
 	 * Drop the reference we took when the request was initialized in
 	 * blk_zone_write_plug_attempt_merge().
 	 */
-	atomic_dec(&zwplug->ref);
+	disk_put_zone_wplug(zwplug);
+
+	disk_zone_wplug_unplug_bio(disk, zwplug);
+
+	/* Drop the reference we took when entering this function. */
 	disk_put_zone_wplug(zwplug);
 }
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 08/14] block: Fix flush request sector restore
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (6 preceding siblings ...)
  2024-05-01 11:09 ` [PATCH v3 07/14] block: Do not remove zone write plugs still in use Damien Le Moal
@ 2024-05-01 11:09 ` Damien Le Moal
  2024-05-02  6:10   ` Hannes Reinecke
  2024-05-01 11:09 ` [PATCH v3 09/14] block: Fix handling of non-empty flush write requests to zones Damien Le Moal
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:09 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Make sure that a request bio is not NULL before trying to restore the
request start sector.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 6f8fd758de63 ("block: Restore sector of flush requests")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-flush.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 2f58ae018464..c17cf8ed8113 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -130,7 +130,8 @@ static void blk_flush_restore_request(struct request *rq)
 	 * original @rq->bio.  Restore it.
 	 */
 	rq->bio = rq->biotail;
-	rq->__sector = rq->bio->bi_iter.bi_sector;
+	if (rq->bio)
+		rq->__sector = rq->bio->bi_iter.bi_sector;
 
 	/* make @rq a normal request */
 	rq->rq_flags &= ~RQF_FLUSH_SEQ;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 09/14] block: Fix handling of non-empty flush write requests to zones
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (7 preceding siblings ...)
  2024-05-01 11:09 ` [PATCH v3 08/14] block: Fix flush request sector restore Damien Le Moal
@ 2024-05-01 11:09 ` Damien Le Moal
  2024-05-02  6:11   ` Hannes Reinecke
  2024-05-01 11:09 ` [PATCH v3 10/14] block: Improve blk_zone_write_plug_bio_merged() Damien Le Moal
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:09 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Zone write plugging ignores empty (no data) flush operations but handles
flush BIOs that have data to ensure that the flush machinery generated
write is processed in order. However, the call to
blk_zone_write_plug_attempt_merge() which sets a request
RQF_ZONE_WRITE_PLUGGING flag is called after blk_insert_flush(), thus
missing indicating that a non empty flush request completion needs
handling by zone write plugging.

Fix this by moving the call to blk_zone_write_plug_attempt_merge()
before blk_insert_flush(). And while at it, rename that function as
blk_zone_write_plug_init_request() to be clear that it is not just about
merging plugged BIOs in the request. While at it, also add a WARN_ONCE()
check that the zone write plug for the request is not NULL.

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-mq.c    |  6 +++---
 block/blk-zoned.c | 12 ++++++++----
 block/blk.h       |  4 ++--
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 434d45219e23..0fae9bd0ecd4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3001,12 +3001,12 @@ void blk_mq_submit_bio(struct bio *bio)
 		return;
 	}
 
+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_init_request(rq);
+
 	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
 		return;
 
-	if (bio_zone_write_plugging(bio))
-		blk_zone_write_plug_attempt_merge(rq);
-
 	if (plug) {
 		blk_add_rq_to_plug(plug, rq);
 		return;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1e5f362f0409..cd0049f5bf2f 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -874,8 +874,9 @@ void blk_zone_write_plug_bio_merged(struct bio *bio)
 
 	/*
 	 * If the BIO was already plugged, then we were called through
-	 * blk_zone_write_plug_attempt_merge() -> blk_attempt_bio_merge().
-	 * For this case, blk_zone_write_plug_attempt_merge() will handle the
+	 * blk_zone_write_plug_init_request() -> blk_attempt_bio_merge().
+	 * For this case, we already hold a reference on the zone write plug for
+	 * the BIO and blk_zone_write_plug_init_request() will handle the
 	 * zone write pointer offset update.
 	 */
 	if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
@@ -899,7 +900,7 @@ void blk_zone_write_plug_bio_merged(struct bio *bio)
  * already went through zone write plugging (either a new BIO or one that was
  * unplugged).
  */
-void blk_zone_write_plug_attempt_merge(struct request *req)
+void blk_zone_write_plug_init_request(struct request *req)
 {
 	sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
 	struct request_queue *q = req->q;
@@ -910,6 +911,9 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
 	unsigned long flags;
 	struct bio *bio;
 
+	if (WARN_ON_ONCE(!zwplug))
+		return;
+
 	/*
 	 * Indicate that completion of this request needs to be handled with
 	 * blk_zone_write_plug_complete_request(), which will drop the reference
@@ -1269,7 +1273,7 @@ void blk_zone_write_plug_complete_request(struct request *req)
 
 	/*
 	 * Drop the reference we took when the request was initialized in
-	 * blk_zone_write_plug_attempt_merge().
+	 * blk_zone_write_plug_init_request().
 	 */
 	disk_put_zone_wplug(zwplug);
 
diff --git a/block/blk.h b/block/blk.h
index 1140c4a0be03..8a62b861453c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -427,7 +427,7 @@ static inline bool bio_is_zone_append(struct bio *bio)
 		bio_flagged(bio, BIO_EMULATES_ZONE_APPEND);
 }
 void blk_zone_write_plug_bio_merged(struct bio *bio);
-void blk_zone_write_plug_attempt_merge(struct request *rq);
+void blk_zone_write_plug_init_request(struct request *rq);
 static inline void blk_zone_update_request_bio(struct request *rq,
 					       struct bio *bio)
 {
@@ -481,7 +481,7 @@ static inline bool bio_is_zone_append(struct bio *bio)
 static inline void blk_zone_write_plug_bio_merged(struct bio *bio)
 {
 }
-static inline void blk_zone_write_plug_attempt_merge(struct request *rq)
+static inline void blk_zone_write_plug_init_request(struct request *rq)
 {
 }
 static inline void blk_zone_update_request_bio(struct request *rq,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 10/14] block: Improve blk_zone_write_plug_bio_merged()
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (8 preceding siblings ...)
  2024-05-01 11:09 ` [PATCH v3 09/14] block: Fix handling of non-empty flush write requests to zones Damien Le Moal
@ 2024-05-01 11:09 ` Damien Le Moal
  2024-05-02  6:13   ` Hannes Reinecke
  2024-05-01 11:09 ` [PATCH v3 11/14] block: Improve zone write request completion handling Damien Le Moal
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:09 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Improve blk_zone_write_plug_bio_merged() to check that we succefully get
a reference on the zone write plug of the merged BIO, as expected since
for a merge we already have at least one request and one BIO referencing
the zone write plug. Comments in this function are also improved to
better explain the references to the BIO zone write plug.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index cd0049f5bf2f..1890b6d55d8b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -885,11 +885,16 @@ void blk_zone_write_plug_bio_merged(struct bio *bio)
 	bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
 
 	/*
-	 * Increase the plug reference count and advance the zone write
-	 * pointer offset.
+	 * Get a reference on the zone write plug of the target zone and advance
+	 * the zone write pointer offset. Given that this is a merge, we already
+	 * have at least one request and one BIO referencing the zone write
+	 * plug. So this should not fail.
 	 */
 	zwplug = disk_get_zone_wplug(bio->bi_bdev->bd_disk,
 				     bio->bi_iter.bi_sector);
+	if (WARN_ON_ONCE(!zwplug))
+		return;
+
 	spin_lock_irqsave(&zwplug->lock, flags);
 	zwplug->wp_offset += bio_sectors(bio);
 	spin_unlock_irqrestore(&zwplug->lock, flags);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 11/14] block: Improve zone write request completion handling
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (9 preceding siblings ...)
  2024-05-01 11:09 ` [PATCH v3 10/14] block: Improve blk_zone_write_plug_bio_merged() Damien Le Moal
@ 2024-05-01 11:09 ` Damien Le Moal
  2024-05-02  6:14   ` Hannes Reinecke
  2024-05-01 11:09 ` [PATCH v3 12/14] block: Simplify blk_zone_write_plug_bio_endio() Damien Le Moal
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:09 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

blk_zone_complete_request() must be called to handle the completion of a
zone write request handled with zone write plugging. This function is
called from blk_complete_request(), blk_update_request() and also in
blk_mq_submit_bio() error path. Improve this by moving this function
call into blk_mq_finish_request() as all requests are processed with
this function when they complete as well as when they are freed without
being executed. This also improves blk_update_request() used by scsi
devices as these may repeatedly call this function to handle partial
completions.

To be consistent with this change, blk_zone_complete_request() is
renamed to blk_zone_finish_request() and
blk_zone_write_plug_complete_request() is renamed to
blk_zone_write_plug_finish_request().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-mq.c    |  6 ++----
 block/blk-zoned.c | 11 ++++++-----
 block/blk.h       |  8 ++++----
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0fae9bd0ecd4..9f677ea85a52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -691,6 +691,8 @@ static void blk_mq_finish_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	blk_zone_finish_request(rq);
+
 	if (rq->rq_flags & RQF_USE_SCHED) {
 		q->elevator->type->ops.finish_request(rq);
 		/*
@@ -828,8 +830,6 @@ static void blk_complete_request(struct request *req)
 		bio = next;
 	} while (bio);
 
-	blk_zone_complete_request(req);
-
 	/*
 	 * Reset counters so that the request stacking driver
 	 * can find how many bytes remain in the request
@@ -940,7 +940,6 @@ bool blk_update_request(struct request *req, blk_status_t error,
 	 * completely done
 	 */
 	if (!req->bio) {
-		blk_zone_complete_request(req);
 		/*
 		 * Reset counters so that the request stacking driver
 		 * can find how many bytes remain in the request
@@ -2996,7 +2995,6 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (ret != BLK_STS_OK) {
 		bio->bi_status = ret;
 		bio_endio(bio);
-		blk_zone_complete_request(rq);
 		blk_mq_free_request(rq);
 		return;
 	}
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1890b6d55d8b..759e85e9167c 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -531,7 +531,7 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
 	/*
 	 * Completions of BIOs with blk_zone_write_plug_bio_endio() may
 	 * happen after handling a request completion with
-	 * blk_zone_write_plug_complete_request() (e.g. with split BIOs
+	 * blk_zone_write_plug_finish_request() (e.g. with split BIOs
 	 * that are chained). In such case, disk_zone_wplug_unplug_bio()
 	 * should not attempt to remove the zone write plug until all BIO
 	 * completions are seen. Check by looking at the zone write plug
@@ -921,7 +921,7 @@ void blk_zone_write_plug_init_request(struct request *req)
 
 	/*
 	 * Indicate that completion of this request needs to be handled with
-	 * blk_zone_write_plug_complete_request(), which will drop the reference
+	 * blk_zone_write_plug_finish_request(), which will drop the reference
 	 * on the zone write plug we took above on entry to this function.
 	 */
 	req->rq_flags |= RQF_ZONE_WRITE_PLUGGING;
@@ -1255,7 +1255,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 	disk_put_zone_wplug(zwplug);
 
 	/*
-	 * For BIO-based devices, blk_zone_write_plug_complete_request()
+	 * For BIO-based devices, blk_zone_write_plug_finish_request()
 	 * is not called. So we need to schedule execution of the next
 	 * plugged BIO here.
 	 */
@@ -1266,11 +1266,12 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 	disk_put_zone_wplug(zwplug);
 }
 
-void blk_zone_write_plug_complete_request(struct request *req)
+void blk_zone_write_plug_finish_request(struct request *req)
 {
 	struct gendisk *disk = req->q->disk;
-	struct blk_zone_wplug *zwplug = disk_get_zone_wplug(disk, req->__sector);
+	struct blk_zone_wplug *zwplug;
 
+	zwplug = disk_get_zone_wplug(disk, req->__sector);
 	if (WARN_ON_ONCE(!zwplug))
 		return;
 
diff --git a/block/blk.h b/block/blk.h
index 8a62b861453c..ee4f782d1496 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -453,11 +453,11 @@ static inline void blk_zone_bio_endio(struct bio *bio)
 		blk_zone_write_plug_bio_endio(bio);
 }
 
-void blk_zone_write_plug_complete_request(struct request *rq);
-static inline void blk_zone_complete_request(struct request *rq)
+void blk_zone_write_plug_finish_request(struct request *rq);
+static inline void blk_zone_finish_request(struct request *rq)
 {
 	if (rq->rq_flags & RQF_ZONE_WRITE_PLUGGING)
-		blk_zone_write_plug_complete_request(rq);
+		blk_zone_write_plug_finish_request(rq);
 }
 int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 		unsigned long arg);
@@ -491,7 +491,7 @@ static inline void blk_zone_update_request_bio(struct request *rq,
 static inline void blk_zone_bio_endio(struct bio *bio)
 {
 }
-static inline void blk_zone_complete_request(struct request *rq)
+static inline void blk_zone_finish_request(struct request *rq)
 {
 }
 static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 12/14] block: Simplify blk_zone_write_plug_bio_endio()
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (10 preceding siblings ...)
  2024-05-01 11:09 ` [PATCH v3 11/14] block: Improve zone write request completion handling Damien Le Moal
@ 2024-05-01 11:09 ` Damien Le Moal
  2024-05-02  6:15   ` Hannes Reinecke
  2024-05-01 11:09 ` [PATCH v3 13/14] block: Simplify zone write plug BIO abort Damien Le Moal
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:09 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

We already have the disk variable obtained from the bio when calling
disk_get_zone_wplug(). So use that variable instead of dereferencing the
bio bdev again for the disk argument of disk_get_zone_wplug().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 759e85e9167c..132eb988f4d7 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1222,8 +1222,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
 {
 	struct gendisk *disk = bio->bi_bdev->bd_disk;
 	struct blk_zone_wplug *zwplug =
-		disk_get_zone_wplug(bio->bi_bdev->bd_disk,
-				    bio->bi_iter.bi_sector);
+		disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
 	unsigned long flags;
 
 	if (WARN_ON_ONCE(!zwplug))
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 13/14] block: Simplify zone write plug BIO abort
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (11 preceding siblings ...)
  2024-05-01 11:09 ` [PATCH v3 12/14] block: Simplify blk_zone_write_plug_bio_endio() Damien Le Moal
@ 2024-05-01 11:09 ` Damien Le Moal
  2024-05-02  6:15   ` Hannes Reinecke
  2024-05-01 11:09 ` [PATCH v3 14/14] block: Cleanup blk_revalidate_zone_cb() Damien Le Moal
  2024-05-01 14:08 ` [PATCH v3 00/14] Zone write plugging fixes and cleanup Jens Axboe
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:09 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

When BIOs plugged in a zone write plug are aborted,
blk_zone_wplug_bio_io_error() clears the BIO BIO_ZONE_WRITE_PLUGGING
flag so that bio_io_error(bio) does not end up calling
blk_zone_write_plug_bio_endio() and we thus need to manually drop the
reference on the zone write plug held by the aborted BIO.

Move the call to disk_put_zone_wplug() that is alwasy following the call
to blk_zone_wplug_bio_io_error() inside that function to simplify the
code.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/blk-zoned.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 132eb988f4d7..15e4e14e16f7 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -634,12 +634,14 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
 	return zwplug;
 }
 
-static inline void blk_zone_wplug_bio_io_error(struct bio *bio)
+static inline void blk_zone_wplug_bio_io_error(struct blk_zone_wplug *zwplug,
+					       struct bio *bio)
 {
-	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+	struct request_queue *q = zwplug->disk->queue;
 
 	bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
 	bio_io_error(bio);
+	disk_put_zone_wplug(zwplug);
 	blk_queue_exit(q);
 }
 
@@ -650,10 +652,8 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
 {
 	struct bio *bio;
 
-	while ((bio = bio_list_pop(&zwplug->bio_list))) {
-		blk_zone_wplug_bio_io_error(bio);
-		disk_put_zone_wplug(zwplug);
-	}
+	while ((bio = bio_list_pop(&zwplug->bio_list)))
+		blk_zone_wplug_bio_io_error(zwplug, bio);
 }
 
 /*
@@ -673,8 +673,7 @@ static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
 		if (wp_offset >= zone_capacity ||
 		    (bio_op(bio) != REQ_OP_ZONE_APPEND &&
 		     bio_offset_from_zone_start(bio) != wp_offset)) {
-			blk_zone_wplug_bio_io_error(bio);
-			disk_put_zone_wplug(zwplug);
+			blk_zone_wplug_bio_io_error(zwplug, bio);
 			continue;
 		}
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v3 14/14] block: Cleanup blk_revalidate_zone_cb()
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (12 preceding siblings ...)
  2024-05-01 11:09 ` [PATCH v3 13/14] block: Simplify zone write plug BIO abort Damien Le Moal
@ 2024-05-01 11:09 ` Damien Le Moal
  2024-05-02  6:17   ` Hannes Reinecke
  2024-05-01 14:08 ` [PATCH v3 00/14] Zone write plugging fixes and cleanup Jens Axboe
  14 siblings, 1 reply; 31+ messages in thread
From: Damien Le Moal @ 2024-05-01 11:09 UTC (permalink / raw)
  To: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Define the code for checking conventional and sequential write required
zones suing the functions blk_revalidate_conv_zone() and
blk_revalidate_seq_zone() respectively. This simplifies the zone type
switch-case in blk_revalidate_zone_cb().

No functional changes.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c | 129 +++++++++++++++++++++++++++-------------------
 1 file changed, 77 insertions(+), 52 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 15e4e14e16f7..48e5e3bbb89c 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1656,6 +1656,74 @@ static int disk_update_zone_resources(struct gendisk *disk,
 	return queue_limits_commit_update(q, &lim);
 }
 
+static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
+				    struct blk_revalidate_zone_args *args)
+{
+	struct gendisk *disk = args->disk;
+	struct request_queue *q = disk->queue;
+
+	if (zone->capacity != zone->len) {
+		pr_warn("%s: Invalid conventional zone capacity\n",
+			disk->disk_name);
+		return -ENODEV;
+	}
+
+	if (!disk_need_zone_resources(disk))
+		return 0;
+
+	if (!args->conv_zones_bitmap) {
+		args->conv_zones_bitmap =
+			blk_alloc_zone_bitmap(q->node, args->nr_zones);
+		if (!args->conv_zones_bitmap)
+			return -ENOMEM;
+	}
+
+	set_bit(idx, args->conv_zones_bitmap);
+
+	return 0;
+}
+
+static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
+				   struct blk_revalidate_zone_args *args)
+{
+	struct gendisk *disk = args->disk;
+	struct blk_zone_wplug *zwplug;
+	unsigned int wp_offset;
+	unsigned long flags;
+
+	/*
+	 * Remember the capacity of the first sequential zone and check
+	 * if it is constant for all zones.
+	 */
+	if (!args->zone_capacity)
+		args->zone_capacity = zone->capacity;
+	if (zone->capacity != args->zone_capacity) {
+		pr_warn("%s: Invalid variable zone capacity\n",
+			disk->disk_name);
+		return -ENODEV;
+	}
+
+	/*
+	 * We need to track the write pointer of all zones that are not
+	 * empty nor full. So make sure we have a zone write plug for
+	 * such zone if the device has a zone write plug hash table.
+	 */
+	if (!disk->zone_wplugs_hash)
+		return 0;
+
+	wp_offset = blk_zone_wp_offset(zone);
+	if (!wp_offset || wp_offset >= zone->capacity)
+		return 0;
+
+	zwplug = disk_get_and_lock_zone_wplug(disk, zone->wp, GFP_NOIO, &flags);
+	if (!zwplug)
+		return -ENOMEM;
+	spin_unlock_irqrestore(&zwplug->lock, flags);
+	disk_put_zone_wplug(zwplug);
+
+	return 0;
+}
+
 /*
  * Helper function to check the validity of zones of a zoned block device.
  */
@@ -1664,12 +1732,9 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 {
 	struct blk_revalidate_zone_args *args = data;
 	struct gendisk *disk = args->disk;
-	struct request_queue *q = disk->queue;
 	sector_t capacity = get_capacity(disk);
-	sector_t zone_sectors = q->limits.chunk_sectors;
-	struct blk_zone_wplug *zwplug;
-	unsigned long flags;
-	unsigned int wp_offset;
+	sector_t zone_sectors = disk->queue->limits.chunk_sectors;
+	int ret;
 
 	/* Check for bad zones and holes in the zone report */
 	if (zone->start != args->sector) {
@@ -1709,62 +1774,22 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	/* Check zone type */
 	switch (zone->type) {
 	case BLK_ZONE_TYPE_CONVENTIONAL:
-		if (zone->capacity != zone->len) {
-			pr_warn("%s: Invalid conventional zone capacity\n",
-				disk->disk_name);
-			return -ENODEV;
-		}
-
-		if (!disk_need_zone_resources(disk))
-			break;
-		if (!args->conv_zones_bitmap) {
-			args->conv_zones_bitmap =
-				blk_alloc_zone_bitmap(q->node, args->nr_zones);
-			if (!args->conv_zones_bitmap)
-				return -ENOMEM;
-		}
-		set_bit(idx, args->conv_zones_bitmap);
+		ret = blk_revalidate_conv_zone(zone, idx, args);
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
-		/*
-		 * Remember the capacity of the first sequential zone and check
-		 * if it is constant for all zones.
-		 */
-		if (!args->zone_capacity)
-			args->zone_capacity = zone->capacity;
-		if (zone->capacity != args->zone_capacity) {
-			pr_warn("%s: Invalid variable zone capacity\n",
-				disk->disk_name);
-			return -ENODEV;
-		}
-
-		/*
-		 * We need to track the write pointer of all zones that are not
-		 * empty nor full. So make sure we have a zone write plug for
-		 * such zone if the device has a zone write plug hash table.
-		 */
-		if (!disk->zone_wplugs_hash)
-			break;
-		wp_offset = blk_zone_wp_offset(zone);
-		if (wp_offset && wp_offset < zone->capacity) {
-			zwplug = disk_get_and_lock_zone_wplug(disk, zone->wp,
-							      GFP_NOIO, &flags);
-			if (!zwplug)
-				return -ENOMEM;
-			spin_unlock_irqrestore(&zwplug->lock, flags);
-			disk_put_zone_wplug(zwplug);
-		}
-
+		ret = blk_revalidate_seq_zone(zone, idx, args);
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
 	default:
 		pr_warn("%s: Invalid zone type 0x%x at sectors %llu\n",
 			disk->disk_name, (int)zone->type, zone->start);
-		return -ENODEV;
+		ret = -ENODEV;
 	}
 
-	args->sector += zone->len;
-	return 0;
+	if (!ret)
+		args->sector += zone->len;
+
+	return ret;
 }
 
 /**
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 00/14] Zone write plugging fixes and cleanup
  2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
                   ` (13 preceding siblings ...)
  2024-05-01 11:09 ` [PATCH v3 14/14] block: Cleanup blk_revalidate_zone_cb() Damien Le Moal
@ 2024-05-01 14:08 ` Jens Axboe
  14 siblings, 0 replies; 31+ messages in thread
From: Jens Axboe @ 2024-05-01 14:08 UTC (permalink / raw)
  To: linux-block, dm-devel, Mike Snitzer, Damien Le Moal


On Wed, 01 May 2024 20:08:53 +0900, Damien Le Moal wrote:
> Jens, Mike,
> 
> With more testing of zone write plugging on more device setups,
> including weird/test setups (with scsi debug and null_blk), several
> issues were identified. This patch series addresses them and cleanup the
> code a little to try to make it more obvious.
> 
> [...]

Applied, thanks!

[01/14] dm: Check that a zoned table leads to a valid mapped device
        commit: 44cccb3027d4719c9229203233250d73d3192bf9
[02/14] block: Exclude conventional zones when faking max open limit
        commit: 6b7593b5fb9eb73be92f78a1abfa502f05ff5e15
[03/14] block: Fix zone write plug initialization from blk_revalidate_zone_cb()
        commit: 74b7ae5f48e6f9518a32f50926619eba54be44de
[04/14] block: Fix reference counting for zone write plugs in error state
        commit: 19aad274c22b96fc4c0113d87cc8a083c87c467e
[05/14] block: Hold a reference on zone write plugs to schedule submission
        commit: 9e78c38ab30b14c1d6a07c61d57ac5e2f12fa568
[06/14] block: Unhash a zone write plug only if needed
        commit: 79ae35a4233df5909f8bea0b64eadbebde870de2
[07/14] block: Do not remove zone write plugs still in use
        commit: 7b295187287e0006dd1b0b95f995f00878e436c5
[08/14] block: Fix flush request sector restore
        commit: af147b740f111730c2e387ee6c0ac3ada7d51117
[09/14] block: Fix handling of non-empty flush write requests to zones
        commit: 096bc7ea335bc5dfbaed1d005ff27f008ec9d710
[10/14] block: Improve blk_zone_write_plug_bio_merged()
        commit: c4c3ffdab2e26780f6f7c9959a473b2c652f4d13
[11/14] block: Improve zone write request completion handling
        commit: 347bde9da10f410b8134a82d6096105cad44e1c1
[12/14] block: Simplify blk_zone_write_plug_bio_endio()
        commit: b5a64ec2ea2be2a7f7eb73c243c2381e9fc1c71b
[13/14] block: Simplify zone write plug BIO abort
        commit: c9c8aea03c4ac2ea902bc7dd5ba14f5d78af8ece
[14/14] block: Cleanup blk_revalidate_zone_cb()
        commit: d7580149efc5c86c4e72f9263b97c062616a84dd

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 07/14] block: Do not remove zone write plugs still in use
  2024-05-01 11:09 ` [PATCH v3 07/14] block: Do not remove zone write plugs still in use Damien Le Moal
@ 2024-05-02  5:38   ` Christoph Hellwig
  2024-05-02  6:09   ` Hannes Reinecke
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2024-05-02  5:38 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 02/14] block: Exclude conventional zones when faking max open limit
  2024-05-01 11:08 ` [PATCH v3 02/14] block: Exclude conventional zones when faking max open limit Damien Le Moal
@ 2024-05-02  5:52   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  5:52 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:08, Damien Le Moal wrote:
> For a device that has no limits for the maximum number of open and
> active zones, we default to using the number of zones, limited to
> BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE (128), for the maximum number of open
> zones indicated to the user. However, for a device that has conventional
> zones and less zones than BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, we should
> not account conventional zones and set the limit to the number of
> sequential write required zones. Furthermore, for cases where the limit
> is equal to the number of sequential write required zones, we can
> advertize a limit of 0 to indicate "no limits".
> 
> Fix this by moving the zone write plug mempool resizing from
> disk_revalidate_zone_resources() to disk_update_zone_resources() where
> we can safely compute the number of conventional zones and update the
> limits.
> 
> Fixes: 843283e96e5a ("block: Fake max open zones limit when there is no limit")
> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-zoned.c | 38 ++++++++++++++++++++++++++++----------
>   1 file changed, 28 insertions(+), 10 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 03/14] block: Fix zone write plug initialization from blk_revalidate_zone_cb()
  2024-05-01 11:08 ` [PATCH v3 03/14] block: Fix zone write plug initialization from blk_revalidate_zone_cb() Damien Le Moal
@ 2024-05-02  5:53   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  5:53 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:08, Damien Le Moal wrote:
> When revalidating the zones of a zoned block device,
> blk_revalidate_zone_cb() must allocate a zone write plug for any
> sequential write required zone that is not empty nor full. However, the
> current code tests the latter case by comparing the zone write pointer
> offset to the zone size instead of the zone capacity. Furthermore,
> disk_get_and_lock_zone_wplug() is called with a sector argument equal to
> the zone start instead of the current zone write pointer position.
> This commit fixes both issues by calling disk_get_and_lock_zone_wplug()
> for a zone that is not empty and with a write pointer offset lower than
> the zone capacity and use the zone capacity sector as the sector
> argument for disk_get_and_lock_zone_wplug().
> 
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-zoned.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 04/14] block: Fix reference counting for zone write plugs in error state
  2024-05-01 11:08 ` [PATCH v3 04/14] block: Fix reference counting for zone write plugs in error state Damien Le Moal
@ 2024-05-02  6:01   ` Hannes Reinecke
  2024-05-02  9:37     ` Damien Le Moal
  0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:01 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:08, Damien Le Moal wrote:
> When zone is reset or finished, disk_zone_wplug_set_wp_offset() is
> called to update the zone write plug write pointer offset and to clear
> the zone error state (BLK_ZONE_WPLUG_ERROR flag) if it is set.
> However, this processing is missing dropping the reference to the zone
> write plug that was taken in disk_zone_wplug_set_error() when the error
> flag was first set. Furthermore, the error state handling must release
> the zone write plug lock to first execute a report zones command. When
> the report zone races with a reset or finish operation that clears the
> error, we can end up decrementing the zone write plug reference count
> twice: once in disk_zone_wplug_set_wp_offset() for the reset/finish
> operation and one more time in disk_zone_wplugs_work() once
> disk_zone_wplug_handle_error() completes.
> 
> Fix this by introducing disk_zone_wplug_clear_error() as the symmetric
> function of disk_zone_wplug_set_error(). disk_zone_wplug_clear_error()
> decrements the zone write plug reference count obtained in
> disk_zone_wplug_set_error() only if the error handling has not started
> yet, that is, only if disk_zone_wplugs_work() has not yet taken the zone
> write plug off the error list. This ensure that either
> disk_zone_wplug_clear_error() or disk_zone_wplugs_work() drop the zone
> write plug reference count.
> 
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-zoned.c | 75 +++++++++++++++++++++++++++++++----------------
>   1 file changed, 49 insertions(+), 26 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 7824bd52c82c..23ad1de0da62 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -658,6 +658,54 @@ static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
>   	bio_list_merge(&zwplug->bio_list, &bl);
>   }
>   
> +static inline void disk_zone_wplug_set_error(struct gendisk *disk,
> +					     struct blk_zone_wplug *zwplug)
> +{
> +	unsigned long flags;
> +
> +	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
> +		return;
> +

I still get nervous when I see an unprotected flag being set.
Especially in code which is known to race with error handling.
Wouldn't it be better to check the flag under the lock or at
least use 'test_and_set_bit' here?

> +	/*
> +	 * At this point, we already have a reference on the zone write plug.
> +	 * However, since we are going to add the plug to the disk zone write
> +	 * plugs work list, increase its reference count. This reference will
> +	 * be dropped in disk_zone_wplugs_work() once the error state is
> +	 * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
> +	 * finished.
> +	 */
> +	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;

And that is even worse. We might have been interrupted between these
two lines, invalidating the first check.

Please consider using 'test_and_set_bit()' here.

> +	atomic_inc(&zwplug->ref);
> +
> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> +	list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +}
> +
> +static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
> +					       struct blk_zone_wplug *zwplug)
> +{
> +	unsigned long flags;
> +
> +	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
> +		return;
> +
> +	/*
> +	 * We are racing with the error handling work which drops the reference
> +	 * on the zone write plug after handling the error state. So remove the
> +	 * plug from the error list and drop its reference count only if the
> +	 * error handling has not yet started, that is, if the zone write plug
> +	 * is still listed.
> +	 */
> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> +	if (!list_empty(&zwplug->link)) {
> +		list_del_init(&zwplug->link);
> +		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
> +		disk_put_zone_wplug(zwplug);
> +	}
> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +}
> +

Similar comments to above: you are clearing the flag under the lock,
but don't check or set under the lock...

>   /*
>    * Set a zone write plug write pointer offset to either 0 (zone reset case)
>    * or to the zone size (zone finish case). This aborts all plugged BIOs, which
> @@ -691,12 +739,7 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
>   	 * in a good state. So clear the error flag and decrement the
>   	 * error count if we were in error state.
>   	 */
> -	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
> -		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
> -		spin_lock(&disk->zone_wplugs_lock);
> -		list_del_init(&zwplug->link);
> -		spin_unlock(&disk->zone_wplugs_lock);
> -	}
> +	disk_zone_wplug_clear_error(disk, zwplug);
>   
>   	/*
>   	 * The zone write plug now has no BIO plugged: remove it from the
> @@ -885,26 +928,6 @@ void blk_zone_write_plug_attempt_merge(struct request *req)
>   	spin_unlock_irqrestore(&zwplug->lock, flags);
>   }
>   
> -static inline void disk_zone_wplug_set_error(struct gendisk *disk,
> -					     struct blk_zone_wplug *zwplug)
> -{
> -	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR)) {
> -		unsigned long flags;
> -
> -		/*
> -		 * Increase the plug reference count. The reference will be
> -		 * dropped in disk_zone_wplugs_work() once the error state
> -		 * is handled.
> -		 */
> -		zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
> -		atomic_inc(&zwplug->ref);
> -
> -		spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> -		list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
> -		spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> -	}
> -}
> -
>   /*
>    * Check and prepare a BIO for submission by incrementing the write pointer
>    * offset of its zone write plug and changing zone append operations into

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 05/14] block: Hold a reference on zone write plugs to schedule submission
  2024-05-01 11:08 ` [PATCH v3 05/14] block: Hold a reference on zone write plugs to schedule submission Damien Le Moal
@ 2024-05-02  6:04   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:04 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:08, Damien Le Moal wrote:
> Since a zone write plug BIO work is a field of struct blk_zone_wplug, we
> must ensure that a zone write plug is never freed when its BIO
> submission work is queued or running. Do this by holding a reference on
> the zone write plug when the submission work is scheduled for execution
> with queue_work() and releasing the reference at the end of the
> execution of the work function blk_zone_wplug_bio_work().
> The helper function disk_zone_wplug_schedule_bio_work() is introduced to
> get a reference on a zone write plug and queue its work. This helper is
> used in disk_zone_wplug_unplug_bio() and disk_zone_wplug_handle_error().
> 
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-zoned.c | 26 +++++++++++++++++++++-----
>   1 file changed, 21 insertions(+), 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/14] block: Unhash a zone write plug only if needed
  2024-05-01 11:08 ` [PATCH v3 06/14] block: Unhash a zone write plug only if needed Damien Le Moal
@ 2024-05-02  6:05   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:05 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:08, Damien Le Moal wrote:
> Fix disk_remove_zone_wplug() to ensure that a zone write plug already
> removed from a disk hash table of zone write plugs is not removed
> again. Do this by checking the BLK_ZONE_WPLUG_UNHASHED flag of the plug
> and calling hlist_del_init_rcu() only if the flag is not set.
> 
> Furthermore, since BIO completions can happen at any time, that is,
> decrementing of the zone write plug reference count can happen at any
> time, make sure to use disk_put_zone_wplug() instead of atomic_dec() to
> ensure that the zone write plug is freed when its last reference is
> dropped. In order to do this, disk_remove_zone_wplug() is moved after
> the definition of disk_put_zone_wplug(). disk_should_remove_zone_wplug()
> is moved as well to keep it together with disk_remove_zone_wplug().
> 
> To be consistent with this change, add a check in disk_put_zone_wplug()
> to ensure that a zone write plug being freed was already removed from
> the disk hash table.
> 
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-zoned.c | 55 +++++++++++++++++++++++++++--------------------
>   1 file changed, 32 insertions(+), 23 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 07/14] block: Do not remove zone write plugs still in use
  2024-05-01 11:09 ` [PATCH v3 07/14] block: Do not remove zone write plugs still in use Damien Le Moal
  2024-05-02  5:38   ` Christoph Hellwig
@ 2024-05-02  6:09   ` Hannes Reinecke
  1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:09 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:09, Damien Le Moal wrote:
> Large write BIOs that span a zone boundary are split in
> blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
> write plugging. Such split BIO will be chained with one fragment
> targeting one zone and the remainder of the BIO targeting the next
> zone. The two BIOs can be executed in parallel, without a predetermine
> order relative to eachother and their completion may be reversed: the

each other

> remainder first completing and the first fragment then completing. In
> such case, bio_endio() will not immediately execute
> blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the
> split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for
> the parent BIO will be executed only once the first fragment completes.
> 
> In the case of a device with small zones and very large BIOs, uch

such

> completion pattern can lead to disk_should_remove_zone_wplug() to return
> true for the zone of the parent BIO when the parent BIO request
> completes and blk_zone_write_plug_complete_request() is executed. This
> triggers the removal of the zone write plug from the hash table using
> disk_remove_zone_wplug(). With the zone write plug of the parent BIO
> missing, the call to disk_get_zone_wplug() in
> blk_zone_write_plug_bio_endio() returns NULL and triggers a warning.
> 
> This patterns can be recreated fairly easily using a scsi_debug device
> with small zone and btrfs. E.g.
> 
> modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \
> 	zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4
> mkfs.btrfs -f -O zoned /dev/sda
> mount -t btrfs /dev/sda /mnt
> fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \
> 	--bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \
> 	--runtime=10
> umount /dev/sda
> 
> Will result in the warning:
> 
> [   29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0
> ...
> [   29.058682] Call Trace:
> [   29.059095]  <TASK>
> [   29.059473]  ? __warn+0x80/0x120
> [   29.059983]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
> [   29.060728]  ? report_bug+0x160/0x190
> [   29.061283]  ? handle_bug+0x36/0x70
> [   29.061830]  ? exc_invalid_op+0x17/0x60
> [   29.062399]  ? asm_exc_invalid_op+0x1a/0x20
> [   29.063025]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
> [   29.063760]  bio_endio+0xb7/0x150
> [   29.064280]  btrfs_clone_write_end_io+0x2b/0x60 [btrfs]
> [   29.065049]  blk_update_request+0x17c/0x500
> [   29.065666]  scsi_end_request+0x27/0x1a0 [scsi_mod]
> [   29.066356]  scsi_io_completion+0x5b/0x690 [scsi_mod]
> [   29.067077]  blk_complete_reqs+0x3a/0x50
> [   29.067692]  __do_softirq+0xcf/0x2b3
> [   29.068248]  ? sort_range+0x20/0x20
> [   29.068791]  run_ksoftirqd+0x1c/0x30
> [   29.069339]  smpboot_thread_fn+0xcc/0x1b0
> [   29.069936]  kthread+0xcf/0x100
> [   29.070438]  ? kthread_complete_and_exit+0x20/0x20
> [   29.071314]  ret_from_fork+0x31/0x50
> [   29.071873]  ? kthread_complete_and_exit+0x20/0x20
> [   29.072563]  ret_from_fork_asm+0x11/0x20
> [   29.073146]  </TASK>
> 
> either when fio executes or when unmount is executed.
> 
> Fix this by modifying disk_should_remove_zone_wplug() to check that the
> reference count to a zone write plug is not larger than 2, that is, that
> the only references left on the zone are the caller held reference
> (blk_zone_write_plug_complete_request()) and the initial extra reference
> for the zone write plug taken when it was initialized (and that is
> dropped when the zone write plug is removed from the hash table).
> 
> To be consistent with this change, make sure to drop the request or BIO
> held reference to the zone write plug before calling
> disk_zone_wplug_unplug_bio(). All references are also dropped using
> disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone
> write plug is freed if it needs to be.
> 
> Comments are also improved to clarify zone write plugs reference
> handling.
> 
> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-zoned.c | 39 +++++++++++++++++++++++++++++++--------
>   1 file changed, 31 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 08/14] block: Fix flush request sector restore
  2024-05-01 11:09 ` [PATCH v3 08/14] block: Fix flush request sector restore Damien Le Moal
@ 2024-05-02  6:10   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:10 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:09, Damien Le Moal wrote:
> Make sure that a request bio is not NULL before trying to restore the
> request start sector.
> 
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Fixes: 6f8fd758de63 ("block: Restore sector of flush requests")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-flush.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 2f58ae018464..c17cf8ed8113 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -130,7 +130,8 @@ static void blk_flush_restore_request(struct request *rq)
>   	 * original @rq->bio.  Restore it.
>   	 */
>   	rq->bio = rq->biotail;
> -	rq->__sector = rq->bio->bi_iter.bi_sector;
> +	if (rq->bio)
> +		rq->__sector = rq->bio->bi_iter.bi_sector;
>   
>   	/* make @rq a normal request */
>   	rq->rq_flags &= ~RQF_FLUSH_SEQ;

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 09/14] block: Fix handling of non-empty flush write requests to zones
  2024-05-01 11:09 ` [PATCH v3 09/14] block: Fix handling of non-empty flush write requests to zones Damien Le Moal
@ 2024-05-02  6:11   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:11 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:09, Damien Le Moal wrote:
> Zone write plugging ignores empty (no data) flush operations but handles
> flush BIOs that have data to ensure that the flush machinery generated
> write is processed in order. However, the call to
> blk_zone_write_plug_attempt_merge() which sets a request
> RQF_ZONE_WRITE_PLUGGING flag is called after blk_insert_flush(), thus
> missing indicating that a non empty flush request completion needs
> handling by zone write plugging.
> 
> Fix this by moving the call to blk_zone_write_plug_attempt_merge()
> before blk_insert_flush(). And while at it, rename that function as
> blk_zone_write_plug_init_request() to be clear that it is not just about
> merging plugged BIOs in the request. While at it, also add a WARN_ONCE()
> check that the zone write plug for the request is not NULL.
> 
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-mq.c    |  6 +++---
>   block/blk-zoned.c | 12 ++++++++----
>   block/blk.h       |  4 ++--
>   3 files changed, 13 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 10/14] block: Improve blk_zone_write_plug_bio_merged()
  2024-05-01 11:09 ` [PATCH v3 10/14] block: Improve blk_zone_write_plug_bio_merged() Damien Le Moal
@ 2024-05-02  6:13   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:13 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:09, Damien Le Moal wrote:
> Improve blk_zone_write_plug_bio_merged() to check that we succefully get
> a reference on the zone write plug of the merged BIO, as expected since
> for a merge we already have at least one request and one BIO referencing
> the zone write plug. Comments in this function are also improved to
> better explain the references to the BIO zone write plug.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-zoned.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 11/14] block: Improve zone write request completion handling
  2024-05-01 11:09 ` [PATCH v3 11/14] block: Improve zone write request completion handling Damien Le Moal
@ 2024-05-02  6:14   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:14 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:09, Damien Le Moal wrote:
> blk_zone_complete_request() must be called to handle the completion of a
> zone write request handled with zone write plugging. This function is
> called from blk_complete_request(), blk_update_request() and also in
> blk_mq_submit_bio() error path. Improve this by moving this function
> call into blk_mq_finish_request() as all requests are processed with
> this function when they complete as well as when they are freed without
> being executed. This also improves blk_update_request() used by scsi
> devices as these may repeatedly call this function to handle partial
> completions.
> 
> To be consistent with this change, blk_zone_complete_request() is
> renamed to blk_zone_finish_request() and
> blk_zone_write_plug_complete_request() is renamed to
> blk_zone_write_plug_finish_request().
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-mq.c    |  6 ++----
>   block/blk-zoned.c | 11 ++++++-----
>   block/blk.h       |  8 ++++----
>   3 files changed, 12 insertions(+), 13 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 12/14] block: Simplify blk_zone_write_plug_bio_endio()
  2024-05-01 11:09 ` [PATCH v3 12/14] block: Simplify blk_zone_write_plug_bio_endio() Damien Le Moal
@ 2024-05-02  6:15   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:15 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:09, Damien Le Moal wrote:
> We already have the disk variable obtained from the bio when calling
> disk_get_zone_wplug(). So use that variable instead of dereferencing the
> bio bdev again for the disk argument of disk_get_zone_wplug().
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-zoned.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 13/14] block: Simplify zone write plug BIO abort
  2024-05-01 11:09 ` [PATCH v3 13/14] block: Simplify zone write plug BIO abort Damien Le Moal
@ 2024-05-02  6:15   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:15 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:09, Damien Le Moal wrote:
> When BIOs plugged in a zone write plug are aborted,
> blk_zone_wplug_bio_io_error() clears the BIO BIO_ZONE_WRITE_PLUGGING
> flag so that bio_io_error(bio) does not end up calling
> blk_zone_write_plug_bio_endio() and we thus need to manually drop the
> reference on the zone write plug held by the aborted BIO.
> 
> Move the call to disk_put_zone_wplug() that is alwasy following the call
> to blk_zone_wplug_bio_io_error() inside that function to simplify the
> code.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   block/blk-zoned.c | 15 +++++++--------
>   1 file changed, 7 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 14/14] block: Cleanup blk_revalidate_zone_cb()
  2024-05-01 11:09 ` [PATCH v3 14/14] block: Cleanup blk_revalidate_zone_cb() Damien Le Moal
@ 2024-05-02  6:17   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-05-02  6:17 UTC (permalink / raw)
  To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/1/24 13:09, Damien Le Moal wrote:
> Define the code for checking conventional and sequential write required
> zones suing the functions blk_revalidate_conv_zone() and

using

> blk_revalidate_seq_zone() respectively. This simplifies the zone type
> switch-case in blk_revalidate_zone_cb().
> 
> No functional changes.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   block/blk-zoned.c | 129 +++++++++++++++++++++++++++-------------------
>   1 file changed, 77 insertions(+), 52 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 04/14] block: Fix reference counting for zone write plugs in error state
  2024-05-02  6:01   ` Hannes Reinecke
@ 2024-05-02  9:37     ` Damien Le Moal
  0 siblings, 0 replies; 31+ messages in thread
From: Damien Le Moal @ 2024-05-02  9:37 UTC (permalink / raw)
  To: Hannes Reinecke, linux-block, Jens Axboe, dm-devel, Mike Snitzer

On 5/2/24 15:01, Hannes Reinecke wrote:
>> +static inline void disk_zone_wplug_set_error(struct gendisk *disk,
>> +					     struct blk_zone_wplug *zwplug)
>> +{
>> +	unsigned long flags;
>> +
>> +	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
>> +		return;
>> +
> 
> I still get nervous when I see an unprotected flag being set.
> Especially in code which is known to race with error handling.
> Wouldn't it be better to check the flag under the lock or at
> least use 'test_and_set_bit' here?

It is protected: this is always called with the zone write plug spinlock being
locked.

> 
>> +	/*
>> +	 * At this point, we already have a reference on the zone write plug.
>> +	 * However, since we are going to add the plug to the disk zone write
>> +	 * plugs work list, increase its reference count. This reference will
>> +	 * be dropped in disk_zone_wplugs_work() once the error state is
>> +	 * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
>> +	 * finished.
>> +	 */
>> +	zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
> 
> And that is even worse. We might have been interrupted between these
> two lines, invalidating the first check.

Nope: zone write plug spinlock is locked.

> 
> Please consider using 'test_and_set_bit()' here.
> 
>> +	atomic_inc(&zwplug->ref);
>> +
>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>> +	list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +}
>> +
>> +static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
>> +					       struct blk_zone_wplug *zwplug)
>> +{
>> +	unsigned long flags;
>> +
>> +	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
>> +		return;
>> +
>> +	/*
>> +	 * We are racing with the error handling work which drops the reference
>> +	 * on the zone write plug after handling the error state. So remove the
>> +	 * plug from the error list and drop its reference count only if the
>> +	 * error handling has not yet started, that is, if the zone write plug
>> +	 * is still listed.
>> +	 */
>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>> +	if (!list_empty(&zwplug->link)) {
>> +		list_del_init(&zwplug->link);
>> +		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
>> +		disk_put_zone_wplug(zwplug);
>> +	}
>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +}
>> +
> 
> Similar comments to above: you are clearing the flag under the lock,
> but don't check or set under the lock...

And similar comment: this is called with the zone write plug spinlock held. So
no race with the flag handling. What is racy is the error handling because we
can only hold disk->zone_wplugs_lock at first, and have to release that lock
before we take the zone write plug spinlock (otherwise we would have lock order
inversion). And the error hanlding needs to do a report zone, so no zone write
plug spinlock either, and in the meantime, the user may do a zone reset or reset
all... Hence the trickery here to look at if the error handling work already
took the plug out of the list for processing or not. If it has, then the error
handling will do what is needed with the error flag.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-05-02  9:37 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-01 11:08 [PATCH v3 00/14] Zone write plugging fixes and cleanup Damien Le Moal
2024-05-01 11:08 ` [PATCH v3 01/14] dm: Check that a zoned table leads to a valid mapped device Damien Le Moal
2024-05-01 11:08 ` [PATCH v3 02/14] block: Exclude conventional zones when faking max open limit Damien Le Moal
2024-05-02  5:52   ` Hannes Reinecke
2024-05-01 11:08 ` [PATCH v3 03/14] block: Fix zone write plug initialization from blk_revalidate_zone_cb() Damien Le Moal
2024-05-02  5:53   ` Hannes Reinecke
2024-05-01 11:08 ` [PATCH v3 04/14] block: Fix reference counting for zone write plugs in error state Damien Le Moal
2024-05-02  6:01   ` Hannes Reinecke
2024-05-02  9:37     ` Damien Le Moal
2024-05-01 11:08 ` [PATCH v3 05/14] block: Hold a reference on zone write plugs to schedule submission Damien Le Moal
2024-05-02  6:04   ` Hannes Reinecke
2024-05-01 11:08 ` [PATCH v3 06/14] block: Unhash a zone write plug only if needed Damien Le Moal
2024-05-02  6:05   ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 07/14] block: Do not remove zone write plugs still in use Damien Le Moal
2024-05-02  5:38   ` Christoph Hellwig
2024-05-02  6:09   ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 08/14] block: Fix flush request sector restore Damien Le Moal
2024-05-02  6:10   ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 09/14] block: Fix handling of non-empty flush write requests to zones Damien Le Moal
2024-05-02  6:11   ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 10/14] block: Improve blk_zone_write_plug_bio_merged() Damien Le Moal
2024-05-02  6:13   ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 11/14] block: Improve zone write request completion handling Damien Le Moal
2024-05-02  6:14   ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 12/14] block: Simplify blk_zone_write_plug_bio_endio() Damien Le Moal
2024-05-02  6:15   ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 13/14] block: Simplify zone write plug BIO abort Damien Le Moal
2024-05-02  6:15   ` Hannes Reinecke
2024-05-01 11:09 ` [PATCH v3 14/14] block: Cleanup blk_revalidate_zone_cb() Damien Le Moal
2024-05-02  6:17   ` Hannes Reinecke
2024-05-01 14:08 ` [PATCH v3 00/14] Zone write plugging fixes and cleanup Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).