linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 00/13] support non power of 2 zoned device
       [not found] <CGME20220615101924eucas1p27fbce623c0e1b3097169bf23dd6266d8@eucas1p2.samsung.com>
@ 2022-06-15 10:19 ` Pankaj Raghav
       [not found]   ` <CGME20220615101927eucas1p17220c7a36f69f59ff8ddd560b42967ec@eucas1p1.samsung.com>
                     ` (12 more replies)
  0 siblings, 13 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2").

NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, zone cap is not the same as the
po2 zone size. When the zone cap != zone size, then unmapped LBAs are
introduced to cover the space between the zone cap and zone size. po2
requirement does not make sense for these type of zone storage devices.
This patch series aims to remove these unmapped LBAs for zoned devices when
zone cap is npo2. This is done by relaxing the po2 zone size constraint
in the kernel and allowing zoned device with npo2 zone sizes if zone cap
== zone size.

Removing the po2 requirement from zone storage should be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [1].

There were two efforts previously to add support to npo2 devices: 1) via
device level emulation [2] but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[3] 2)
adding support to the complete stack by removing the constraint in the
block layer and NVMe layer with support to btrfs, zonefs, etc which was
rejected with a conclusion to add a dm target for FS support [0]
to reduce the regression impact.

This series adds support to npo2 zoned devices in the block and nvme
layer and a new dm target is added: dm-zoned-npo2-target. This new
target will be initially used for filesystems such as btrfs and
f2fs that does not have native npo2 zone support.

- Patchset description:
Patches 1-2 deals with removing the po2 constraint from the
block layer.

Patches 3-4 deals with removing the constraint from nvme zns.

Patch 5-6 removes the po2 contraint in null blk

Patch 7 adds npo2 support to zonefs

Patches 8-13 adds support for npo2 zoned devices in the DM layer and
adds a new target dm-zoned-npo2-target

The patch series is based on linux-next tag: next-20220615

Testing:
The new target was tested with fio, xfstests with btrfs and zonefs test
suite. No new regression was found during the testing.

[0] https://lore.kernel.org/lkml/PH0PR04MB74166C87F694B150A5AE0F009BD09@PH0PR04MB7416.namprd04.prod.outlook.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[2] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[3] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
  (Damien)
- q is never null in from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
  drives (Damien)
- Rounddown operation should be made generic to work in 32 bits arch
  (bart)
- Add comments where generic calculation is directly used instead having
  special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs to be 1M
  instead of BTRFS_STRIPE_LEN(David)

Changes since v2:
- Minor formatting changes

Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
  (David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size

Changes since v4:
- Remove direct filesystems support for npo2 devices (Johannes, Hannes,
  Damien)

Changes since v5:
- Use DIV_ROUND_UP* helper instead of round_up as it breaks 32bit arch
  build in null blk(kernel-test-robot, Nathan)
- Use DIV_ROUND_UP_SECTOR_T also in blkdev_nr_zones function instead of
  open coding it with div64_u64
- Added extra condition in dm-zoned and in dm to reject non power of 2
  zone sizes.

Changes since v6:
- Added a new dm target for non power of 2 devices
- Added support for non power of 2 devices in the DM layer.

Luis Chamberlain (1):
  dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
  block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2
    zsze
  block: allow blk-zoned devices to have non-power-of-2 zone size
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  null_blk: allow non power of 2 zoned devices
  null_blk: use zone_size_sects_shift for power of 2 zoned devices
  zonefs: allow non power of 2 zoned devices
  dm-zone: use generic helpers to calculate offset from zone start
  dm-table: use bdev_is_zone_start helper in device_area_is_invalid()
  dm-table: allow non po2 zoned devices
  dm: call dm_zone_endio after the target endio callback for zoned
    devices
  dm: add non power of 2 zoned target

 block/blk-core.c                  |   3 +-
 block/blk-zoned.c                 |  37 ++++-
 drivers/block/null_blk/main.c     |   5 +-
 drivers/block/null_blk/null_blk.h |   6 +
 drivers/block/null_blk/zoned.c    |  18 +-
 drivers/md/Kconfig                |   9 +
 drivers/md/Makefile               |   2 +
 drivers/md/dm-table.c             |   6 +-
 drivers/md/dm-zone.c              |  16 +-
 drivers/md/dm-zoned-npo2-target.c | 268 ++++++++++++++++++++++++++++++
 drivers/md/dm-zoned-target.c      |   8 +
 drivers/md/dm.c                   |   8 +-
 drivers/nvme/host/zns.c           |  21 ++-
 drivers/nvme/target/zns.c         |   2 +-
 fs/zonefs/super.c                 |   6 +-
 fs/zonefs/zonefs.h                |   1 -
 include/linux/blkdev.h            |  48 +++++-
 17 files changed, 416 insertions(+), 48 deletions(-)
 create mode 100644 drivers/md/dm-zoned-npo2-target.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v7 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
       [not found]   ` <CGME20220615101927eucas1p17220c7a36f69f59ff8ddd560b42967ec@eucas1p1.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  2022-06-15 20:18       ` [dm-devel] " Bart Van Assche
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav, Luis Chamberlain

Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-zoned.c      | 12 +++++++++---
 include/linux/blkdev.h |  8 +++++++-
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 38cd840d8..8b0615287 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -111,16 +111,22 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
  * blkdev_nr_zones - Get number of zones
  * @disk:	Target gendisk
  *
- * Return the total number of zones of a zoned block device.  For a block
- * device without zone capabilities, the number of zones is always 0.
+ * Return the total number of zones of a zoned block device, including the
+ * eventual small last zone if present. For a block device without zone
+ * capabilities, the number of zones is always 0.
  */
 unsigned int blkdev_nr_zones(struct gendisk *disk)
 {
 	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
+	sector_t capacity = get_capacity(disk);
 
 	if (!blk_queue_is_zoned(disk->queue))
 		return 0;
-	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	return DIV_ROUND_UP_SECTOR_T(capacity, zone_sectors);
 }
 EXPORT_SYMBOL_GPL(blkdev_nr_zones);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 914c613d8..39017ae9d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -681,9 +681,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 					     sector_t sector)
 {
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+
 	if (!blk_queue_is_zoned(q))
 		return 0;
-	return sector >> ilog2(q->limits.chunk_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return sector >> ilog2(zone_sectors);
+
+	return div64_u64(sector, zone_sectors);
 }
 
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size
       [not found]   ` <CGME20220615101931eucas1p15ed09ae433a2c378b599e9086130d8eb@eucas1p1.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  2022-06-15 20:28       ` [dm-devel] " Bart Van Assche
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav, Luis Chamberlain

Checking if a given sector is aligned to a zone is a common
operation that is performed for zoned devices. Add
blk_queue_is_zone_start helper to check for this instead of opencoding it
everywhere.

Convert the calculations on zone size to be generic instead of relying on
power_of_2 based logic in the block layer using the helpers wherever
possible.

The only hot path affected by this change for power_of_2 zoned devices
is in blk_check_zone_append() but blk_queue_is_zone_start() helper is
used to optimize the calculation for po2 zone sizes. Note that the append
path cannot be accessed by direct raw access to the block device but only
through a filesystem abstraction.

Finally, allow non power of 2 zoned devices provided that their zone
capacity and zone size are equal. The main motivation to allow non
power_of_2 zoned device is to remove the unmapped LBA between zcap and
zsze for devices that cannot have a power_of_2 zcap.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-core.c       |  3 +--
 block/blk-zoned.c      | 25 +++++++++++++++++++------
 include/linux/blkdev.h | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 06ff5bbfe..248b947e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -629,8 +629,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (pos & (blk_queue_zone_sectors(q) - 1) ||
-	    !blk_queue_zone_is_seq(q, pos))
+	if (!blk_queue_is_zone_start(q, pos) || !blk_queue_zone_is_seq(q, pos))
 		return BLK_STS_IOERR;
 
 	/*
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 8b0615287..7957eec04 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -288,10 +288,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 		return -EINVAL;
 
 	/* Check alignment (handle eventual smaller last zone) */
-	if (sector & (zone_sectors - 1))
+	if (!blk_queue_is_zone_start(q, sector))
 		return -EINVAL;
 
-	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+	if (!blk_queue_is_zone_start(q, nr_sectors) && end_sector != capacity)
 		return -EINVAL;
 
 	/*
@@ -489,14 +489,27 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	 * smaller last zone.
 	 */
 	if (zone->start == 0) {
-		if (zone->len == 0 || !is_power_of_2(zone->len)) {
-			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
-				disk->disk_name, zone->len);
+		if (zone->len == 0) {
+			pr_warn("%s: Invalid zone size", disk->disk_name);
+			return -ENODEV;
+		}
+
+		/*
+		 * Don't allow zoned device with non power_of_2 zone size with
+		 * zone capacity less than zone size.
+		 */
+		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
+				disk->disk_name);
 			return -ENODEV;
 		}
 
 		args->zone_sectors = zone->len;
-		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+		/*
+		 * Division is used to calculate nr_zones for both power_of_2
+		 * and non power_of_2 zone sizes as it is not in the hot path.
+		 */
+		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
 	} else if (zone->start + args->zone_sectors < capacity) {
 		if (zone->len != args->zone_sectors) {
 			pr_warn("%s: Invalid zoned device with non constant zone size\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 39017ae9d..3c106dba1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -692,6 +692,27 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 	return div64_u64(sector, zone_sectors);
 }
 
+static inline sector_t blk_queue_offset_from_zone_start(struct request_queue *q,
+							sector_t sec)
+{
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+	u64 remainder = 0;
+
+	if (!blk_queue_is_zoned(q))
+		return false;
+
+	if (is_power_of_2(zone_sectors))
+		return sec & (zone_sectors - 1);
+
+	div64_u64_rem(sec, zone_sectors, &remainder);
+	return remainder;
+}
+
+static inline bool blk_queue_is_zone_start(struct request_queue *q, sector_t sec)
+{
+	return blk_queue_offset_from_zone_start(q, sec) == 0;
+}
+
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
 					 sector_t sector)
 {
@@ -738,6 +759,18 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 {
 	return 0;
 }
+
+static inline sector_t blk_queue_offset_from_zone_start(struct request_queue *q,
+							sector_t sec)
+{
+	return 0;
+}
+
+static inline bool blk_queue_is_zone_start(struct request_queue *q, sector_t sec)
+{
+	return false;
+}
+
 static inline unsigned int queue_max_open_zones(const struct request_queue *q)
 {
 	return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 03/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
       [not found]   ` <CGME20220615101935eucas1p26a7bc245d88a89312158d7a265f64aef@eucas1p2.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav, Luis Chamberlain

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

As rounddown macro from math.h does not work for 32 bit architectures,
round down operation is open coded.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/zns.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 9f81beb4d..d92f937d5 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	}
 
 	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
-	if (!is_power_of_2(ns->zsze)) {
-		dev_warn(ns->ctrl->device,
-			"invalid zone size:%llu for namespace:%u\n",
-			ns->zsze, ns->head->ns_id);
-		status = -ENODEV;
-		goto free_data;
-	}
 
 	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -128,8 +121,13 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 	const size_t min_bufsize = sizeof(struct nvme_zone_report) +
 				   sizeof(struct nvme_zone_descriptor);
 
+	/*
+	 * Division is used to calculate nr_zones with no special handling
+	 * for power of 2 zone sizes as this function is not invoked in a
+	 * hot path
+	 */
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+			 div64_u64(get_capacity(ns->disk), ns->zsze));
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -182,6 +180,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	int ret, zone_idx = 0;
 	unsigned int nz, i;
 	size_t buflen;
+	u64 remainder = 0;
 
 	if (ns->head->ids.csi != NVME_CSI_ZNS)
 		return -EINVAL;
@@ -197,7 +196,11 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	/*
+	 * Round down the sector value to the nearest zone start
+	 */
+	div64_u64_rem(sector, ns->zsze, &remainder);
+	sector -= remainder;
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 04/13] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
       [not found]   ` <CGME20220615101938eucas1p26ab159a1ffd0fa5a16d7f202ba7206e7@eucas1p2.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav,
	Johannes Thumshirn, Luis Chamberlain

A generic bdev_zone_no helper is added to calculate zone number for a given
sector in a block device. This helper internally uses blk_queue_zone_no to
find the zone number.

Use the helper bdev_zone_no() to calculate nr of zones. This let's us
make modifications to the math if needed in one place and adds now
support for npo2 zone devices.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/target/zns.c | 2 +-
 include/linux/blkdev.h    | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index 82b61acf7..5516dd6cc 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -242,7 +242,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
 	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
 
 	return blkdev_nr_zones(req->ns->bdev->bd_disk) -
-		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
+	       bdev_zone_no(req->ns->bdev, sect);
 }
 
 static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3c106dba1..e09d73473 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1387,6 +1387,13 @@ static inline sector_t bdev_zone_sectors(struct block_device *bdev)
 	return 0;
 }
 
+static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_queue_zone_no(q, sec);
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 05/13] null_blk: allow non power of 2 zoned devices
       [not found]   ` <CGME20220615101941eucas1p25e1c27b363e6b288b848521298e31705@eucas1p2.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav, Luis Chamberlain

Convert the power of 2 based calculation with zone size to be generic in
null_zone_no with optimization for power of 2 based zone sizes.

The nr_zones calculation in null_init_zoned_dev has been replaced with a
division without special handling for power of 2 based zone sizes as
this function is called only during the initialization and will not
invoked in the hot path.

Performance Measurement:

Device:
zone size = 128M, blocksize=4k

FIO cmd:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
x-----------------x---------------------------------x---------------------------------x

Random read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
x-----------------x---------------------------------x---------------------------------x

Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c  |  5 ++---
 drivers/block/null_blk/zoned.c | 13 ++++++-------
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 6b67088f4..6c170927c 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1929,9 +1929,8 @@ static int null_validate_conf(struct nullb_device *dev)
 	if (dev->queue_mode == NULL_Q_BIO)
 		dev->mbps = 0;
 
-	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	if (dev->zoned && !dev->zone_size) {
+		pr_err("Invalid zero zone size\n");
 		return -EINVAL;
 	}
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 2fdd7b20c..daf327015 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -16,7 +16,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
 
 static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 {
-	return sect >> ilog2(dev->zone_size_sects);
+	if (is_power_of_2(dev->zone_size_sects))
+		return sect >> ilog2(dev->zone_size_sects);
+
+	return div64_u64(sect, dev->zone_size_sects);
 }
 
 static inline void null_lock_zone_res(struct nullb_device *dev)
@@ -65,10 +68,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	sector_t sector = 0;
 	unsigned int i;
 
-	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("zone_size must be power-of-two\n");
-		return -EINVAL;
-	}
 	if (dev->zone_size > dev->size) {
 		pr_err("Zone size larger than device capacity\n");
 		return -EINVAL;
@@ -86,8 +85,8 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
-	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
-		>> ilog2(dev->zone_size_sects);
+	dev->nr_zones =	DIV_ROUND_UP_SECTOR_T(dev_capacity_sects,
+					      dev->zone_size_sects);
 
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 06/13] null_blk: use zone_size_sects_shift for power of 2 zoned devices
       [not found]   ` <CGME20220615101945eucas1p16fa264e81d9b6027ff131dd311ed91e2@eucas1p1.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  2022-06-15 11:56       ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav, Damien Le Moal,
	Luis Chamberlain

Instead of doing is_power_of_2 and ilog2 operation for every IO, cache
the zone_size_sects_shift variable and use it for power of 2 zoned
devices.

This variable will be set to zero for non power of 2 zoned devices.

Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/null_blk.h |  6 ++++++
 drivers/block/null_blk/zoned.c    | 11 ++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 8359b4384..3bc7cbf25 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -83,6 +83,12 @@ struct nullb_device {
 	unsigned int imp_close_zone_no;
 	struct nullb_zone *zones;
 	sector_t zone_size_sects;
+	/*
+	 * zone_size_sects_shift is only useful when the zone size is
+	 * power of 2. This variable is set to zero when zone size is non
+	 * power of 2.
+	 */
+	unsigned int zone_size_sects_shift;
 	bool need_zone_res_mgmt;
 	spinlock_t zone_res_lock;
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index daf327015..5f929944b 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -16,8 +16,8 @@ static inline sector_t mb_to_sects(unsigned long mb)
 
 static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 {
-	if (is_power_of_2(dev->zone_size_sects))
-		return sect >> ilog2(dev->zone_size_sects);
+	if (dev->zone_size_sects_shift)
+		return sect >> dev->zone_size_sects_shift;
 
 	return div64_u64(sect, dev->zone_size_sects);
 }
@@ -85,9 +85,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
+
+	if (is_power_of_2(dev->zone_size_sects))
+		dev->zone_size_sects_shift = ilog2(dev->zone_size_sects);
+	else
+		dev->zone_size_sects_shift = 0;
+
 	dev->nr_zones =	DIV_ROUND_UP_SECTOR_T(dev_capacity_sects,
 					      dev->zone_size_sects);
-
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
 	if (!dev->zones)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 07/13] zonefs: allow non power of 2 zoned devices
       [not found]   ` <CGME20220615101948eucas1p2d8d801735c39b25256a019134adb0c6f@eucas1p2.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav, Luis Chamberlain

The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/zonefs/super.c  | 6 ++----
 fs/zonefs/zonefs.h | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index cc6d4cf58..0b737c2fb 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -485,10 +485,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
-	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	unsigned int noio_flag;
 	unsigned int nr_zones =
-		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
 	struct zonefs_ioerr_data err = {
 		.inode = inode,
 		.write = write,
@@ -1410,7 +1409,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	int ret = 0;
 
-	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
 	inode->i_mode = S_IFREG | sbi->s_perm;
 
 	zi->i_ztype = type;
@@ -1787,7 +1786,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	 * interface constraints.
 	 */
 	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
-	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
 	sbi->s_uid = GLOBAL_ROOT_UID;
 	sbi->s_gid = GLOBAL_ROOT_GID;
 	sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3..39895195c 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -177,7 +177,6 @@ struct zonefs_sb_info {
 	kgid_t			s_gid;
 	umode_t			s_perm;
 	uuid_t			s_uuid;
-	unsigned int		s_zone_sectors_shift;
 
 	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 08/13] dm-zoned: ensure only power of 2 zone sizes are allowed
       [not found]   ` <CGME20220615101951eucas1p238eb45e563bd9645af81bf16c56d98ec@eucas1p2.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Luis Chamberlain, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

Today dm-zoned relies on the assumption that you have a zone size
with a power of 2. Even though the block layer today enforces this
requirement, these devices do exist and so provide a stop-gap measure
to ensure these devices cannot be used by mistake

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-zoned-target.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 0ec5d8b9b..ad4228db5 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
 				return -EINVAL;
 			}
 			zone_nr_sectors = blk_queue_zone_sectors(q);
+			if (!is_power_of_2(zone_nr_sectors)) {
+				ti->error = "Zone size not power of 2";
+				return -EINVAL;
+			}
 			zoned_dev->zone_nr_sectors = zone_nr_sectors;
 			zoned_dev->nr_zones =
 				blkdev_nr_zones(zoned_dev->bdev->bd_disk);
@@ -806,6 +810,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
 		q = bdev_get_queue(zoned_dev->bdev);
 		zoned_dev->zone_nr_sectors = blk_queue_zone_sectors(q);
 		zoned_dev->nr_zones = blkdev_nr_zones(zoned_dev->bdev->bd_disk);
+		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
+			ti->error = "Zone size not power of 2";
+			return -EINVAL;
+		}
 	}
 
 	if (reg_dev) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 09/13] dm-zone: use generic helpers to calculate offset from zone start
       [not found]   ` <CGME20220615101955eucas1p19b9d42ead7331f69f7dad1ec100312c2@eucas1p1.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav, Luis Chamberlain

Use the blk_queue_offset_from_zone_start() helper function to calculate
the offset from zone start instead of using power of 2 based
calculation.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/md/dm-zone.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 3e7b1fe15..af36d33f9 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -395,7 +395,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE:
 		/* Writes must be aligned to the zone write pointer */
-		if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
+		if ((blk_queue_offset_from_zone_start(md->queue,
+						      clone->bi_iter.bi_sector)) != zwp_offset)
 			return false;
 		break;
 	case REQ_OP_ZONE_APPEND:
@@ -608,10 +609,8 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
 		 */
 		if (clone->bi_status == BLK_STS_OK &&
 		    bio_op(clone) == REQ_OP_ZONE_APPEND) {
-			sector_t mask = (sector_t)blk_queue_zone_sectors(q) - 1;
-
 			orig_bio->bi_iter.bi_sector +=
-				clone->bi_iter.bi_sector & mask;
+				blk_queue_offset_from_zone_start(q, clone->bi_iter.bi_sector);
 		}
 
 		return;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 10/13] dm-table: use bdev_is_zone_start helper in device_area_is_invalid()
       [not found]   ` <CGME20220615102000eucas1p27720aaa3c309327b2b9a33c5f840f498@eucas1p2.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  2022-06-15 11:53       ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav, Luis Chamberlain

Use bdev_is_zone_start() helper that uses generic calculation to check
for zone alignment instead of using po2 based alignment check.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/md/dm-table.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index bd539afbf..b553cdb6d 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -251,7 +251,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 	if (bdev_is_zoned(bdev)) {
 		unsigned int zone_sectors = bdev_zone_sectors(bdev);
 
-		if (start & (zone_sectors - 1)) {
+		if (blk_queue_is_zone_start(bdev_get_queue(bdev), start)) {
 			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
 			       dm_device_name(ti->table->md),
 			       (unsigned long long)start,
@@ -268,7 +268,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 		 * devices do not end up with a smaller zone in the middle of
 		 * the sector range.
 		 */
-		if (len & (zone_sectors - 1)) {
+		if (blk_queue_is_zone_start(bdev_get_queue(bdev), len)) {
 			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
 			       dm_device_name(ti->table->md),
 			       (unsigned long long)len,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 11/13] dm-table: allow non po2 zoned devices
       [not found]   ` <CGME20220615102004eucas1p1e458ea097d381058b16fc6daa3eec998@eucas1p1.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav

As the block layer now supports non po2 zoned devices, allow dm to
support non po2 zoned device.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-table.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index b553cdb6d..ec77e7830 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -251,7 +251,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 	if (bdev_is_zoned(bdev)) {
 		unsigned int zone_sectors = bdev_zone_sectors(bdev);
 
-		if (blk_queue_is_zone_start(bdev_get_queue(bdev), start)) {
+		if (!blk_queue_is_zone_start(bdev_get_queue(bdev), start)) {
 			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
 			       dm_device_name(ti->table->md),
 			       (unsigned long long)start,
@@ -268,7 +268,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 		 * devices do not end up with a smaller zone in the middle of
 		 * the sector range.
 		 */
-		if (blk_queue_is_zone_start(bdev_get_queue(bdev), len)) {
+		if (!blk_queue_is_zone_start(bdev_get_queue(bdev), len)) {
 			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
 			       dm_device_name(ti->table->md),
 			       (unsigned long long)len,
@@ -1648,7 +1648,7 @@ static int validate_hardware_zoned_model(struct dm_table *table,
 	}
 
 	/* Check zone size validity and compatibility */
-	if (!zone_sectors || !is_power_of_2(zone_sectors))
+	if (!zone_sectors)
 		return -EINVAL;
 
 	if (dm_table_any_dev_attr(table, device_not_matches_zone_sectors, &zone_sectors)) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 12/13] dm: call dm_zone_endio after the target endio callback for zoned devices
       [not found]   ` <CGME20220615102007eucas1p1106f9520e2a86beb3792107dffd8071b@eucas1p1.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  2022-06-15 11:01       ` [dm-devel] " Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav

dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
uses either native append or append emulation and it is called before the
endio of the target. But target endio can still update the clone bio
after dm_zone_endio is called, thereby, the orig bio does not contain
the updated information anymore. Call dm_zone_endio for zoned devices
after calling the target's endio function

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
@Damien and @Hannes: I couldn't come up with a testcase that uses endio callback and
zone append or append emulation for zoned devices to test for
regression in this change. It would be great if you can suggest
something. This change is required for the npo2 target as we update the
clone bio sector in the endio callback function and the orig bio should
be updated only after the endio callback for zone appends.

 drivers/md/dm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3f17fe1de..3a74e1038 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1025,10 +1025,6 @@ static void clone_endio(struct bio *bio)
 			disable_write_zeroes(md);
 	}
 
-	if (static_branch_unlikely(&zoned_enabled) &&
-	    unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))))
-		dm_zone_endio(io, bio);
-
 	if (endio) {
 		int r = endio(ti, bio, &error);
 		switch (r) {
@@ -1057,6 +1053,10 @@ static void clone_endio(struct bio *bio)
 		}
 	}
 
+	if (static_branch_unlikely(&zoned_enabled) &&
+	    unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))))
+		dm_zone_endio(io, bio);
+
 	if (static_branch_unlikely(&swap_bios_enabled) &&
 	    unlikely(swap_bios_limit(ti, bio)))
 		up(&md->swap_bios_semaphore);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v7 13/13] dm: add non power of 2 zoned target
       [not found]   ` <CGME20220615102011eucas1p220368db4a186181b1927dea50a79e5d4@eucas1p2.samsung.com>
@ 2022-06-15 10:19     ` Pankaj Raghav
  2022-06-15 11:49       ` Damien Le Moal
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:19 UTC (permalink / raw)
  To: hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Pankaj Raghav,
	Johannes Thumshirn, Damien Le Moal

Only power of 2(po2) zoned devices were supported in linux but now non
power of 2(npo2) zoned device support has been added to the block layer.

Filesystems such as F2FS and btrfs have support for zoned devices with
po2 zone size assumption. Before adding native support for npo2 zoned
devices, it was suggested to create a dm target for npo2 zoned device to
appear as po2 device so that file systems can initially work without any
explicit changes by using this target.

The design of this target is very simple: introduce gaps between the zone
capacity and the po2 zone size of the underlying device. All IOs will be
remapped from target to the actual device location. For devices that use
zone append, the bi_sector is remapped from device to target's layout.

The read IOs that fall in the "emulated" gap area will return 0 and all
the other IOs in that area will result in an error. If an read IO span
across the zone capacity boundary, then the IOs are split between the
boundary. All other IO operations that span across a zone capacity
boundary will result in an error.

The target can be easily updated as follows:
dmsetup create <label> --table '0 <size_sects> zoned-npo2 /dev/nvme<id>'

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
Suggested-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/Kconfig                |   9 +
 drivers/md/Makefile               |   2 +
 drivers/md/dm-zone.c              |   9 +
 drivers/md/dm-zoned-npo2-target.c | 268 ++++++++++++++++++++++++++++++
 4 files changed, 288 insertions(+)
 create mode 100644 drivers/md/dm-zoned-npo2-target.c

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 998a5cfdb..773314536 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -518,6 +518,15 @@ config DM_FLAKEY
 	help
 	 A target that intermittently fails I/O for debugging purposes.
 
+config DM_ZONED_NPO2
+	tristate "Zoned non power of 2 target"
+	depends on BLK_DEV_DM
+	depends on BLK_DEV_ZONED
+	help
+	A target that converts a zoned device with non power of 2 zone size to
+	be power of 2. This is done by introducing gaps in between the zone
+	capacity and the power of 2 zone size.
+
 config DM_VERITY
 	tristate "Verity target support"
 	depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 0454b0885..2863a94a7 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
 dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
 dm-verity-y	+= dm-verity-target.o
 dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
+dm-zoned-npo2-y       += dm-zoned-npo2-target.o
 
 md-mod-y	+= md.o md-bitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
 obj-$(CONFIG_DM_DUST)		+= dm-dust.o
 obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
+obj-$(CONFIG_DM_ZONED_NPO2)	+= dm-zoned-npo2.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index af36d33f9..5efb31ba0 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -210,6 +210,11 @@ static int dm_zone_revalidate_cb(struct blk_zone *zone, unsigned int idx,
 		}
 		md->zwp_offset[idx] = dm_get_zone_wp_offset(zone);
 
+		if (q->limits.chunk_sectors != zone->len) {
+			blk_queue_chunk_sectors(q, zone->len);
+			q->nr_zones = blkdev_nr_zones(md->disk);
+		}
+
 		break;
 	default:
 		DMERR("Invalid zone type 0x%x at sectors %llu",
@@ -307,6 +312,9 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 	if (dm_table_supports_zone_append(t)) {
 		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
 		dm_cleanup_zoned_dev(md);
+
+		if (!is_power_of_2(blk_queue_zone_sectors(q)))
+			goto revalidate_zones;
 		return 0;
 	}
 
@@ -318,6 +326,7 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 	if (!get_capacity(md->disk))
 		return 0;
 
+revalidate_zones:
 	return dm_revalidate_zones(md, t);
 }
 
diff --git a/drivers/md/dm-zoned-npo2-target.c b/drivers/md/dm-zoned-npo2-target.c
new file mode 100644
index 000000000..c1373d3ea
--- /dev/null
+++ b/drivers/md/dm-zoned-npo2-target.c
@@ -0,0 +1,268 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Samsung Electronics Co., Ltd.
+ */
+
+#include <linux/device-mapper.h>
+
+#define DM_MSG_PREFIX "zoned-npo2"
+
+struct dmz_npo2_target {
+	struct dm_dev *dev;
+	sector_t zsze;
+	sector_t zsze_po2;
+	sector_t zsze_diff;
+	u32 nr_zones;
+};
+
+enum dmz_npo2_io_cond {
+	DMZ_NPO2_IO_INSIDE_ZONE,
+	DMZ_NPO2_IO_ACROSS_ZONE,
+	DMZ_NPO2_IO_OUTSIDE_ZONE,
+};
+
+static inline u32 npo2_zone_no(struct dmz_npo2_target *dmh, sector_t sect)
+{
+	return div64_u64(sect, dmh->zsze);
+}
+
+static inline u32 po2_zone_no(struct dmz_npo2_target *dmh, sector_t sect)
+{
+	return sect >> ilog2(dmh->zsze_po2);
+}
+
+static inline sector_t target_to_device_sect(struct dmz_npo2_target *dmh,
+					     sector_t sect)
+{
+	u32 zone_idx = po2_zone_no(dmh, sect);
+
+	sect -= (zone_idx * dmh->zsze_diff);
+
+	return sect;
+}
+
+static inline sector_t device_to_target_sect(struct dmz_npo2_target *dmh,
+					     sector_t sect)
+{
+	u32 zone_idx = npo2_zone_no(dmh, sect);
+
+	sect += (zone_idx * dmh->zsze_diff);
+
+	return sect;
+}
+
+/*
+ * <dev-path>
+ * This target works on the complete zoned device. Partial mapping is not
+ * supported
+ */
+static int dmz_npo2_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dmz_npo2_target *dmh = NULL;
+	int ret = 0;
+	sector_t zsze;
+	sector_t disk_size;
+
+	if (argc < 1)
+		return -EINVAL;
+
+	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
+	if (!dmh)
+		return -ENOMEM;
+
+	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+			    &dmh->dev);
+
+	zsze = blk_queue_zone_sectors(bdev_get_queue(dmh->dev->bdev));
+
+	disk_size = get_capacity(dmh->dev->bdev->bd_disk);
+
+	if (ti->len != disk_size || ti->begin) {
+		DMERR("%pg Partial mapping of the target not supported",
+		      dmh->dev->bdev);
+		return -EINVAL;
+	}
+
+	if (is_power_of_2(zsze)) {
+		DMERR("%pg zone size is power of 2", dmh->dev->bdev);
+		return -EINVAL;
+	}
+
+	dmh->zsze = zsze;
+	dmh->zsze_po2 = 1 << get_count_order_long(zsze);
+	dmh->zsze_diff = dmh->zsze_po2 - dmh->zsze;
+
+	ti->private = dmh;
+	ti->num_flush_bios = 1;
+	ti->num_discard_bios = 1;
+	ti->num_secure_erase_bios = 1;
+	ti->num_write_zeroes_bios = 1;
+
+	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
+	ti->len = dmh->zsze_po2 * dmh->nr_zones;
+
+	return 0;
+}
+
+static int dmz_npo2_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+				    void *data)
+{
+	struct dm_report_zones_args *args = data;
+	struct dmz_npo2_target *dmh = args->tgt->private;
+
+	zone->start = device_to_target_sect(dmh, zone->start);
+	zone->wp = device_to_target_sect(dmh, zone->wp);
+	zone->len = dmh->zsze_po2;
+	args->next_sector = zone->start + zone->len;
+
+	return args->orig_cb(zone, args->zone_idx++, args->orig_data);
+}
+
+static int dmz_npo2_report_zones(struct dm_target *ti,
+				 struct dm_report_zones_args *args,
+				 unsigned int nr_zones)
+{
+	struct dmz_npo2_target *dmh = ti->private;
+	int ret = 0;
+	sector_t sect = po2_zone_no(dmh, args->next_sector) * dmh->zsze;
+
+	ret = blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
+				  dmz_npo2_report_zones_cb, args);
+	if (ret < 0)
+		DMERR("report zones error");
+
+	return ret;
+}
+
+static int check_zone_boundary_violation(struct dmz_npo2_target *dmh,
+					 sector_t sect, sector_t size)
+{
+	u32 zone_idx = po2_zone_no(dmh, sect);
+	sector_t relative_sect = 0;
+
+	sect = target_to_device_sect(dmh, sect);
+	relative_sect = sect - (zone_idx * dmh->zsze);
+
+	if ((relative_sect + size) <= dmh->zsze)
+		return DMZ_NPO2_IO_INSIDE_ZONE;
+	else if (relative_sect >= dmh->zsze)
+		return DMZ_NPO2_IO_OUTSIDE_ZONE;
+
+	return DMZ_NPO2_IO_ACROSS_ZONE;
+}
+
+static void split_io_across_zone_boundary(struct dmz_npo2_target *dmh,
+					  struct bio *bio)
+{
+	sector_t sect = bio->bi_iter.bi_sector;
+	sector_t sects_from_zone_start;
+
+	sect = target_to_device_sect(dmh, sect);
+	div64_u64_rem(sect, dmh->zsze, &sects_from_zone_start);
+	dm_accept_partial_bio(bio, dmh->zsze - sects_from_zone_start);
+	bio->bi_iter.bi_sector = sect;
+}
+
+static int handle_zone_boundary_violation(struct dmz_npo2_target *dmh,
+					  struct bio *bio,
+					  enum dmz_npo2_io_cond cond)
+{
+	/* Read should return zeroed page */
+	if (bio_op(bio) == REQ_OP_READ) {
+		if (cond == DMZ_NPO2_IO_ACROSS_ZONE) {
+			split_io_across_zone_boundary(dmh, bio);
+			return DM_MAPIO_REMAPPED;
+		}
+		zero_fill_bio(bio);
+		bio_endio(bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+	return DM_MAPIO_KILL;
+}
+
+static int dmz_npo2_end_io(struct dm_target *ti, struct bio *bio,
+			   blk_status_t *error)
+{
+	struct dmz_npo2_target *dmh = ti->private;
+
+	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
+		bio->bi_iter.bi_sector =
+			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
+
+	return DM_ENDIO_DONE;
+}
+
+static int dmz_npo2_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dmz_npo2_target *dmh = ti->private;
+	enum dmz_npo2_io_cond cond;
+
+	bio_set_dev(bio, dmh->dev->bdev);
+	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
+		cond = check_zone_boundary_violation(dmh, bio->bi_iter.bi_sector,
+						     bio->bi_iter.bi_size >> SECTOR_SHIFT);
+
+		/*
+		 * If the starting sector is in the emulated area then fill
+		 * all the bio with zeros. If bio is across boundaries,
+		 * split the bio across boundaries and fill zeros only for the
+		 * bio that is outside the zone capacity
+		 */
+		switch (cond) {
+		case DMZ_NPO2_IO_INSIDE_ZONE:
+			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
+								       bio->bi_iter.bi_sector);
+			break;
+		case DMZ_NPO2_IO_ACROSS_ZONE:
+		case DMZ_NPO2_IO_OUTSIDE_ZONE:
+			return handle_zone_boundary_violation(dmh, bio, cond);
+		}
+	}
+	return DM_MAPIO_REMAPPED;
+}
+
+static int dmz_npo2_iterate_devices(struct dm_target *ti,
+				    iterate_devices_callout_fn fn, void *data)
+{
+	struct dmz_npo2_target *dmh = ti->private;
+	sector_t len = 0;
+
+	len = dmh->nr_zones * dmh->zsze;
+	return fn(ti, dmh->dev, 0, len, data);
+}
+
+static struct target_type dmz_npo2_target = {
+	.name = "zoned-npo2",
+	.version = { 1, 0, 0 },
+	.features = DM_TARGET_ZONED_HM,
+	.map = dmz_npo2_map,
+	.end_io = dmz_npo2_end_io,
+	.report_zones = dmz_npo2_report_zones,
+	.iterate_devices = dmz_npo2_iterate_devices,
+	.module = THIS_MODULE,
+	.ctr = dmz_npo2_ctr,
+};
+
+static int __init dmz_npo2_init(void)
+{
+	int r = dm_register_target(&dmz_npo2_target);
+
+	if (r < 0)
+		DMERR("register failed %d", r);
+
+	return r;
+}
+
+static void __exit dmz_npo2_exit(void)
+{
+	dm_unregister_target(&dmz_npo2_target);
+}
+
+/* Module hooks */
+module_init(dmz_npo2_init);
+module_exit(dmz_npo2_exit);
+
+MODULE_DESCRIPTION(DM_NAME " non power 2 zoned target");
+MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
+MODULE_LICENSE("GPL");
+
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [dm-devel] [PATCH v7 12/13] dm: call dm_zone_endio after the target endio callback for zoned devices
  2022-06-15 10:19     ` [PATCH v7 12/13] dm: call dm_zone_endio after the target endio callback for " Pankaj Raghav
@ 2022-06-15 11:01       ` Damien Le Moal
  2022-06-16 12:24         ` Pankaj Raghav
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2022-06-15 11:01 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, pankydev8, gost.dev, jiangbo.365, linux-nvme,
	linux-kernel, linux-block, dm-devel, jonathan.derrick,
	Johannes.Thumshirn, dsterba, jaegeuk

On 6/15/22 19:19, Pankaj Raghav wrote:
> dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
> uses either native append or append emulation and it is called before the
> endio of the target. But target endio can still update the clone bio
> after dm_zone_endio is called, thereby, the orig bio does not contain
> the updated information anymore. Call dm_zone_endio for zoned devices
> after calling the target's endio function
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> @Damien and @Hannes: I couldn't come up with a testcase that uses endio callback and
> zone append or append emulation for zoned devices to test for
> regression in this change. It would be great if you can suggest
> something. This change is required for the npo2 target as we update the
> clone bio sector in the endio callback function and the orig bio should
> be updated only after the endio callback for zone appends.

Running zonefs tests on top of dm-crypt will exercise DM zone append
emulation.

> 
>  drivers/md/dm.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3f17fe1de..3a74e1038 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1025,10 +1025,6 @@ static void clone_endio(struct bio *bio)
>  			disable_write_zeroes(md);
>  	}
>  
> -	if (static_branch_unlikely(&zoned_enabled) &&
> -	    unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))))
> -		dm_zone_endio(io, bio);
> -
>  	if (endio) {
>  		int r = endio(ti, bio, &error);
>  		switch (r) {
> @@ -1057,6 +1053,10 @@ static void clone_endio(struct bio *bio)
>  		}
>  	}
>  
> +	if (static_branch_unlikely(&zoned_enabled) &&
> +	    unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))))

blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))) ->
bdev_is_zoned(bio->bi_bdev)

> +		dm_zone_endio(io, bio);
> +
>  	if (static_branch_unlikely(&swap_bios_enabled) &&
>  	    unlikely(swap_bios_limit(ti, bio)))
>  		up(&md->swap_bios_semaphore);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-15 10:19     ` [PATCH v7 13/13] dm: add non power of 2 zoned target Pankaj Raghav
@ 2022-06-15 11:49       ` Damien Le Moal
  2022-06-16 16:12         ` Pankaj Raghav
  2022-06-15 14:19       ` kernel test robot
  2022-06-15 19:54       ` Randy Dunlap
  2 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2022-06-15 11:49 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal

On 6/15/22 19:19, Pankaj Raghav wrote:
> Only power of 2(po2) zoned devices were supported in linux but now non
> power of 2(npo2) zoned device support has been added to the block layer.
> 
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zoned
> devices, it was suggested to create a dm target for npo2 zoned device to
> appear as po2 device so that file systems can initially work without any
> explicit changes by using this target.
> 
> The design of this target is very simple: introduce gaps between the zone
> capacity and the po2 zone size of the underlying device. All IOs will be
> remapped from target to the actual device location. For devices that use
> zone append, the bi_sector is remapped from device to target's layout.

Nothing special for zone append in this respect. All IOs are remapped
likewise, right ?

> 
> The read IOs that fall in the "emulated" gap area will return 0 and all
> the other IOs in that area will result in an error. If an read IO span
> across the zone capacity boundary, then the IOs are split between the
> boundary. All other IO operations that span across a zone capacity
> boundary will result in an error.
> 
> The target can be easily updated as follows:

Updated ? you mean created, no ?

> dmsetup create <label> --table '0 <size_sects> zoned-npo2 /dev/nvme<id>'
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
> Suggested-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/md/Kconfig                |   9 +
>  drivers/md/Makefile               |   2 +
>  drivers/md/dm-zone.c              |   9 +
>  drivers/md/dm-zoned-npo2-target.c | 268 ++++++++++++++++++++++++++++++
>  4 files changed, 288 insertions(+)
>  create mode 100644 drivers/md/dm-zoned-npo2-target.c
> 
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 998a5cfdb..773314536 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -518,6 +518,15 @@ config DM_FLAKEY
>  	help
>  	 A target that intermittently fails I/O for debugging purposes.
>  
> +config DM_ZONED_NPO2
> +	tristate "Zoned non power of 2 target"
> +	depends on BLK_DEV_DM
> +	depends on BLK_DEV_ZONED
> +	help
> +	A target that converts a zoned device with non power of 2 zone size to
> +	be power of 2. This is done by introducing gaps in between the zone
> +	capacity and the power of 2 zone size.
> +
>  config DM_VERITY
>  	tristate "Verity target support"
>  	depends on BLK_DEV_DM
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 0454b0885..2863a94a7 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
>  dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
>  dm-verity-y	+= dm-verity-target.o
>  dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
> +dm-zoned-npo2-y       += dm-zoned-npo2-target.o

This naming is in my opinion very bad as it seems related to the dm-zoned
target. e.g. dm-po2z, dm-zp2, etc.

>  
>  md-mod-y	+= md.o md-bitmap.o
>  raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
> @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
>  obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
>  obj-$(CONFIG_DM_DUST)		+= dm-dust.o
>  obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
> +obj-$(CONFIG_DM_ZONED_NPO2)	+= dm-zoned-npo2.o
>  obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
>  obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
>  obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index af36d33f9..5efb31ba0 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -210,6 +210,11 @@ static int dm_zone_revalidate_cb(struct blk_zone *zone, unsigned int idx,
>  		}
>  		md->zwp_offset[idx] = dm_get_zone_wp_offset(zone);
>  
> +		if (q->limits.chunk_sectors != zone->len) {

Why is this if needed ?

> +			blk_queue_chunk_sectors(q, zone->len);
> +			q->nr_zones = blkdev_nr_zones(md->disk);
> +		}
> +
>  		break;
>  	default:
>  		DMERR("Invalid zone type 0x%x at sectors %llu",
> @@ -307,6 +312,9 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
>  	if (dm_table_supports_zone_append(t)) {
>  		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
>  		dm_cleanup_zoned_dev(md);
> +

no need for the blank line.

> +		if (!is_power_of_2(blk_queue_zone_sectors(q)))
> +			goto revalidate_zones;
>  		return 0;
>  	}

Why do you need to change dm_set_zones_restrictions() at all ?

>  
> @@ -318,6 +326,7 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
>  	if (!get_capacity(md->disk))
>  		return 0;
>  
> +revalidate_zones:
>  	return dm_revalidate_zones(md, t);
>  }
>  
> diff --git a/drivers/md/dm-zoned-npo2-target.c b/drivers/md/dm-zoned-npo2-target.c
> new file mode 100644
> index 000000000..c1373d3ea
> --- /dev/null
> +++ b/drivers/md/dm-zoned-npo2-target.c
> @@ -0,0 +1,268 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Samsung Electronics Co., Ltd.
> + */
> +
> +#include <linux/device-mapper.h>
> +
> +#define DM_MSG_PREFIX "zoned-npo2"
> +
> +struct dmz_npo2_target {
> +	struct dm_dev *dev;
> +	sector_t zsze;
> +	sector_t zsze_po2;
> +	sector_t zsze_diff;

zsze ? is that zone size ? Spell this out please. This is not nvme.

> +	u32 nr_zones;
> +};
> +
> +enum dmz_npo2_io_cond {
> +	DMZ_NPO2_IO_INSIDE_ZONE,
> +	DMZ_NPO2_IO_ACROSS_ZONE,
> +	DMZ_NPO2_IO_OUTSIDE_ZONE,
> +};
> +
> +static inline u32 npo2_zone_no(struct dmz_npo2_target *dmh, sector_t sect)
> +{
> +	return div64_u64(sect, dmh->zsze);
> +}
> +
> +static inline u32 po2_zone_no(struct dmz_npo2_target *dmh, sector_t sect)
> +{
> +	return sect >> ilog2(dmh->zsze_po2);
> +}
> +
> +static inline sector_t target_to_device_sect(struct dmz_npo2_target *dmh,
> +					     sector_t sect)
> +{
> +	u32 zone_idx = po2_zone_no(dmh, sect);
> +
> +	sect -= (zone_idx * dmh->zsze_diff);

	return sect - (po2_zone_no(dmh, sect) * dmh->zsze_diff);
> +
> +	return sect;
> +}
> +
> +static inline sector_t device_to_target_sect(struct dmz_npo2_target *dmh,
> +					     sector_t sect)
> +{
> +	u32 zone_idx = npo2_zone_no(dmh, sect);
> +
> +	sect += (zone_idx * dmh->zsze_diff);

see above. Simplify.

> +
> +	return sect;
> +}
> +
> +/*
> + * <dev-path>

What is this above line intended meaning ?

> + * This target works on the complete zoned device. Partial mapping is not
> + * supported
> + */
> +static int dmz_npo2_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +	struct dmz_npo2_target *dmh = NULL;
> +	int ret = 0;
> +	sector_t zsze;
> +	sector_t disk_size;
> +
> +	if (argc < 1)
> +		return -EINVAL;
> +
> +	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
> +	if (!dmh)
> +		return -ENOMEM;
> +
> +	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +			    &dmh->dev);

No error check ?

> +
> +	zsze = blk_queue_zone_sectors(bdev_get_queue(dmh->dev->bdev));
> +
> +	disk_size = get_capacity(dmh->dev->bdev->bd_disk);

s/disk_size/dev_capacity

> +
> +	if (ti->len != disk_size || ti->begin) {
> +		DMERR("%pg Partial mapping of the target not supported",

missing a verb ("is not...")

> +		      dmh->dev->bdev);
> +		return -EINVAL;
> +	}
> +
> +	if (is_power_of_2(zsze)) {
> +		DMERR("%pg zone size is power of 2", dmh->dev->bdev);

Hmmm... You would end up with no remapping needed so it would still
work... Why error this ? A warning would work too.

> +		return -EINVAL;
> +	}
> +
> +	dmh->zsze = zsze;
> +	dmh->zsze_po2 = 1 << get_count_order_long(zsze);
> +	dmh->zsze_diff = dmh->zsze_po2 - dmh->zsze;
> +
> +	ti->private = dmh;
> +	ti->num_flush_bios = 1;
> +	ti->num_discard_bios = 1;
> +	ti->num_secure_erase_bios = 1;
> +	ti->num_write_zeroes_bios = 1;

Why all these ? I know dm-linear do that but I do not see why they would
be necessary for a single device target.

> +
> +	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
> +	ti->len = dmh->zsze_po2 * dmh->nr_zones;
> +
> +	return 0;
> +}
> +
> +static int dmz_npo2_report_zones_cb(struct blk_zone *zone, unsigned int idx,
> +				    void *data)
> +{
> +	struct dm_report_zones_args *args = data;
> +	struct dmz_npo2_target *dmh = args->tgt->private;
> +
> +	zone->start = device_to_target_sect(dmh, zone->start);
> +	zone->wp = device_to_target_sect(dmh, zone->wp);
> +	zone->len = dmh->zsze_po2;
> +	args->next_sector = zone->start + zone->len;
> +
> +	return args->orig_cb(zone, args->zone_idx++, args->orig_data);
> +}
> +
> +static int dmz_npo2_report_zones(struct dm_target *ti,
> +				 struct dm_report_zones_args *args,
> +				 unsigned int nr_zones)
> +{
> +	struct dmz_npo2_target *dmh = ti->private;
> +	int ret = 0;

no need for the = 0. No need for ret at all in fact.

> +	sector_t sect = po2_zone_no(dmh, args->next_sector) * dmh->zsze;
> +
> +	ret = blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
> +				  dmz_npo2_report_zones_cb, args);
> +	if (ret < 0)
> +		DMERR("report zones error");

Not useful. just "return blkdev_report_zones();"

> +
> +	return ret;
> +}
> +
> +static int check_zone_boundary_violation(struct dmz_npo2_target *dmh,
> +					 sector_t sect, sector_t size)
> +{
> +	u32 zone_idx = po2_zone_no(dmh, sect);
> +	sector_t relative_sect = 0;

No need for "= 0".

> +
> +	sect = target_to_device_sect(dmh, sect);
> +	relative_sect = sect - (zone_idx * dmh->zsze);

ofst_in_zone ? or sect_osft ?

> +
> +	if ((relative_sect + size) <= dmh->zsze)

no need for the inner brackets.

> +		return DMZ_NPO2_IO_INSIDE_ZONE;
> +	else if (relative_sect >= dmh->zsze)

no need for the else. And this is super confusing. This case correspond to
the BIO going beyond the zone capacity in the target address space,
meaning it is still WITHIN the target zone. But you call that "outside"
because it is for the device zone. Super confusing. It took me a lot of
rereading to finally get it.

> +		return DMZ_NPO2_IO_OUTSIDE_ZONE;
> +
> +	return DMZ_NPO2_IO_ACROSS_ZONE;

So you BIO is eeither fully contained within the zone or it is not. So why
not just return a bool ?

> +}
> +
> +static void split_io_across_zone_boundary(struct dmz_npo2_target *dmh,
> +					  struct bio *bio)
> +{
> +	sector_t sect = bio->bi_iter.bi_sector;
> +	sector_t sects_from_zone_start;
> +
> +	sect = target_to_device_sect(dmh, sect);

	sect = target_to_device_sect(dmh, bio->bi_iter.bi_sector);

is more readable.

> +	div64_u64_rem(sect, dmh->zsze, &sects_from_zone_start);
> +	dm_accept_partial_bio(bio, dmh->zsze - sects_from_zone_start);

So if this is a read BIO starting exactly at the target zone capacity
(sects_from_zone_start == zsze), then you accept 0 sectors ? What am I
missing here ?

> +	bio->bi_iter.bi_sector = sect;
> +}
> +
> +static int handle_zone_boundary_violation(struct dmz_npo2_target *dmh,
> +					  struct bio *bio,
> +					  enum dmz_npo2_io_cond cond)
> +{
> +	/* Read should return zeroed page */
> +	if (bio_op(bio) == REQ_OP_READ) {
> +		if (cond == DMZ_NPO2_IO_ACROSS_ZONE) {
> +			split_io_across_zone_boundary(dmh, bio);
> +			return DM_MAPIO_REMAPPED;
> +		}
> +		zero_fill_bio(bio);
> +		bio_endio(bio);
> +		return DM_MAPIO_SUBMITTED;
> +	}
> +	return DM_MAPIO_KILL;
> +}
> +
> +static int dmz_npo2_end_io(struct dm_target *ti, struct bio *bio,
> +			   blk_status_t *error)
> +{
> +	struct dmz_npo2_target *dmh = ti->private;
> +
> +	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		bio->bi_iter.bi_sector =
> +			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
> +
> +	return DM_ENDIO_DONE;
> +}
> +
> +static int dmz_npo2_map(struct dm_target *ti, struct bio *bio)
> +{
> +	struct dmz_npo2_target *dmh = ti->private;
> +	enum dmz_npo2_io_cond cond;
> +
> +	bio_set_dev(bio, dmh->dev->bdev);
> +	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
> +		cond = check_zone_boundary_violation(dmh, bio->bi_iter.bi_sector,
> +						     bio->bi_iter.bi_size >> SECTOR_SHIFT);

Why check this for zone management BIOs ? These have length = 0, always.

> +
> +		/*
> +		 * If the starting sector is in the emulated area then fill
> +		 * all the bio with zeros. If bio is across boundaries,
> +		 * split the bio across boundaries and fill zeros only for the
> +		 * bio that is outside the zone capacity
> +		 */
> +		switch (cond) {
> +		case DMZ_NPO2_IO_INSIDE_ZONE:
> +			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
> +								       bio->bi_iter.bi_sector);
> +			break;
> +		case DMZ_NPO2_IO_ACROSS_ZONE:
> +		case DMZ_NPO2_IO_OUTSIDE_ZONE:
> +			return handle_zone_boundary_violation(dmh, bio, cond);
> +		}
> +	}
> +	return DM_MAPIO_REMAPPED;

This entire function is very hard to read because everything is hidden in
helpers that are not super useful in my opinion. I would prefer seeing
cases for:
* zone management BIOs
* Reads and writes
* Everything else

where tests against the bio sector and length are visible, so one can
understand what is going on. If you need helpers, have handle_zone_mgmt(),
handle_read() etc. Something clear.

> +}
> +
> +static int dmz_npo2_iterate_devices(struct dm_target *ti,
> +				    iterate_devices_callout_fn fn, void *data)
> +{
> +	struct dmz_npo2_target *dmh = ti->private;
> +	sector_t len = 0;
> +
> +	len = dmh->nr_zones * dmh->zsze;

Move this to the declaration instead of setting len to 0 for nothing.

> +	return fn(ti, dmh->dev, 0, len, data);
> +}
> +
> +static struct target_type dmz_npo2_target = {
> +	.name = "zoned-npo2",
> +	.version = { 1, 0, 0 },
> +	.features = DM_TARGET_ZONED_HM,
> +	.map = dmz_npo2_map,
> +	.end_io = dmz_npo2_end_io,
> +	.report_zones = dmz_npo2_report_zones,
> +	.iterate_devices = dmz_npo2_iterate_devices,
> +	.module = THIS_MODULE,
> +	.ctr = dmz_npo2_ctr,
> +};
> +
> +static int __init dmz_npo2_init(void)
> +{
> +	int r = dm_register_target(&dmz_npo2_target);
> +
> +	if (r < 0)
> +		DMERR("register failed %d", r);
> +
> +	return r;
> +}
> +
> +static void __exit dmz_npo2_exit(void)
> +{
> +	dm_unregister_target(&dmz_npo2_target);
> +}
> +
> +/* Module hooks */
> +module_init(dmz_npo2_init);
> +module_exit(dmz_npo2_exit);
> +
> +MODULE_DESCRIPTION(DM_NAME " non power 2 zoned target");
> +MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
> +MODULE_LICENSE("GPL");
> +


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 10/13] dm-table: use bdev_is_zone_start helper in device_area_is_invalid()
  2022-06-15 10:19     ` [PATCH v7 10/13] dm-table: use bdev_is_zone_start helper in device_area_is_invalid() Pankaj Raghav
@ 2022-06-15 11:53       ` Damien Le Moal
  2022-06-16  9:55         ` Pankaj Raghav
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2022-06-15 11:53 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Luis Chamberlain

On 6/15/22 19:19, Pankaj Raghav wrote:
> Use bdev_is_zone_start() helper that uses generic calculation to check
> for zone alignment instead of using po2 based alignment check.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/md/dm-table.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index bd539afbf..b553cdb6d 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -251,7 +251,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>  	if (bdev_is_zoned(bdev)) {
>  		unsigned int zone_sectors = bdev_zone_sectors(bdev);
>  
> -		if (start & (zone_sectors - 1)) {
> +		if (blk_queue_is_zone_start(bdev_get_queue(bdev), start)) {

This is wrong. And you are changing this to the correct test in the next
patch.

>  			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
>  			       dm_device_name(ti->table->md),
>  			       (unsigned long long)start,
> @@ -268,7 +268,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>  		 * devices do not end up with a smaller zone in the middle of
>  		 * the sector range.
>  		 */
> -		if (len & (zone_sectors - 1)) {
> +		if (blk_queue_is_zone_start(bdev_get_queue(bdev), len)) {
>  			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
>  			       dm_device_name(ti->table->md),
>  			       (unsigned long long)len,


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 06/13] null_blk: use zone_size_sects_shift for power of 2 zoned devices
  2022-06-15 10:19     ` [PATCH v7 06/13] null_blk: use zone_size_sects_shift for " Pankaj Raghav
@ 2022-06-15 11:56       ` Damien Le Moal
  0 siblings, 0 replies; 37+ messages in thread
From: Damien Le Moal @ 2022-06-15 11:56 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal,
	Luis Chamberlain

On 6/15/22 19:19, Pankaj Raghav wrote:
> Instead of doing is_power_of_2 and ilog2 operation for every IO, cache
> the zone_size_sects_shift variable and use it for power of 2 zoned
> devices.
> 
> This variable will be set to zero for non power of 2 zoned devices.
> 
> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/block/null_blk/null_blk.h |  6 ++++++
>  drivers/block/null_blk/zoned.c    | 11 ++++++++---
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
> index 8359b4384..3bc7cbf25 100644
> --- a/drivers/block/null_blk/null_blk.h
> +++ b/drivers/block/null_blk/null_blk.h
> @@ -83,6 +83,12 @@ struct nullb_device {
>  	unsigned int imp_close_zone_no;
>  	struct nullb_zone *zones;
>  	sector_t zone_size_sects;
> +	/*
> +	 * zone_size_sects_shift is only useful when the zone size is
> +	 * power of 2. This variable is set to zero when zone size is non
> +	 * power of 2.
> +	 */
> +	unsigned int zone_size_sects_shift;
>  	bool need_zone_res_mgmt;
>  	spinlock_t zone_res_lock;
>  
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index daf327015..5f929944b 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -16,8 +16,8 @@ static inline sector_t mb_to_sects(unsigned long mb)
>  
>  static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
>  {
> -	if (is_power_of_2(dev->zone_size_sects))
> -		return sect >> ilog2(dev->zone_size_sects);
> +	if (dev->zone_size_sects_shift)
> +		return sect >> dev->zone_size_sects_shift;
>  
>  	return div64_u64(sect, dev->zone_size_sects);
>  }
> @@ -85,9 +85,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
>  	dev_capacity_sects = mb_to_sects(dev->size);
>  	dev->zone_size_sects = mb_to_sects(dev->zone_size);
> +
> +	if (is_power_of_2(dev->zone_size_sects))
> +		dev->zone_size_sects_shift = ilog2(dev->zone_size_sects);
> +	else
> +		dev->zone_size_sects_shift = 0;
> +
>  	dev->nr_zones =	DIV_ROUND_UP_SECTOR_T(dev_capacity_sects,
>  					      dev->zone_size_sects);
> -

white line change.

This patch should be squashed with the previous one.

>  	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
>  				    GFP_KERNEL | __GFP_ZERO);
>  	if (!dev->zones)


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-15 10:19     ` [PATCH v7 13/13] dm: add non power of 2 zoned target Pankaj Raghav
  2022-06-15 11:49       ` Damien Le Moal
@ 2022-06-15 14:19       ` kernel test robot
  2022-06-15 19:54       ` Randy Dunlap
  2 siblings, 0 replies; 37+ messages in thread
From: kernel test robot @ 2022-06-15 14:19 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, damien.lemoal, axboe
  Cc: kbuild-all, bvanassche, linux-kernel, jiangbo.365, hare,
	pankydev8, dm-devel, jonathan.derrick, gost.dev, dsterba,
	jaegeuk, linux-nvme, Johannes.Thumshirn, linux-block,
	Pankaj Raghav, Damien Le Moal

Hi Pankaj,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on device-mapper-dm/for-next linus/master v5.19-rc2 next-20220615]
[cannot apply to song-md/md-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Pankaj-Raghav/block-make-blkdev_nr_zones-and-blk_queue_zone_no-generic-for-npo2-zsze/20220615-191927
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: s390-allmodconfig (https://download.01.org/0day-ci/archive/20220615/202206152257.pnoPyl7X-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/add4ab54d5b34d4a2f91f241007f23a56c164fb3
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Pankaj-Raghav/block-make-blkdev_nr_zones-and-blk_queue_zone_no-generic-for-npo2-zsze/20220615-191927
        git checkout add4ab54d5b34d4a2f91f241007f23a56c164fb3
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/md/dm-zoned-npo2-target.c: In function 'dmz_npo2_ctr':
>> drivers/md/dm-zoned-npo2-target.c:62:13: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
      62 |         int ret = 0;
         |             ^~~


vim +/ret +62 drivers/md/dm-zoned-npo2-target.c

    53	
    54	/*
    55	 * <dev-path>
    56	 * This target works on the complete zoned device. Partial mapping is not
    57	 * supported
    58	 */
    59	static int dmz_npo2_ctr(struct dm_target *ti, unsigned int argc, char **argv)
    60	{
    61		struct dmz_npo2_target *dmh = NULL;
  > 62		int ret = 0;
    63		sector_t zsze;
    64		sector_t disk_size;
    65	
    66		if (argc < 1)
    67			return -EINVAL;
    68	
    69		dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
    70		if (!dmh)
    71			return -ENOMEM;
    72	
    73		ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
    74				    &dmh->dev);
    75	
    76		zsze = blk_queue_zone_sectors(bdev_get_queue(dmh->dev->bdev));
    77	
    78		disk_size = get_capacity(dmh->dev->bdev->bd_disk);
    79	
    80		if (ti->len != disk_size || ti->begin) {
    81			DMERR("%pg Partial mapping of the target not supported",
    82			      dmh->dev->bdev);
    83			return -EINVAL;
    84		}
    85	
    86		if (is_power_of_2(zsze)) {
    87			DMERR("%pg zone size is power of 2", dmh->dev->bdev);
    88			return -EINVAL;
    89		}
    90	
    91		dmh->zsze = zsze;
    92		dmh->zsze_po2 = 1 << get_count_order_long(zsze);
    93		dmh->zsze_diff = dmh->zsze_po2 - dmh->zsze;
    94	
    95		ti->private = dmh;
    96		ti->num_flush_bios = 1;
    97		ti->num_discard_bios = 1;
    98		ti->num_secure_erase_bios = 1;
    99		ti->num_write_zeroes_bios = 1;
   100	
   101		dmh->nr_zones = npo2_zone_no(dmh, ti->len);
   102		ti->len = dmh->zsze_po2 * dmh->nr_zones;
   103	
   104		return 0;
   105	}
   106	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-15 10:19     ` [PATCH v7 13/13] dm: add non power of 2 zoned target Pankaj Raghav
  2022-06-15 11:49       ` Damien Le Moal
  2022-06-15 14:19       ` kernel test robot
@ 2022-06-15 19:54       ` Randy Dunlap
  2022-06-16 10:28         ` Pankaj Raghav
  2 siblings, 1 reply; 37+ messages in thread
From: Randy Dunlap @ 2022-06-15 19:54 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal

Hi--

On 6/15/22 03:19, Pankaj Raghav wrote:

> ---
>  drivers/md/Kconfig                |   9 +
>  drivers/md/Makefile               |   2 +
>  drivers/md/dm-zone.c              |   9 +
>  drivers/md/dm-zoned-npo2-target.c | 268 ++++++++++++++++++++++++++++++
>  4 files changed, 288 insertions(+)
>  create mode 100644 drivers/md/dm-zoned-npo2-target.c
> 
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 998a5cfdb..773314536 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -518,6 +518,15 @@ config DM_FLAKEY
>  	help
>  	 A target that intermittently fails I/O for debugging purposes.
>  
> +config DM_ZONED_NPO2
> +	tristate "Zoned non power of 2 target"

	         "Zoned non-power-of-2 target"

> +	depends on BLK_DEV_DM
> +	depends on BLK_DEV_ZONED
> +	help
> +	A target that converts a zoned device with non power of 2 zone size to

	                                           non-power-of-2 zone size to

> +	be power of 2. This is done by introducing gaps in between the zone
> +	capacity and the power of 2 zone size.

All help text should be indented with one tab and 2 spaces
according to Documentation/process/coding-style.rst.

> +
>  config DM_VERITY
>  	tristate "Verity target support"
>  	depends on BLK_DEV_DM

[snip]

> +
> +MODULE_DESCRIPTION(DM_NAME " non power 2 zoned target");

                                non-power-of-2

> +MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
> +MODULE_LICENSE("GPL");
> +

-- 
~Randy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [dm-devel] [PATCH v7 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
  2022-06-15 10:19     ` [PATCH v7 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
@ 2022-06-15 20:18       ` Bart Van Assche
  0 siblings, 0 replies; 37+ messages in thread
From: Bart Van Assche @ 2022-06-15 20:18 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, damien.lemoal, axboe
  Cc: pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, dm-devel, jonathan.derrick, Johannes.Thumshirn,
	dsterba, jaegeuk, Luis Chamberlain

On 6/15/22 03:19, Pankaj Raghav wrote:
> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
> also work for non-power-of-2 zone sizes.
> 
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is kept for those devices.
> 
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [dm-devel] [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-06-15 10:19     ` [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
@ 2022-06-15 20:28       ` Bart Van Assche
  2022-06-16 10:09         ` Pankaj Raghav
  0 siblings, 1 reply; 37+ messages in thread
From: Bart Van Assche @ 2022-06-15 20:28 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, damien.lemoal, axboe
  Cc: pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, dm-devel, jonathan.derrick, Johannes.Thumshirn,
	dsterba, jaegeuk, Luis Chamberlain

On 6/15/22 03:19, Pankaj Raghav wrote:
> @@ -489,14 +489,27 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>   	 * smaller last zone.
>   	 */
>   	if (zone->start == 0) {
> -		if (zone->len == 0 || !is_power_of_2(zone->len)) {
> -			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
> -				disk->disk_name, zone->len);
> +		if (zone->len == 0) {
> +			pr_warn("%s: Invalid zone size", disk->disk_name);
> +			return -ENODEV;
> +		}
> +
> +		/*
> +		 * Don't allow zoned device with non power_of_2 zone size with
> +		 * zone capacity less than zone size.
> +		 */

Please change "power_of_2" into "power-of-2".

> +		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> +			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
> +				disk->disk_name);
>   			return -ENODEV;
>   		}

The above check seems wrong to me. I don't see why devices that report a 
capacity that is less than the zone size should be rejected.

> +		/*
> +		 * Division is used to calculate nr_zones for both power_of_2
> +		 * and non power_of_2 zone sizes as it is not in the hot path.
> +		 */

Shouldn't the above comment be moved to the patch description? I'm not 
sure whether having such a comment in the source code is valuable.

> +static inline sector_t blk_queue_offset_from_zone_start(struct request_queue *q,
> +							sector_t sec)
> +{
> +	sector_t zone_sectors = blk_queue_zone_sectors(q);
> +	u64 remainder = 0;
> +
> +	if (!blk_queue_is_zoned(q))
> +		return false;

"return false" should only occur in functions returning a boolean. This 
function returns type sector_t.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 10/13] dm-table: use bdev_is_zone_start helper in device_area_is_invalid()
  2022-06-15 11:53       ` Damien Le Moal
@ 2022-06-16  9:55         ` Pankaj Raghav
  2022-06-16 23:29           ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-16  9:55 UTC (permalink / raw)
  To: Damien Le Moal, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Luis Chamberlain

drivers/md/dm-table.c
>> +++ b/drivers/md/dm-table.c
>> @@ -251,7 +251,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>>  	if (bdev_is_zoned(bdev)) {
>>  		unsigned int zone_sectors = bdev_zone_sectors(bdev);
>>  
>> -		if (start & (zone_sectors - 1)) {
>> +		if (blk_queue_is_zone_start(bdev_get_queue(bdev), start)) {
> 
> This is wrong. And you are changing this to the correct test in the next
> patch.
> 
Yeah, I think I made a mistake while committing it. The next patch
should only have the removing po2 condition check and these changes
should have been only in this patch. I will fix it up!
>>  			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
>>  			       dm_device_name(ti->table->md),
>>  			       (unsigned long long)start,
>> @@ -268,7 +268,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>>  		 * devices do not end up with a smaller zone in the middle of
>>  		 * the sector range.
>>  		 */
>> -		if (len & (zone_sectors - 1)) {
>> +		if (blk_queue_is_zone_start(bdev_get_queue(bdev), len)) {
>>  			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
>>  			       dm_device_name(ti->table->md),
>>  			       (unsigned long long)len,
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [dm-devel] [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-06-15 20:28       ` [dm-devel] " Bart Van Assche
@ 2022-06-16 10:09         ` Pankaj Raghav
  2022-06-16 16:04           ` Luis Chamberlain
  2022-06-16 23:30           ` Damien Le Moal
  0 siblings, 2 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-16 10:09 UTC (permalink / raw)
  To: Bart Van Assche, hch, snitzer, damien.lemoal, axboe
  Cc: pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, dm-devel, jonathan.derrick, Johannes.Thumshirn,
	dsterba, jaegeuk, Luis Chamberlain

On 2022-06-15 22:28, Bart Van Assche wrote:
isk_name, zone->len);
>> +        if (zone->len == 0) {
>> +            pr_warn("%s: Invalid zone size", disk->disk_name);
>> +            return -ENODEV;
>> +        }
>> +
>> +        /*
>> +         * Don't allow zoned device with non power_of_2 zone size with
>> +         * zone capacity less than zone size.
>> +         */
> 

> Please change "power_of_2" into "power-of-2".
> 
Ok.
>> +        if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
>> +            pr_warn("%s: Invalid zone capacity for non power of 2
>> zone size",
>> +                disk->disk_name);
>>               return -ENODEV;
>>           }
> 
> The above check seems wrong to me. I don't see why devices that report a
> capacity that is less than the zone size should be rejected.
> 
This was brought up by Damien during previous reviews. The argument was
that the reason to allow non power-of-2 zoned device is to remove the
gaps between zone size and zone capacity. Allowing a npo2 zone size with
a different capacity, even though it is technically possible, it does
not make any practical sense. That is why this check was introduced.
Does that answer your question?
>> +        /*
>> +         * Division is used to calculate nr_zones for both power_of_2
>> +         * and non power_of_2 zone sizes as it is not in the hot path.
>> +         */
> 
> Shouldn't the above comment be moved to the patch description? I'm not
> sure whether having such a comment in the source code is valuable.
> 
Yeah, I will remove it. Maybe it is very obvious at this point.
>> +static inline sector_t blk_queue_offset_from_zone_start(struct
>> request_queue *q,
>> +                            sector_t sec)
>> +{
>> +    sector_t zone_sectors = blk_queue_zone_sectors(q);
>> +    u64 remainder = 0;
>> +
>> +    if (!blk_queue_is_zoned(q))
>> +        return false;
> 
> "return false" should only occur in functions returning a boolean. This
> function returns type sector_t.
> 
Good catch. It was a copy paste mistake. Fixed it.
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-15 19:54       ` Randy Dunlap
@ 2022-06-16 10:28         ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-16 10:28 UTC (permalink / raw)
  To: Randy Dunlap, hch, snitzer, damien.lemoal, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal

On 2022-06-15 21:54, Randy Dunlap wrote:
> Hi--
> 
> On 6/15/22 03:19, Pankaj Raghav wrote:
> 
>> ---
>>  drivers/md/Kconfig                |   9 +
>>  drivers/md/Makefile               |   2 +
>>  drivers/md/dm-zone.c              |   9 +
>>  drivers/md/dm-zoned-npo2-target.c | 268 ++++++++++++++++++++++++++++++
>>  4 files changed, 288 insertions(+)
>>  create mode 100644 drivers/md/dm-zoned-npo2-target.c
>>
>> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
>> index 998a5cfdb..773314536 100644
>> --- a/drivers/md/Kconfig
>> +++ b/drivers/md/Kconfig
>> @@ -518,6 +518,15 @@ config DM_FLAKEY
>>  	help
>>  	 A target that intermittently fails I/O for debugging purposes.
>>  
>> +config DM_ZONED_NPO2
>> +	tristate "Zoned non power of 2 target"
> 
> 	         "Zoned non-power-of-2 target"
> 
>> +	depends on BLK_DEV_DM
>> +	depends on BLK_DEV_ZONED
>> +	help
>> +	A target that converts a zoned device with non power of 2 zone size to
> 
> 	                                           non-power-of-2 zone size to
> 
>> +	be power of 2. This is done by introducing gaps in between the zone
>> +	capacity and the power of 2 zone size.
> 
> All help text should be indented with one tab and 2 spaces
> according to Documentation/process/coding-style.rst.
> 
>> +
>>  config DM_VERITY
>>  	tristate "Verity target support"
>>  	depends on BLK_DEV_DM
> 
> [snip]
> 
>> +
>> +MODULE_DESCRIPTION(DM_NAME " non power 2 zoned target");
> 
>                                 non-power-of-2
> 
>> +MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
>> +MODULE_LICENSE("GPL");
>> +
> 
Thanks Randy. Fixed.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [dm-devel] [PATCH v7 12/13] dm: call dm_zone_endio after the target endio callback for zoned devices
  2022-06-15 11:01       ` [dm-devel] " Damien Le Moal
@ 2022-06-16 12:24         ` Pankaj Raghav
  2022-06-16 23:33           ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-16 12:24 UTC (permalink / raw)
  To: Damien Le Moal, hch, snitzer, axboe
  Cc: bvanassche, pankydev8, gost.dev, jiangbo.365, linux-nvme,
	linux-kernel, linux-block, dm-devel, jonathan.derrick,
	Johannes.Thumshirn, dsterba, jaegeuk

On 2022-06-15 13:01, Damien Le Moal wrote:
> On 6/15/22 19:19, Pankaj Raghav wrote:
>> dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
>> uses either native append or append emulation and it is called before the
>> endio of the target. But target endio can still update the clone bio
>> after dm_zone_endio is called, thereby, the orig bio does not contain
>> the updated information anymore. Call dm_zone_endio for zoned devices
>> after calling the target's endio function
>>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>> @Damien and @Hannes: I couldn't come up with a testcase that uses endio callback and
>> zone append or append emulation for zoned devices to test for
>> regression in this change. It would be great if you can suggest
>> something. This change is required for the npo2 target as we update the
>> clone bio sector in the endio callback function and the orig bio should
>> be updated only after the endio callback for zone appends.
> 
> Running zonefs tests on top of dm-crypt will exercise DM zone append
> emulation.
> 
Thanks. However, I am facing issues creating a dm-crypt target with a
zoned device. Steps:
- cryptsetup luksFormat <zns-device>

is throwing a bunch of IO errors with the following error message:
Device wipe error, offset 32768.
Cannot wipe header on device <zns-device>.

I can observe the same behavior in both v5.18 and next-20220615 with
cryptsetup 2.4.3.The same step is working correctly on a normal NVMe device.
Am I doing something wrong?
ZNS info: zsze 128M and zcap 128M with 50 zones
>>
>>  drivers/md/dm.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 3f17fe1de..3a74e1038 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1025,10 +1025,6 @@ static void clone_endio(struct bio *bio)
>>  			disable_write_zeroes(md);
>>  	}
>>  
>> -	if (static_branch_unlikely(&zoned_enabled) &&
>> -	    unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))))
>> -		dm_zone_endio(io, bio);
>> -
>>  	if (endio) {
>>  		int r = endio(ti, bio, &error);
>>  		switch (r) {
>> @@ -1057,6 +1053,10 @@ static void clone_endio(struct bio *bio)
>>  		}
>>  	}
>>  
>> +	if (static_branch_unlikely(&zoned_enabled) &&
>> +	    unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))))
> 
> blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))) ->
> bdev_is_zoned(bio->bi_bdev)
> 
Ok. Even though I just moved the statements, I think this is trivial
enough to take it along.
>> +		dm_zone_endio(io, bio);
>> +
>>  	if (static_branch_unlikely(&swap_bios_enabled) &&
>>  	    unlikely(swap_bios_limit(ti, bio)))
>>  		up(&md->swap_bios_semaphore);
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [dm-devel] [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-06-16 10:09         ` Pankaj Raghav
@ 2022-06-16 16:04           ` Luis Chamberlain
  2022-06-16 23:30           ` Damien Le Moal
  1 sibling, 0 replies; 37+ messages in thread
From: Luis Chamberlain @ 2022-06-16 16:04 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Bart Van Assche, hch, snitzer, damien.lemoal, axboe, pankydev8,
	gost.dev, jiangbo.365, linux-nvme, linux-kernel, linux-block,
	dm-devel, jonathan.derrick, Johannes.Thumshirn, dsterba, jaegeuk

On Thu, Jun 16, 2022 at 12:09:35PM +0200, Pankaj Raghav wrote:
> On 2022-06-15 22:28, Bart Van Assche wrote:
> >> +        if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> >> +            pr_warn("%s: Invalid zone capacity for non power of 2
> >> zone size",
> >> +                disk->disk_name);
> >>               return -ENODEV;
> >>           }
> > 
> > The above check seems wrong to me. I don't see why devices that report a
> > capacity that is less than the zone size should be rejected.
> > 
> This was brought up by Damien during previous reviews. The argument was
> that the reason to allow non power-of-2 zoned device is to remove the
> gaps between zone size and zone capacity. Allowing a npo2 zone size with
> a different capacity, even though it is technically possible, it does
> not make any practical sense. That is why this check was introduced.
> Does that answer your question?

Perhaps just add a comment because unless you are involved in the prior
reviews this might not be clear.

  Luis

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-15 11:49       ` Damien Le Moal
@ 2022-06-16 16:12         ` Pankaj Raghav
  2022-06-16 23:49           ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-16 16:12 UTC (permalink / raw)
  To: Damien Le Moal, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal

Hi Damien,
On 2022-06-15 13:49, Damien Le Moal wrote:
> On 6/15/22 19:19, Pankaj Raghav wrote:
>> Only power of 2(po2) zoned devices were supported in linux but now non
>> power of 2(npo2) zoned device support has been added to the block layer.
>>
>> Filesystems such as F2FS and btrfs have support for zoned devices with
>> po2 zone size assumption. Before adding native support for npo2 zoned
>> devices, it was suggested to create a dm target for npo2 zoned device to
>> appear as po2 device so that file systems can initially work without any
>> explicit changes by using this target.
>>
>> The design of this target is very simple: introduce gaps between the zone
>> capacity and the po2 zone size of the underlying device. All IOs will be
>> remapped from target to the actual device location. For devices that use
>> zone append, the bi_sector is remapped from device to target's layout.
> 
> Nothing special for zone append in this respect. All IOs are remapped
> likewise, right ?
> 
This is what is being done: when we submit, we adjust the sector value
from target to device, and the actual sector value from bio gets updated
in the endio function where we transform from device -> target for zone
appends.
>>
>> The read IOs that fall in the "emulated" gap area will return 0 and all
>> the other IOs in that area will result in an error. If an read IO span
>> across the zone capacity boundary, then the IOs are split between the
>> boundary. All other IO operations that span across a zone capacity
>> boundary will result in an error.
>>
>> The target can be easily updated as follows:
> 
> Updated ? you mean created, no ?
> 
Yeah. I will fix it.
>> dmsetup create <label> --table '0 <size_sects> zoned-npo2 /dev/nvme<id>'
>>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Suggested-by: Hannes Reinecke <hare@suse.de>
>> ---
>>  drivers/md/Kconfig                |   9 +
>>  drivers/md/Makefile               |   2 +
>>  drivers/md/dm-zone.c              |   9 +
>>  drivers/md/dm-zoned-npo2-target.c | 268 ++++++++++++++++++++++++++++++
>>  4 files changed, 288 insertions(+)
>>  create mode 100644 drivers/md/dm-zoned-npo2-target.c
>>
>> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
>> index 998a5cfdb..773314536 100644
>> --- a/drivers/md/Kconfig
>> +++ b/drivers/md/Kconfig
>> @@ -518,6 +518,15 @@ config DM_FLAKEY
>>  	help
>>  	 A target that intermittently fails I/O for debugging purposes.
>>  
>> +config DM_ZONED_NPO2
>> +	tristate "Zoned non power of 2 target"
>> +	depends on BLK_DEV_DM
>> +	depends on BLK_DEV_ZONED
>> +	help
>> +	A target that converts a zoned device with non power of 2 zone size to
>> +	be power of 2. This is done by introducing gaps in between the zone
>> +	capacity and the power of 2 zone size.
>> +
>>  config DM_VERITY
>>  	tristate "Verity target support"
>>  	depends on BLK_DEV_DM
>> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
>> index 0454b0885..2863a94a7 100644
>> --- a/drivers/md/Makefile
>> +++ b/drivers/md/Makefile
>> @@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
>>  dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
>>  dm-verity-y	+= dm-verity-target.o
>>  dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
>> +dm-zoned-npo2-y       += dm-zoned-npo2-target.o
> 
> This naming is in my opinion very bad as it seems related to the dm-zoned
> target. e.g. dm-po2z, dm-zp2, etc.
> 
Probably dm-po2z sounds good. I will go for it if others don't have any
objection.
>>  
>>  md-mod-y	+= md.o md-bitmap.o
>>  raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
>> @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
>>  obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
>>  obj-$(CONFIG_DM_DUST)		+= dm-dust.o
>>  obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
>> +obj-$(CONFIG_DM_ZONED_NPO2)	+= dm-zoned-npo2.o
>>  obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
>>  obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
>>  obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
>> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
>> index af36d33f9..5efb31ba0 100644
>> --- a/drivers/md/dm-zone.c
>> +++ b/drivers/md/dm-zone.c
>> @@ -210,6 +210,11 @@ static int dm_zone_revalidate_cb(struct blk_zone *zone, unsigned int idx,
>>  		}
>>  		md->zwp_offset[idx] = dm_get_zone_wp_offset(zone);
>>  
>> +		if (q->limits.chunk_sectors != zone->len) {
> 
> Why is this if needed ?
> 
Explanation below.
>> +			blk_queue_chunk_sectors(q, zone->len);
>> +			q->nr_zones = blkdev_nr_zones(md->disk);
>> +		}
>> +
>>  		break;
>>  	default:
>>  		DMERR("Invalid zone type 0x%x at sectors %llu",
>> @@ -307,6 +312,9 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
>>  	if (dm_table_supports_zone_append(t)) {
>>  		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
>>  		dm_cleanup_zoned_dev(md);
>> +
> 
> no need for the blank line.
> 
>> +		if (!is_power_of_2(blk_queue_zone_sectors(q)))
>> +			goto revalidate_zones;
>>  		return 0;
>>  	}
> 
> Why do you need to change dm_set_zones_restrictions() at all ?
> 
When the device mapper is created, the q->limits gets inherited from the
underlying device. The chunk sectors of the target and the device will
be the same but we want the chunk sector of the target to be different
(rounded to po2) compared to the underlying device's chunk sector. This
needs to be done only for the dm-po2z target and not for other targets
that uses npo2 zoned devices (like dm-linear). So to perform this
operation in a target independent way in dm-zone.c, I chose to always
revalidate npo2 zoned device and update the chunk sector and nr_zones in
dm_zone_revalidate_cb based on the zone information from the target.
This allows to set the limits correctly for dm-po2z target.
>>  
<snip>
return -EINVAL;
>> +	}
>> +
>> +	if (is_power_of_2(zsze)) {
>> +		DMERR("%pg zone size is power of 2", dmh->dev->bdev);
> 
> Hmmm... You would end up with no remapping needed so it would still
> work... Why error this ? A warning would work too.
> 
You mean a DMWARN and not return -EINVAL? I mean there is no usecase for
po2 device to use this target so why allow it in the first place?
>> +		return -EINVAL;
>> +	}
>> +
>> +	dmh->zsze = zsze;
>> +	dmh->zsze_po2 = 1 << get_count_order_long(zsze);
>> +	dmh->zsze_diff = dmh->zsze_po2 - dmh->zsze;
>> +
>> +	ti->private = dmh;
>> +	ti->num_flush_bios = 1;
>> +	ti->num_discard_bios = 1;
>> +	ti->num_secure_erase_bios = 1;
>> +	ti->num_write_zeroes_bios = 1;
> 
> Why all these ? I know dm-linear do that but I do not see why they would
> be necessary for a single device target.
> 
Good point. I will remove them

<snip>
>> +		return DMZ_NPO2_IO_INSIDE_ZONE;
>> +	else if (relative_sect >= dmh->zsze)
> 
> no need for the else. And this is super confusing. This case correspond to
> the BIO going beyond the zone capacity in the target address space,
> meaning it is still WITHIN the target zone. But you call that "outside"
> because it is for the device zone. Super confusing. It took me a lot of
> rereading to finally get it.
> 
Probably my naming choice was not correct here for the enum. It should
be s/IO_INSIDE_ZONE/IO_INSIDE_ZONE_CAP, etc to be clear. I mainly wanted
to handle the case where a read across zone capacity should return
something valid instead of just an error as we emulate the LBAs from
zone cap to zone size.

DMZ_NPO2_IO_INSIDE_ZONE_CAP:
             zcap   zsize
---------------|-----|
      <------>
        bio
Normal scenario we just send what is there in the device.


DMZ_NPO2_IO_OUTSIDE_ZONE_CAP:
             zcap       zsize
---------------|---------|
                 <---->
                   bio

Read should return zero filled bio and other operation will return an
error because we are touching the emulated area.

DMZ_NPO2_IO_ACROSS_ZONE_CAP:
             zcap   zsize
---------------|-----|
           <------>
              bio
For reads, the bio should be split across zone cap and the bio on the
left hand side returns what is there in the device and the split bio on
the right hand side should just return zeroes. All other requests will
return an error.

>> +		return DMZ_NPO2_IO_OUTSIDE_ZONE;
>> +
>> +	return DMZ_NPO2_IO_ACROSS_ZONE;
> 
> So you BIO is eeither fully contained within the zone or it is not. So why
> not just return a bool ?
> 
As I explained above, I was considering the boundary as zone cap inside
a zone. The bio can be within zone cap, across zone cap into the
emulated zone size and outside zone capacity.

I didn't take into account the read across zone. I will make sure it is
correctly handled in the next revision.

             zcap  zsize          zcap
---------------|-----|--------------|
                  <------>
                     bio
>> +}
>> +
>> +static void split_io_across_zone_boundary(struct dmz_npo2_target *dmh,
>> +					  struct bio *bio)
>> +{
>> +	sector_t sect = bio->bi_iter.bi_sector;
>> +	sector_t sects_from_zone_start;
>> +
>> +	sect = target_to_device_sect(dmh, sect);
> 
> 	sect = target_to_device_sect(dmh, bio->bi_iter.bi_sector);
> 
> is more readable.
> 
>> +	div64_u64_rem(sect, dmh->zsze, &sects_from_zone_start);
>> +	dm_accept_partial_bio(bio, dmh->zsze - sects_from_zone_start);
> 
> So if this is a read BIO starting exactly at the target zone capacity
> (sects_from_zone_start == zsze), then you accept 0 sectors ? What am I
> missing here ?
> 
Your condition will not even touch this function. This function comes
into play only when the bio runs across the zone capacity as I mentioned
before.
>> +	bio->bi_iter.bi_sector = sect;
>> +}
>> +
>> +static int handle_zone_boundary_violation(struct dmz_npo2_target *dmh,
>> +					  struct bio *bio,
>> +					  enum dmz_npo2_io_cond cond)
>> +{
>> +	/* Read should return zeroed page */
>> +	if (bio_op(bio) == REQ_OP_READ) {
>> +		if (cond == DMZ_NPO2_IO_ACROSS_ZONE) {
>> +			split_io_across_zone_boundary(dmh, bio);
>> +			return DM_MAPIO_REMAPPED;
>> +		}
>> +		zero_fill_bio(bio);
>> +		bio_endio(bio);
>> +		return DM_MAPIO_SUBMITTED;
>> +	}
>> +	return DM_MAPIO_KILL;
>> +}
>> +
>> +static int dmz_npo2_end_io(struct dm_target *ti, struct bio *bio,
>> +			   blk_status_t *error)
>> +{
>> +	struct dmz_npo2_target *dmh = ti->private;
>> +
>> +	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
>> +		bio->bi_iter.bi_sector =
>> +			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
>> +
>> +	return DM_ENDIO_DONE;
>> +}
>> +
>> +static int dmz_npo2_map(struct dm_target *ti, struct bio *bio)
>> +{
>> +	struct dmz_npo2_target *dmh = ti->private;
>> +	enum dmz_npo2_io_cond cond;
>> +
>> +	bio_set_dev(bio, dmh->dev->bdev);
>> +	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
>> +		cond = check_zone_boundary_violation(dmh, bio->bi_iter.bi_sector,
>> +						     bio->bi_iter.bi_size >> SECTOR_SHIFT);
> 
> Why check this for zone management BIOs ? These have length = 0, always.
> 
>> +
>> +		/*
>> +		 * If the starting sector is in the emulated area then fill
>> +		 * all the bio with zeros. If bio is across boundaries,
>> +		 * split the bio across boundaries and fill zeros only for the
>> +		 * bio that is outside the zone capacity
>> +		 */
>> +		switch (cond) {
>> +		case DMZ_NPO2_IO_INSIDE_ZONE:
>> +			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
>> +								       bio->bi_iter.bi_sector);
>> +			break;
>> +		case DMZ_NPO2_IO_ACROSS_ZONE:
>> +		case DMZ_NPO2_IO_OUTSIDE_ZONE:
>> +			return handle_zone_boundary_violation(dmh, bio, cond);
>> +		}
>> +	}
>> +	return DM_MAPIO_REMAPPED;
> 
> This entire function is very hard to read because everything is hidden in
> helpers that are not super useful in my opinion. I would prefer seeing
> cases for:
> * zone management BIOs
> * Reads and writes
> * Everything else
> 
> where tests against the bio sector and length are visible, so one can
> understand what is going on. If you need helpers, have handle_zone_mgmt(),
> handle_read() etc. Something clear.
> 
Got it. I see the confusion here. I will rearrange it properly in the
next revision. Thanks for this comment.
>> +}
>> +
>> +static int dmz_npo2_iterate_devices(struct dm_target *ti,
>> +				    iterate_devices_callout_fn fn, void *data)
>> +{
>> +	struct dmz_npo2_target *dmh = ti->private;
>> +	sector_t len = 0;
>> +
>> +	len = dmh->nr_zones * dmh->zsze;
> 
> Move this to the declaration instead of setting len to 0 for nothing.
> 
Ok.
>> +	return fn(ti, dmh->dev, 0, len, data);


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 10/13] dm-table: use bdev_is_zone_start helper in device_area_is_invalid()
  2022-06-16  9:55         ` Pankaj Raghav
@ 2022-06-16 23:29           ` Damien Le Moal
  0 siblings, 0 replies; 37+ messages in thread
From: Damien Le Moal @ 2022-06-16 23:29 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Luis Chamberlain

On 6/16/22 18:55, Pankaj Raghav wrote:
> drivers/md/dm-table.c
>>> +++ b/drivers/md/dm-table.c
>>> @@ -251,7 +251,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>>>  	if (bdev_is_zoned(bdev)) {
>>>  		unsigned int zone_sectors = bdev_zone_sectors(bdev);
>>>  
>>> -		if (start & (zone_sectors - 1)) {
>>> +		if (blk_queue_is_zone_start(bdev_get_queue(bdev), start)) {
>>
>> This is wrong. And you are changing this to the correct test in the next
>> patch.
>>
> Yeah, I think I made a mistake while committing it. The next patch
> should only have the removing po2 condition check and these changes
> should have been only in this patch. I will fix it up!

This one and the next patch should be squashed together. They both deal
with non power of 2 zone size.


>>>  			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
>>>  			       dm_device_name(ti->table->md),
>>>  			       (unsigned long long)start,
>>> @@ -268,7 +268,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>>>  		 * devices do not end up with a smaller zone in the middle of
>>>  		 * the sector range.
>>>  		 */
>>> -		if (len & (zone_sectors - 1)) {
>>> +		if (blk_queue_is_zone_start(bdev_get_queue(bdev), len)) {
>>>  			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
>>>  			       dm_device_name(ti->table->md),
>>>  			       (unsigned long long)len,
>>
>>


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [dm-devel] [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-06-16 10:09         ` Pankaj Raghav
  2022-06-16 16:04           ` Luis Chamberlain
@ 2022-06-16 23:30           ` Damien Le Moal
  1 sibling, 0 replies; 37+ messages in thread
From: Damien Le Moal @ 2022-06-16 23:30 UTC (permalink / raw)
  To: Pankaj Raghav, Bart Van Assche, hch, snitzer, axboe
  Cc: pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, dm-devel, Luis Chamberlain, jonathan.derrick,
	Johannes.Thumshirn, dsterba, jaegeuk

On 6/16/22 19:09, Pankaj Raghav wrote:
> On 2022-06-15 22:28, Bart Van Assche wrote:
> isk_name, zone->len);
>>> +        if (zone->len == 0) {
>>> +            pr_warn("%s: Invalid zone size", disk->disk_name);
>>> +            return -ENODEV;
>>> +        }
>>> +
>>> +        /*
>>> +         * Don't allow zoned device with non power_of_2 zone size with
>>> +         * zone capacity less than zone size.
>>> +         */
>>
> 
>> Please change "power_of_2" into "power-of-2".
>>
> Ok.
>>> +        if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
>>> +            pr_warn("%s: Invalid zone capacity for non power of 2
>>> zone size",
>>> +                disk->disk_name);
>>>               return -ENODEV;
>>>           }
>>
>> The above check seems wrong to me. I don't see why devices that report a
>> capacity that is less than the zone size should be rejected.
>>
> This was brought up by Damien during previous reviews. The argument was
> that the reason to allow non power-of-2 zoned device is to remove the
> gaps between zone size and zone capacity. Allowing a npo2 zone size with
> a different capacity, even though it is technically possible, it does
> not make any practical sense. That is why this check was introduced.
> Does that answer your question?

Add a comment explaining this restriction, clearly mentioning that it is a
Linux restrictions and not mandated by the specifications.

>>> +        /*
>>> +         * Division is used to calculate nr_zones for both power_of_2
>>> +         * and non power_of_2 zone sizes as it is not in the hot path.
>>> +         */
>>
>> Shouldn't the above comment be moved to the patch description? I'm not
>> sure whether having such a comment in the source code is valuable.
>>
> Yeah, I will remove it. Maybe it is very obvious at this point.
>>> +static inline sector_t blk_queue_offset_from_zone_start(struct
>>> request_queue *q,
>>> +                            sector_t sec)
>>> +{
>>> +    sector_t zone_sectors = blk_queue_zone_sectors(q);
>>> +    u64 remainder = 0;
>>> +
>>> +    if (!blk_queue_is_zoned(q))
>>> +        return false;
>>
>> "return false" should only occur in functions returning a boolean. This
>> function returns type sector_t.
>>
> Good catch. It was a copy paste mistake. Fixed it.
>> Thanks,
>>
>> Bart.
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [dm-devel] [PATCH v7 12/13] dm: call dm_zone_endio after the target endio callback for zoned devices
  2022-06-16 12:24         ` Pankaj Raghav
@ 2022-06-16 23:33           ` Damien Le Moal
  0 siblings, 0 replies; 37+ messages in thread
From: Damien Le Moal @ 2022-06-16 23:33 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, pankydev8, gost.dev, jiangbo.365, linux-nvme,
	linux-kernel, linux-block, dm-devel, jonathan.derrick,
	Johannes.Thumshirn, dsterba, jaegeuk

On 6/16/22 21:24, Pankaj Raghav wrote:
> On 2022-06-15 13:01, Damien Le Moal wrote:
>> On 6/15/22 19:19, Pankaj Raghav wrote:
>>> dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
>>> uses either native append or append emulation and it is called before the
>>> endio of the target. But target endio can still update the clone bio
>>> after dm_zone_endio is called, thereby, the orig bio does not contain
>>> the updated information anymore. Call dm_zone_endio for zoned devices
>>> after calling the target's endio function
>>>
>>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>>> ---
>>> @Damien and @Hannes: I couldn't come up with a testcase that uses endio callback and
>>> zone append or append emulation for zoned devices to test for
>>> regression in this change. It would be great if you can suggest
>>> something. This change is required for the npo2 target as we update the
>>> clone bio sector in the endio callback function and the orig bio should
>>> be updated only after the endio callback for zone appends.
>>
>> Running zonefs tests on top of dm-crypt will exercise DM zone append
>> emulation.
>>
> Thanks. However, I am facing issues creating a dm-crypt target with a
> zoned device. Steps:
> - cryptsetup luksFormat <zns-device>

luks format is not supported because cryptsetup does not write the
metadata sequentially. I am working on fixing that. Use the plain format.

> 
> is throwing a bunch of IO errors with the following error message:
> Device wipe error, offset 32768.
> Cannot wipe header on device <zns-device>.
> 
> I can observe the same behavior in both v5.18 and next-20220615 with
> cryptsetup 2.4.3.The same step is working correctly on a normal NVMe device.
> Am I doing something wrong?
> ZNS info: zsze 128M and zcap 128M with 50 zones
>>>
>>>  drivers/md/dm.c | 8 ++++----
>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>> index 3f17fe1de..3a74e1038 100644
>>> --- a/drivers/md/dm.c
>>> +++ b/drivers/md/dm.c
>>> @@ -1025,10 +1025,6 @@ static void clone_endio(struct bio *bio)
>>>  			disable_write_zeroes(md);
>>>  	}
>>>  
>>> -	if (static_branch_unlikely(&zoned_enabled) &&
>>> -	    unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))))
>>> -		dm_zone_endio(io, bio);
>>> -
>>>  	if (endio) {
>>>  		int r = endio(ti, bio, &error);
>>>  		switch (r) {
>>> @@ -1057,6 +1053,10 @@ static void clone_endio(struct bio *bio)
>>>  		}
>>>  	}
>>>  
>>> +	if (static_branch_unlikely(&zoned_enabled) &&
>>> +	    unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))))
>>
>> blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev))) ->
>> bdev_is_zoned(bio->bi_bdev)
>>
> Ok. Even though I just moved the statements, I think this is trivial
> enough to take it along.
>>> +		dm_zone_endio(io, bio);
>>> +
>>>  	if (static_branch_unlikely(&swap_bios_enabled) &&
>>>  	    unlikely(swap_bios_limit(ti, bio)))
>>>  		up(&md->swap_bios_semaphore);
>>
>>


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-16 16:12         ` Pankaj Raghav
@ 2022-06-16 23:49           ` Damien Le Moal
  2022-06-17  5:45             ` Pankaj Raghav
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2022-06-16 23:49 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal

On 6/17/22 01:12, Pankaj Raghav wrote:
> Hi Damien,
> On 2022-06-15 13:49, Damien Le Moal wrote:
>> On 6/15/22 19:19, Pankaj Raghav wrote:
>>> Only power of 2(po2) zoned devices were supported in linux but now non
>>> power of 2(npo2) zoned device support has been added to the block layer.
>>>
>>> Filesystems such as F2FS and btrfs have support for zoned devices with
>>> po2 zone size assumption. Before adding native support for npo2 zoned
>>> devices, it was suggested to create a dm target for npo2 zoned device to
>>> appear as po2 device so that file systems can initially work without any
>>> explicit changes by using this target.
>>>
>>> The design of this target is very simple: introduce gaps between the zone
>>> capacity and the po2 zone size of the underlying device. All IOs will be
>>> remapped from target to the actual device location. For devices that use
>>> zone append, the bi_sector is remapped from device to target's layout.
>>
>> Nothing special for zone append in this respect. All IOs are remapped
>> likewise, right ?
>>
> This is what is being done: when we submit, we adjust the sector value
> from target to device, and the actual sector value from bio gets updated
> in the endio function where we transform from device -> target for zone
> appends.

I know all that. This was a remark intended at pointing out that this
commit message statement does not have any value, it does not help in
understanding any peculiarity of this target driver (if any). It seems
targeted at zone append only. Reword this to explain the remapping for all
IOs, and that zone management request and report zones also need remapping.

>>>
>>> The read IOs that fall in the "emulated" gap area will return 0 and all
>>> the other IOs in that area will result in an error. If an read IO span
>>> across the zone capacity boundary, then the IOs are split between the
>>> boundary. All other IO operations that span across a zone capacity
>>> boundary will result in an error.
>>>
>>> The target can be easily updated as follows:
>>
>> Updated ? you mean created, no ?
>>
> Yeah. I will fix it.
>>> dmsetup create <label> --table '0 <size_sects> zoned-npo2 /dev/nvme<id>'
>>>
>>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>>> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
>>> Suggested-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>  drivers/md/Kconfig                |   9 +
>>>  drivers/md/Makefile               |   2 +
>>>  drivers/md/dm-zone.c              |   9 +
>>>  drivers/md/dm-zoned-npo2-target.c | 268 ++++++++++++++++++++++++++++++
>>>  4 files changed, 288 insertions(+)
>>>  create mode 100644 drivers/md/dm-zoned-npo2-target.c
>>>
>>> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
>>> index 998a5cfdb..773314536 100644
>>> --- a/drivers/md/Kconfig
>>> +++ b/drivers/md/Kconfig
>>> @@ -518,6 +518,15 @@ config DM_FLAKEY
>>>  	help
>>>  	 A target that intermittently fails I/O for debugging purposes.
>>>  
>>> +config DM_ZONED_NPO2
>>> +	tristate "Zoned non power of 2 target"
>>> +	depends on BLK_DEV_DM
>>> +	depends on BLK_DEV_ZONED
>>> +	help
>>> +	A target that converts a zoned device with non power of 2 zone size to
>>> +	be power of 2. This is done by introducing gaps in between the zone
>>> +	capacity and the power of 2 zone size.
>>> +
>>>  config DM_VERITY
>>>  	tristate "Verity target support"
>>>  	depends on BLK_DEV_DM
>>> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
>>> index 0454b0885..2863a94a7 100644
>>> --- a/drivers/md/Makefile
>>> +++ b/drivers/md/Makefile
>>> @@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
>>>  dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
>>>  dm-verity-y	+= dm-verity-target.o
>>>  dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
>>> +dm-zoned-npo2-y       += dm-zoned-npo2-target.o
>>
>> This naming is in my opinion very bad as it seems related to the dm-zoned
>> target. e.g. dm-po2z, dm-zp2, etc.
>>
> Probably dm-po2z sounds good. I will go for it if others don't have any
> objection.
>>>  
>>>  md-mod-y	+= md.o md-bitmap.o
>>>  raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
>>> @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
>>>  obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
>>>  obj-$(CONFIG_DM_DUST)		+= dm-dust.o
>>>  obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
>>> +obj-$(CONFIG_DM_ZONED_NPO2)	+= dm-zoned-npo2.o
>>>  obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
>>>  obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
>>>  obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
>>> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
>>> index af36d33f9..5efb31ba0 100644
>>> --- a/drivers/md/dm-zone.c
>>> +++ b/drivers/md/dm-zone.c
>>> @@ -210,6 +210,11 @@ static int dm_zone_revalidate_cb(struct blk_zone *zone, unsigned int idx,
>>>  		}
>>>  		md->zwp_offset[idx] = dm_get_zone_wp_offset(zone);
>>>  
>>> +		if (q->limits.chunk_sectors != zone->len) {
>>
>> Why is this if needed ?
>>
> Explanation below.
>>> +			blk_queue_chunk_sectors(q, zone->len);
>>> +			q->nr_zones = blkdev_nr_zones(md->disk);
>>> +		}
>>> +
>>>  		break;
>>>  	default:
>>>  		DMERR("Invalid zone type 0x%x at sectors %llu",
>>> @@ -307,6 +312,9 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
>>>  	if (dm_table_supports_zone_append(t)) {
>>>  		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
>>>  		dm_cleanup_zoned_dev(md);
>>> +
>>
>> no need for the blank line.
>>
>>> +		if (!is_power_of_2(blk_queue_zone_sectors(q)))
>>> +			goto revalidate_zones;
>>>  		return 0;
>>>  	}
>>
>> Why do you need to change dm_set_zones_restrictions() at all ?
>>
> When the device mapper is created, the q->limits gets inherited from the
> underlying device. The chunk sectors of the target and the device will
> be the same but we want the chunk sector of the target to be different
> (rounded to po2) compared to the underlying device's chunk sector. This
> needs to be done only for the dm-po2z target and not for other targets
> that uses npo2 zoned devices (like dm-linear). So to perform this
> operation in a target independent way in dm-zone.c, I chose to always
> revalidate npo2 zoned device and update the chunk sector and nr_zones in
> dm_zone_revalidate_cb based on the zone information from the target.
> This allows to set the limits correctly for dm-po2z target.

But DM revalidate will be called for the target AFTER it is setup (after
its gendisk is added). So how can DM revalidate see the incorrect zone
size ? If that is the case, then the target constructor is broken or
missing something. DM revalidate zone is generic and only allocates the
zone bitmaps for the target device. There should be not need at all to
touch that function.

>>>  
> <snip>
> return -EINVAL;
>>> +	}
>>> +
>>> +	if (is_power_of_2(zsze)) {
>>> +		DMERR("%pg zone size is power of 2", dmh->dev->bdev);
>>
>> Hmmm... You would end up with no remapping needed so it would still
>> work... Why error this ? A warning would work too.
>>
> You mean a DMWARN and not return -EINVAL? I mean there is no usecase for
> po2 device to use this target so why allow it in the first place?

Why disallow it ? It will work. Either way is OK but I wanted to point out
that both path are I think equally valid.

>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	dmh->zsze = zsze;
>>> +	dmh->zsze_po2 = 1 << get_count_order_long(zsze);
>>> +	dmh->zsze_diff = dmh->zsze_po2 - dmh->zsze;
>>> +
>>> +	ti->private = dmh;
>>> +	ti->num_flush_bios = 1;
>>> +	ti->num_discard_bios = 1;
>>> +	ti->num_secure_erase_bios = 1;
>>> +	ti->num_write_zeroes_bios = 1;
>>
>> Why all these ? I know dm-linear do that but I do not see why they would
>> be necessary for a single device target.
>>
> Good point. I will remove them
> 
> <snip>
>>> +		return DMZ_NPO2_IO_INSIDE_ZONE;
>>> +	else if (relative_sect >= dmh->zsze)
>>
>> no need for the else. And this is super confusing. This case correspond to
>> the BIO going beyond the zone capacity in the target address space,
>> meaning it is still WITHIN the target zone. But you call that "outside"
>> because it is for the device zone. Super confusing. It took me a lot of
>> rereading to finally get it.
>>
> Probably my naming choice was not correct here for the enum. It should
> be s/IO_INSIDE_ZONE/IO_INSIDE_ZONE_CAP, etc to be clear. I mainly wanted
> to handle the case where a read across zone capacity should return
> something valid instead of just an error as we emulate the LBAs from
> zone cap to zone size.

Your target information fields are also badly named and make things hard
to understand.

> 
> DMZ_NPO2_IO_INSIDE_ZONE_CAP:
>              zcap   zsize
> ---------------|-----|
>       <------>
>         bio
> Normal scenario we just send what is there in the device.
> 
> 
> DMZ_NPO2_IO_OUTSIDE_ZONE_CAP:
>              zcap       zsize
> ---------------|---------|
>                  <---->
>                    bio
> 
> Read should return zero filled bio and other operation will return an
> error because we are touching the emulated area.
> 
> DMZ_NPO2_IO_ACROSS_ZONE_CAP:
>              zcap   zsize
> ---------------|-----|
>            <------>
>               bio
> For reads, the bio should be split across zone cap and the bio on the
> left hand side returns what is there in the device and the split bio on
> the right hand side should just return zeroes. All other requests will
> return an error.
> 
>>> +		return DMZ_NPO2_IO_OUTSIDE_ZONE;
>>> +
>>> +	return DMZ_NPO2_IO_ACROSS_ZONE;
>>
>> So you BIO is eeither fully contained within the zone or it is not. So why
>> not just return a bool ?
>>
> As I explained above, I was considering the boundary as zone cap inside
> a zone. The bio can be within zone cap, across zone cap into the
> emulated zone size and outside zone capacity.
> 
> I didn't take into account the read across zone. I will make sure it is
> correctly handled in the next revision.

Please test properly. This should have been caught in testing before sending.

> 
>              zcap  zsize          zcap
> ---------------|-----|--------------|
>                   <------>
>                      bio
>>> +}
>>> +
>>> +static void split_io_across_zone_boundary(struct dmz_npo2_target *dmh,
>>> +					  struct bio *bio)
>>> +{
>>> +	sector_t sect = bio->bi_iter.bi_sector;
>>> +	sector_t sects_from_zone_start;
>>> +
>>> +	sect = target_to_device_sect(dmh, sect);
>>
>> 	sect = target_to_device_sect(dmh, bio->bi_iter.bi_sector);
>>
>> is more readable.
>>
>>> +	div64_u64_rem(sect, dmh->zsze, &sects_from_zone_start);
>>> +	dm_accept_partial_bio(bio, dmh->zsze - sects_from_zone_start);
>>
>> So if this is a read BIO starting exactly at the target zone capacity
>> (sects_from_zone_start == zsze), then you accept 0 sectors ? What am I
>> missing here ?
>>
> Your condition will not even touch this function. This function comes
> into play only when the bio runs across the zone capacity as I mentioned
> before.
>>> +	bio->bi_iter.bi_sector = sect;
>>> +}
>>> +
>>> +static int handle_zone_boundary_violation(struct dmz_npo2_target *dmh,
>>> +					  struct bio *bio,
>>> +					  enum dmz_npo2_io_cond cond)
>>> +{
>>> +	/* Read should return zeroed page */
>>> +	if (bio_op(bio) == REQ_OP_READ) {
>>> +		if (cond == DMZ_NPO2_IO_ACROSS_ZONE) {
>>> +			split_io_across_zone_boundary(dmh, bio);
>>> +			return DM_MAPIO_REMAPPED;
>>> +		}
>>> +		zero_fill_bio(bio);
>>> +		bio_endio(bio);
>>> +		return DM_MAPIO_SUBMITTED;
>>> +	}
>>> +	return DM_MAPIO_KILL;
>>> +}
>>> +
>>> +static int dmz_npo2_end_io(struct dm_target *ti, struct bio *bio,
>>> +			   blk_status_t *error)
>>> +{
>>> +	struct dmz_npo2_target *dmh = ti->private;
>>> +
>>> +	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
>>> +		bio->bi_iter.bi_sector =
>>> +			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
>>> +
>>> +	return DM_ENDIO_DONE;
>>> +}
>>> +
>>> +static int dmz_npo2_map(struct dm_target *ti, struct bio *bio)
>>> +{
>>> +	struct dmz_npo2_target *dmh = ti->private;
>>> +	enum dmz_npo2_io_cond cond;
>>> +
>>> +	bio_set_dev(bio, dmh->dev->bdev);
>>> +	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
>>> +		cond = check_zone_boundary_violation(dmh, bio->bi_iter.bi_sector,
>>> +						     bio->bi_iter.bi_size >> SECTOR_SHIFT);
>>
>> Why check this for zone management BIOs ? These have length = 0, always.
>>
>>> +
>>> +		/*
>>> +		 * If the starting sector is in the emulated area then fill
>>> +		 * all the bio with zeros. If bio is across boundaries,
>>> +		 * split the bio across boundaries and fill zeros only for the
>>> +		 * bio that is outside the zone capacity
>>> +		 */
>>> +		switch (cond) {
>>> +		case DMZ_NPO2_IO_INSIDE_ZONE:
>>> +			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
>>> +								       bio->bi_iter.bi_sector);
>>> +			break;
>>> +		case DMZ_NPO2_IO_ACROSS_ZONE:
>>> +		case DMZ_NPO2_IO_OUTSIDE_ZONE:
>>> +			return handle_zone_boundary_violation(dmh, bio, cond);
>>> +		}
>>> +	}
>>> +	return DM_MAPIO_REMAPPED;
>>
>> This entire function is very hard to read because everything is hidden in
>> helpers that are not super useful in my opinion. I would prefer seeing
>> cases for:
>> * zone management BIOs
>> * Reads and writes
>> * Everything else
>>
>> where tests against the bio sector and length are visible, so one can
>> understand what is going on. If you need helpers, have handle_zone_mgmt(),
>> handle_read() etc. Something clear.
>>
> Got it. I see the confusion here. I will rearrange it properly in the
> next revision. Thanks for this comment.
>>> +}
>>> +
>>> +static int dmz_npo2_iterate_devices(struct dm_target *ti,
>>> +				    iterate_devices_callout_fn fn, void *data)
>>> +{
>>> +	struct dmz_npo2_target *dmh = ti->private;
>>> +	sector_t len = 0;
>>> +
>>> +	len = dmh->nr_zones * dmh->zsze;
>>
>> Move this to the declaration instead of setting len to 0 for nothing.
>>
> Ok.
>>> +	return fn(ti, dmh->dev, 0, len, data);
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-16 23:49           ` Damien Le Moal
@ 2022-06-17  5:45             ` Pankaj Raghav
  2022-06-17  6:12               ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-17  5:45 UTC (permalink / raw)
  To: Damien Le Moal, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal

On 2022-06-17 01:49, Damien Le Moal wrote:
>>> Why do you need to change dm_set_zones_restrictions() at all ?
>>>
>> When the device mapper is created, the q->limits gets inherited from the
>> underlying device. The chunk sectors of the target and the device will
>> be the same but we want the chunk sector of the target to be different
>> (rounded to po2) compared to the underlying device's chunk sector. This
>> needs to be done only for the dm-po2z target and not for other targets
>> that uses npo2 zoned devices (like dm-linear). So to perform this
>> operation in a target independent way in dm-zone.c, I chose to always
>> revalidate npo2 zoned device and update the chunk sector and nr_zones in
>> dm_zone_revalidate_cb based on the zone information from the target.
>> This allows to set the limits correctly for dm-po2z target.
> 
> But DM revalidate will be called for the target AFTER it is setup (after
> its gendisk is added). So how can DM revalidate see the incorrect zone
> size ? If that is the case, then the target constructor is broken or
> missing something. DM revalidate zone is generic and only allocates the
> zone bitmaps for the target device. There should be not need at all to
> touch that function.
> 
I think this is a cleaner approach using features flag and io_hints
instead of messing with the revalidate zone function:

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 135c0cc190fb..c97a71e0473f 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1618,6 +1618,9 @@ static int device_not_matches_zone_sectors(struct
dm_target *ti, struct dm_dev *
 	if (!blk_queue_is_zoned(q))
 		return 0;

+	if(dm_target_supports_emulated_zone_size(ti->type))
+		return 0;
+
 	return blk_queue_zone_sectors(q) != *zone_sectors;
 }

diff --git a/drivers/md/dm-zoned-npo2-target.c
b/drivers/md/dm-zoned-npo2-target.c
index dad135964e09..b203be808f09 100644
--- a/drivers/md/dm-zoned-npo2-target.c
+++ b/drivers/md/dm-zoned-npo2-target.c
@@ -187,6 +187,12 @@ static int dmz_npo2_end_io(struct dm_target *ti,
struct bio *bio,
 	return DM_ENDIO_DONE;
 }

+static void dmz_npo2_io_hints(struct dm_target *ti, struct queue_limits
*limits)
+{
+	struct dmz_npo2_target *dmh = ti->private;
+	limits->chunk_sectors = dmh->zsze_po2;
+}
+
 static int dmz_npo2_map(struct dm_target *ti, struct bio *bio)
 {
 	struct dmz_npo2_target *dmh = ti->private;
@@ -233,12 +239,13 @@ static int dmz_npo2_iterate_devices(struct
dm_target *ti,
 static struct target_type dmz_npo2_target = {
 	.name = "zoned-npo2",
 	.version = { 1, 0, 0 },
-	.features = DM_TARGET_ZONED_HM,
+	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONE_SIZE,
 	.map = dmz_npo2_map,
 	.end_io = dmz_npo2_end_io,
 	.report_zones = dmz_npo2_report_zones,
 	.iterate_devices = dmz_npo2_iterate_devices,
 	.module = THIS_MODULE,
+	.io_hints = dmz_npo2_io_hints,
 	.ctr = dmz_npo2_ctr,
 };

diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index c2a3758c4aaa..9f3a4d98a22a 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -283,6 +283,15 @@ struct target_type {
 #define dm_target_supports_mixed_zoned_model(type) (false)
 #endif

+#ifdef CONFIG_BLK_DEV_ZONED
+#define DM_TARGET_EMULATED_ZONE_SIZE	0x00000400
+#define dm_target_supports_emulated_zone_size(type) \
+	((type)->features & DM_TARGET_EMULATED_ZONE_SIZE)
+#else
+#define DM_TARGET_EMULATED_ZONE_SIZE	0x00000000
+#define dm_target_supports_emulated_zone_size(type) (false)
+#endif
+
 struct dm_target {
 	struct dm_table *table;
 	struct target_type *type;


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-17  5:45             ` Pankaj Raghav
@ 2022-06-17  6:12               ` Damien Le Moal
  2022-06-17  6:40                 ` Pankaj Raghav
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2022-06-17  6:12 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal

On 6/17/22 14:45, Pankaj Raghav wrote:
> On 2022-06-17 01:49, Damien Le Moal wrote:
>>>> Why do you need to change dm_set_zones_restrictions() at all ?
>>>>
>>> When the device mapper is created, the q->limits gets inherited from the
>>> underlying device. The chunk sectors of the target and the device will
>>> be the same but we want the chunk sector of the target to be different
>>> (rounded to po2) compared to the underlying device's chunk sector. This
>>> needs to be done only for the dm-po2z target and not for other targets
>>> that uses npo2 zoned devices (like dm-linear). So to perform this
>>> operation in a target independent way in dm-zone.c, I chose to always
>>> revalidate npo2 zoned device and update the chunk sector and nr_zones in
>>> dm_zone_revalidate_cb based on the zone information from the target.
>>> This allows to set the limits correctly for dm-po2z target.
>>
>> But DM revalidate will be called for the target AFTER it is setup (after
>> its gendisk is added). So how can DM revalidate see the incorrect zone
>> size ? If that is the case, then the target constructor is broken or
>> missing something. DM revalidate zone is generic and only allocates the
>> zone bitmaps for the target device. There should be not need at all to
>> touch that function.
>>
> I think this is a cleaner approach using features flag and io_hints
> instead of messing with the revalidate zone function:
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 135c0cc190fb..c97a71e0473f 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1618,6 +1618,9 @@ static int device_not_matches_zone_sectors(struct
> dm_target *ti, struct dm_dev *
>  	if (!blk_queue_is_zoned(q))
>  		return 0;
> 
> +	if(dm_target_supports_emulated_zone_size(ti->type))
> +		return 0;
> +

This should be in validate_hardware_zoned_model(), not here.

>  	return blk_queue_zone_sectors(q) != *zone_sectors;
>  }
> 
> diff --git a/drivers/md/dm-zoned-npo2-target.c
> b/drivers/md/dm-zoned-npo2-target.c
> index dad135964e09..b203be808f09 100644
> --- a/drivers/md/dm-zoned-npo2-target.c
> +++ b/drivers/md/dm-zoned-npo2-target.c
> @@ -187,6 +187,12 @@ static int dmz_npo2_end_io(struct dm_target *ti,
> struct bio *bio,
>  	return DM_ENDIO_DONE;
>  }
> 
> +static void dmz_npo2_io_hints(struct dm_target *ti, struct queue_limits
> *limits)
> +{
> +	struct dmz_npo2_target *dmh = ti->private;
> +	limits->chunk_sectors = dmh->zsze_po2;
> +}
> +
>  static int dmz_npo2_map(struct dm_target *ti, struct bio *bio)
>  {
>  	struct dmz_npo2_target *dmh = ti->private;
> @@ -233,12 +239,13 @@ static int dmz_npo2_iterate_devices(struct
> dm_target *ti,
>  static struct target_type dmz_npo2_target = {
>  	.name = "zoned-npo2",
>  	.version = { 1, 0, 0 },
> -	.features = DM_TARGET_ZONED_HM,
> +	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONE_SIZE,
>  	.map = dmz_npo2_map,
>  	.end_io = dmz_npo2_end_io,
>  	.report_zones = dmz_npo2_report_zones,
>  	.iterate_devices = dmz_npo2_iterate_devices,
>  	.module = THIS_MODULE,
> +	.io_hints = dmz_npo2_io_hints,
>  	.ctr = dmz_npo2_ctr,
>  };
> 
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index c2a3758c4aaa..9f3a4d98a22a 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -283,6 +283,15 @@ struct target_type {
>  #define dm_target_supports_mixed_zoned_model(type) (false)
>  #endif
> 
> +#ifdef CONFIG_BLK_DEV_ZONED
> +#define DM_TARGET_EMULATED_ZONE_SIZE	0x00000400

Make it general: DM_TARGET_EMULATED_ZONES

> +#define dm_target_supports_emulated_zone_size(type) \
> +	((type)->features & DM_TARGET_EMULATED_ZONE_SIZE)
> +#else
> +#define DM_TARGET_EMULATED_ZONE_SIZE	0x00000000
> +#define dm_target_supports_emulated_zone_size(type) (false)
> +#endif
> +
>  struct dm_target {
>  	struct dm_table *table;
>  	struct target_type *type;
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-17  6:12               ` Damien Le Moal
@ 2022-06-17  6:40                 ` Pankaj Raghav
  2022-06-17  6:56                   ` Damien Le Moal
  0 siblings, 1 reply; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-17  6:40 UTC (permalink / raw)
  To: Damien Le Moal, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block, Damien Le Moal

On 2022-06-17 08:12, Damien Le Moal wrote:
>> I think this is a cleaner approach using features flag and io_hints
>> instead of messing with the revalidate zone function:
>>
>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 135c0cc190fb..c97a71e0473f 100644
>> --- a/drivers/md/dm-table.c
>> +++ b/drivers/md/dm-table.c
>> @@ -1618,6 +1618,9 @@ static int device_not_matches_zone_sectors(struct
>> dm_target *ti, struct dm_dev *
>>  	if (!blk_queue_is_zoned(q))
>>  		return 0;
>>
>> +	if(dm_target_supports_emulated_zone_size(ti->type))
>> +		return 0;
>> +
> 
> This should be in validate_hardware_zoned_model(), not here.
> 
I am not sure about this comment. We need to peek into the individual
target from the table to check for this feature right?

if (dm_table_any_dev_attr(table, device_not_matches_zone_sectors,
&zone_sectors)) {
	DMERR("%s: zone sectors is not consistent across all zoned devices",
        dm_device_name(table->md));
	return -EINVAL;
	}

So we call this function device_not_matches_zone_sectors() from
validate_hardware_zoned_model() for each target and we let the validate
succeed even if the target's zone size is different from the underlying
device zone size if this feature flag is set. Let me know if I am
missing something and how this can be moved to
validate_hardware_zoned_model().


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-17  6:40                 ` Pankaj Raghav
@ 2022-06-17  6:56                   ` Damien Le Moal
  2022-06-17  8:03                     ` Pankaj Raghav
  0 siblings, 1 reply; 37+ messages in thread
From: Damien Le Moal @ 2022-06-17  6:56 UTC (permalink / raw)
  To: Pankaj Raghav, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block

On 6/17/22 15:40, Pankaj Raghav wrote:
> On 2022-06-17 08:12, Damien Le Moal wrote:
>>> I think this is a cleaner approach using features flag and io_hints
>>> instead of messing with the revalidate zone function:
>>>
>>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>>> index 135c0cc190fb..c97a71e0473f 100644
>>> --- a/drivers/md/dm-table.c
>>> +++ b/drivers/md/dm-table.c
>>> @@ -1618,6 +1618,9 @@ static int device_not_matches_zone_sectors(struct
>>> dm_target *ti, struct dm_dev *
>>>  	if (!blk_queue_is_zoned(q))
>>>  		return 0;
>>>
>>> +	if(dm_target_supports_emulated_zone_size(ti->type))
>>> +		return 0;
>>> +
>>
>> This should be in validate_hardware_zoned_model(), not here.
>>
> I am not sure about this comment. We need to peek into the individual
> target from the table to check for this feature right?
> 
> if (dm_table_any_dev_attr(table, device_not_matches_zone_sectors,
> &zone_sectors)) {
> 	DMERR("%s: zone sectors is not consistent across all zoned devices",
>         dm_device_name(table->md));
> 	return -EINVAL;
> 	}
> 
> So we call this function device_not_matches_zone_sectors() from
> validate_hardware_zoned_model() for each target and we let the validate
> succeed even if the target's zone size is different from the underlying
> device zone size if this feature flag is set. Let me know if I am
> missing something and how this can be moved to
> validate_hardware_zoned_model().

Your change does not match the function name
device_not_matches_zone_sectors(), at all. So I think this is wrong.

The fact is that zone support in DM has been built under the following
assumptions:
1) A zoned device can be used to create a *zoned* target (e.g. dm-linear,
dm-flakey, dm-crypt). For this case, the target *must* use the same zone
size as the underlying devices and all devices used for the target must
have the same zone size.
2) A zoned device can be used to create a *regular* device target (e.g.
dm-zoned). All zoned devices used for the target must have the same zone size.

This new target driver completely breaks (1) and does not fit with (2). I
suspect this is why you are seeing problems with dm_revalidate_zones() as
that one uses the underlying device instead of the target report zones.

Based on this analysis, validate_hardware_zoned_model() definitely needs
to be changed. But device_not_matches_zone_sectors() is to check the
assumptions (1) and (2) so changing it for your new case is wrong in my
opinion. You need another set of assumptions (3) (define that well please)
and modify validate_hardware_zoned_model() so that the defined constraints
are checked. Using a target flag to indicate the type of zoned target is
fine by me.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v7 13/13] dm: add non power of 2 zoned target
  2022-06-17  6:56                   ` Damien Le Moal
@ 2022-06-17  8:03                     ` Pankaj Raghav
  0 siblings, 0 replies; 37+ messages in thread
From: Pankaj Raghav @ 2022-06-17  8:03 UTC (permalink / raw)
  To: Damien Le Moal, hch, snitzer, axboe
  Cc: bvanassche, linux-kernel, jiangbo.365, hare, pankydev8, dm-devel,
	jonathan.derrick, gost.dev, dsterba, jaegeuk, linux-nvme,
	Johannes.Thumshirn, linux-block

On 2022-06-17 08:56, Damien Le Moal wrote:
>>
>> So we call this function device_not_matches_zone_sectors() from
>> validate_hardware_zoned_model() for each target and we let the validate
>> succeed even if the target's zone size is different from the underlying
>> device zone size if this feature flag is set. Let me know if I am
>> missing something and how this can be moved to
>> validate_hardware_zoned_model().
> 
> Your change does not match the function name
> device_not_matches_zone_sectors(), at all. So I think this is wrong.
> 
> The fact is that zone support in DM has been built under the following
> assumptions:
> 1) A zoned device can be used to create a *zoned* target (e.g. dm-linear,
> dm-flakey, dm-crypt). For this case, the target *must* use the same zone
> size as the underlying devices and all devices used for the target must
> have the same zone size.
> 2) A zoned device can be used to create a *regular* device target (e.g.
> dm-zoned). All zoned devices used for the target must have the same zone size.
> 
> This new target driver completely breaks (1) and does not fit with (2). I
> suspect this is why you are seeing problems with dm_revalidate_zones() as
> that one uses the underlying device instead of the target report zones.
> 
> Based on this analysis, validate_hardware_zoned_model() definitely needs
> to be changed. But device_not_matches_zone_sectors() is to check the
> assumptions (1) and (2) so changing it for your new case is wrong in my
> opinion. You need another set of assumptions (3) (define that well please)
> and modify validate_hardware_zoned_model() so that the defined constraints
> are checked. Using a target flag to indicate the type of zoned target is
> fine by me.
> 
Got it. Thanks for the explanation. Renaming
device_not_matches_zone_sectors() function to something meaningful with
my changes should address what you have pointed out to accommodate all
three types.

I see that something similar was done to dm_table_supports_zoned_model()
to accommodate type 2(dm-zoned) with different underlying zoned models
even though the initial impl. supported only type 1.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-06-17  8:03 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20220615101924eucas1p27fbce623c0e1b3097169bf23dd6266d8@eucas1p2.samsung.com>
2022-06-15 10:19 ` [PATCH v7 00/13] support non power of 2 zoned device Pankaj Raghav
     [not found]   ` <CGME20220615101927eucas1p17220c7a36f69f59ff8ddd560b42967ec@eucas1p1.samsung.com>
2022-06-15 10:19     ` [PATCH v7 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
2022-06-15 20:18       ` [dm-devel] " Bart Van Assche
     [not found]   ` <CGME20220615101931eucas1p15ed09ae433a2c378b599e9086130d8eb@eucas1p1.samsung.com>
2022-06-15 10:19     ` [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
2022-06-15 20:28       ` [dm-devel] " Bart Van Assche
2022-06-16 10:09         ` Pankaj Raghav
2022-06-16 16:04           ` Luis Chamberlain
2022-06-16 23:30           ` Damien Le Moal
     [not found]   ` <CGME20220615101935eucas1p26a7bc245d88a89312158d7a265f64aef@eucas1p2.samsung.com>
2022-06-15 10:19     ` [PATCH v7 03/13] nvme: zns: Allow ZNS drives that have non-power_of_2 " Pankaj Raghav
     [not found]   ` <CGME20220615101938eucas1p26ab159a1ffd0fa5a16d7f202ba7206e7@eucas1p2.samsung.com>
2022-06-15 10:19     ` [PATCH v7 04/13] nvmet: Allow ZNS target to support non-power_of_2 zone sizes Pankaj Raghav
     [not found]   ` <CGME20220615101941eucas1p25e1c27b363e6b288b848521298e31705@eucas1p2.samsung.com>
2022-06-15 10:19     ` [PATCH v7 05/13] null_blk: allow non power of 2 zoned devices Pankaj Raghav
     [not found]   ` <CGME20220615101945eucas1p16fa264e81d9b6027ff131dd311ed91e2@eucas1p1.samsung.com>
2022-06-15 10:19     ` [PATCH v7 06/13] null_blk: use zone_size_sects_shift for " Pankaj Raghav
2022-06-15 11:56       ` Damien Le Moal
     [not found]   ` <CGME20220615101948eucas1p2d8d801735c39b25256a019134adb0c6f@eucas1p2.samsung.com>
2022-06-15 10:19     ` [PATCH v7 07/13] zonefs: allow non " Pankaj Raghav
     [not found]   ` <CGME20220615101951eucas1p238eb45e563bd9645af81bf16c56d98ec@eucas1p2.samsung.com>
2022-06-15 10:19     ` [PATCH v7 08/13] dm-zoned: ensure only power of 2 zone sizes are allowed Pankaj Raghav
     [not found]   ` <CGME20220615101955eucas1p19b9d42ead7331f69f7dad1ec100312c2@eucas1p1.samsung.com>
2022-06-15 10:19     ` [PATCH v7 09/13] dm-zone: use generic helpers to calculate offset from zone start Pankaj Raghav
     [not found]   ` <CGME20220615102000eucas1p27720aaa3c309327b2b9a33c5f840f498@eucas1p2.samsung.com>
2022-06-15 10:19     ` [PATCH v7 10/13] dm-table: use bdev_is_zone_start helper in device_area_is_invalid() Pankaj Raghav
2022-06-15 11:53       ` Damien Le Moal
2022-06-16  9:55         ` Pankaj Raghav
2022-06-16 23:29           ` Damien Le Moal
     [not found]   ` <CGME20220615102004eucas1p1e458ea097d381058b16fc6daa3eec998@eucas1p1.samsung.com>
2022-06-15 10:19     ` [PATCH v7 11/13] dm-table: allow non po2 zoned devices Pankaj Raghav
     [not found]   ` <CGME20220615102007eucas1p1106f9520e2a86beb3792107dffd8071b@eucas1p1.samsung.com>
2022-06-15 10:19     ` [PATCH v7 12/13] dm: call dm_zone_endio after the target endio callback for " Pankaj Raghav
2022-06-15 11:01       ` [dm-devel] " Damien Le Moal
2022-06-16 12:24         ` Pankaj Raghav
2022-06-16 23:33           ` Damien Le Moal
     [not found]   ` <CGME20220615102011eucas1p220368db4a186181b1927dea50a79e5d4@eucas1p2.samsung.com>
2022-06-15 10:19     ` [PATCH v7 13/13] dm: add non power of 2 zoned target Pankaj Raghav
2022-06-15 11:49       ` Damien Le Moal
2022-06-16 16:12         ` Pankaj Raghav
2022-06-16 23:49           ` Damien Le Moal
2022-06-17  5:45             ` Pankaj Raghav
2022-06-17  6:12               ` Damien Le Moal
2022-06-17  6:40                 ` Pankaj Raghav
2022-06-17  6:56                   ` Damien Le Moal
2022-06-17  8:03                     ` Pankaj Raghav
2022-06-15 14:19       ` kernel test robot
2022-06-15 19:54       ` Randy Dunlap
2022-06-16 10:28         ` Pankaj Raghav

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).