All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 00/11] support non power of 2 zoned device
       [not found] <CGME20220727162246eucas1p1a758799f13d36ba99d30bf92cc5e2754@eucas1p1.samsung.com>
@ 2022-07-27 16:22   ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2").

NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, zone cap is not the same as the
po2 zone size. When the zone cap != zone size, then unmapped LBAs are
introduced to cover the space between the zone cap and zone size. po2
requirement does not make sense for these type of zone storage devices.
This patch series aims to remove these unmapped LBAs for zoned devices when
zone cap is npo2. This is done by relaxing the po2 zone size constraint
in the kernel and allowing zoned device with npo2 zone sizes if zone cap
== zone size.

Removing the po2 requirement from zone storage should be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [1].

There were two efforts previously to add support to npo2 devices: 1) via
device level emulation [2] but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[3] 2)
adding support to the complete stack by removing the constraint in the
block layer and NVMe layer with support to btrfs, zonefs, etc which was
rejected with a conclusion to add a dm target for FS support [0]
to reduce the regression impact.

This series adds support to npo2 zoned devices in the block and nvme
layer and a new **dm target** is added: dm-po2z-target. This new
target will be initially used for filesystems such as btrfs and
f2fs that does not have native npo2 zone support.

- Patchset description:
Patches 1-2 deals with removing the po2 constraint from the
block layer.

Patches 3-4 deals with removing the constraint from nvme zns.

Patch 5 removes the po2 contraint in null blk

Patch 6 adds npo2 support to zonefs

Patches 7-11 adds support for npo2 zoned devices in the DM layer and
adds a new target dm-po2z-target which converts a npo2 zoned device into
a po2 zoned target.

The patch series is based on linux-next tag: next-20220727

Testing:
The new target was tested with blktest and zonefs test suite in qemu and
on a real ZNS device.

[0] https://lore.kernel.org/lkml/PH0PR04MB74166C87F694B150A5AE0F009BD09@PH0PR04MB7416.namprd04.prod.outlook.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[2] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[3] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
  (Damien)
- q is never null in from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
  drives (Damien)
- Rounddown operation should be made generic to work in 32 bits arch
  (bart)
- Add comments where generic calculation is directly used instead having
  special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs to be 1M
  instead of BTRFS_STRIPE_LEN(David)

Changes since v2:
- Minor formatting changes

Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
  (David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size

Changes since v4:
- Remove direct filesystems support for npo2 devices (Johannes, Hannes,
  Damien)

Changes since v5:
- Use DIV_ROUND_UP* helper instead of round_up as it breaks 32bit arch
  build in null blk(kernel-test-robot, Nathan)
- Use DIV_ROUND_UP_SECTOR_T also in blkdev_nr_zones function instead of
  open coding it with div64_u64
- Added extra condition in dm-zoned and in dm to reject non power of 2
  zone sizes.

Changes since v6:
- Added a new dm target for non power of 2 devices
- Added support for non power of 2 devices in the DM layer.

Changes since v7:
- Improved dm target for non power of 2 zoned devices with some bug
  fixes and rearrangement
- Removed some unnecessary comments.

Luis Chamberlain (1):
  dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (10):
  block: make bdev_nr_zones and disk_zone_no generic for npo2 zsze
  block: allow blk-zoned devices to have non-power-of-2 zone size
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  null_blk: allow non power of 2 zoned devices
  zonefs: allow non power of 2 zoned devices
  dm-zone: use generic helpers to calculate offset from zone start
  dm-table: allow non po2 zoned devices
  dm: call dm_zone_endio after the target endio callback for zoned
    devices
  dm: add power-of-2 zoned target for non-power-of-2 zoned devices

 block/blk-core.c                  |   2 +-
 block/blk-zoned.c                 |  37 +++--
 drivers/block/null_blk/main.c     |   5 +-
 drivers/block/null_blk/null_blk.h |   6 +
 drivers/block/null_blk/zoned.c    |  18 ++-
 drivers/md/Kconfig                |   9 ++
 drivers/md/Makefile               |   2 +
 drivers/md/dm-po2z-target.c       | 261 ++++++++++++++++++++++++++++++
 drivers/md/dm-table.c             |  21 ++-
 drivers/md/dm-zone.c              |  10 +-
 drivers/md/dm-zoned-target.c      |   8 +
 drivers/md/dm.c                   |   8 +-
 drivers/nvme/host/zns.c           |  16 +-
 drivers/nvme/target/zns.c         |   3 +-
 fs/zonefs/super.c                 |   6 +-
 fs/zonefs/zonefs.h                |   1 -
 include/linux/blkdev.h            |  91 ++++++++---
 include/linux/device-mapper.h     |   9 ++
 18 files changed, 438 insertions(+), 75 deletions(-)
 create mode 100644 drivers/md/dm-po2z-target.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
@ 2022-07-27 16:22   ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2").

NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, zone cap is not the same as the
po2 zone size. When the zone cap != zone size, then unmapped LBAs are
introduced to cover the space between the zone cap and zone size. po2
requirement does not make sense for these type of zone storage devices.
This patch series aims to remove these unmapped LBAs for zoned devices when
zone cap is npo2. This is done by relaxing the po2 zone size constraint
in the kernel and allowing zoned device with npo2 zone sizes if zone cap
== zone size.

Removing the po2 requirement from zone storage should be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [1].

There were two efforts previously to add support to npo2 devices: 1) via
device level emulation [2] but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[3] 2)
adding support to the complete stack by removing the constraint in the
block layer and NVMe layer with support to btrfs, zonefs, etc which was
rejected with a conclusion to add a dm target for FS support [0]
to reduce the regression impact.

This series adds support to npo2 zoned devices in the block and nvme
layer and a new **dm target** is added: dm-po2z-target. This new
target will be initially used for filesystems such as btrfs and
f2fs that does not have native npo2 zone support.

- Patchset description:
Patches 1-2 deals with removing the po2 constraint from the
block layer.

Patches 3-4 deals with removing the constraint from nvme zns.

Patch 5 removes the po2 contraint in null blk

Patch 6 adds npo2 support to zonefs

Patches 7-11 adds support for npo2 zoned devices in the DM layer and
adds a new target dm-po2z-target which converts a npo2 zoned device into
a po2 zoned target.

The patch series is based on linux-next tag: next-20220727

Testing:
The new target was tested with blktest and zonefs test suite in qemu and
on a real ZNS device.

[0] https://lore.kernel.org/lkml/PH0PR04MB74166C87F694B150A5AE0F009BD09@PH0PR04MB7416.namprd04.prod.outlook.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[2] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[3] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
  (Damien)
- q is never null in from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
  drives (Damien)
- Rounddown operation should be made generic to work in 32 bits arch
  (bart)
- Add comments where generic calculation is directly used instead having
  special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs to be 1M
  instead of BTRFS_STRIPE_LEN(David)

Changes since v2:
- Minor formatting changes

Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
  (David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size

Changes since v4:
- Remove direct filesystems support for npo2 devices (Johannes, Hannes,
  Damien)

Changes since v5:
- Use DIV_ROUND_UP* helper instead of round_up as it breaks 32bit arch
  build in null blk(kernel-test-robot, Nathan)
- Use DIV_ROUND_UP_SECTOR_T also in blkdev_nr_zones function instead of
  open coding it with div64_u64
- Added extra condition in dm-zoned and in dm to reject non power of 2
  zone sizes.

Changes since v6:
- Added a new dm target for non power of 2 devices
- Added support for non power of 2 devices in the DM layer.

Changes since v7:
- Improved dm target for non power of 2 zoned devices with some bug
  fixes and rearrangement
- Removed some unnecessary comments.

Luis Chamberlain (1):
  dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (10):
  block: make bdev_nr_zones and disk_zone_no generic for npo2 zsze
  block: allow blk-zoned devices to have non-power-of-2 zone size
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  null_blk: allow non power of 2 zoned devices
  zonefs: allow non power of 2 zoned devices
  dm-zone: use generic helpers to calculate offset from zone start
  dm-table: allow non po2 zoned devices
  dm: call dm_zone_endio after the target endio callback for zoned
    devices
  dm: add power-of-2 zoned target for non-power-of-2 zoned devices

 block/blk-core.c                  |   2 +-
 block/blk-zoned.c                 |  37 +++--
 drivers/block/null_blk/main.c     |   5 +-
 drivers/block/null_blk/null_blk.h |   6 +
 drivers/block/null_blk/zoned.c    |  18 ++-
 drivers/md/Kconfig                |   9 ++
 drivers/md/Makefile               |   2 +
 drivers/md/dm-po2z-target.c       | 261 ++++++++++++++++++++++++++++++
 drivers/md/dm-table.c             |  21 ++-
 drivers/md/dm-zone.c              |  10 +-
 drivers/md/dm-zoned-target.c      |   8 +
 drivers/md/dm.c                   |   8 +-
 drivers/nvme/host/zns.c           |  16 +-
 drivers/nvme/target/zns.c         |   3 +-
 fs/zonefs/super.c                 |   6 +-
 fs/zonefs/zonefs.h                |   1 -
 include/linux/blkdev.h            |  91 ++++++++---
 include/linux/device-mapper.h     |   9 ++
 18 files changed, 438 insertions(+), 75 deletions(-)
 create mode 100644 drivers/md/dm-po2z-target.c

-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH v8 01/11] block: make bdev_nr_zones and disk_zone_no generic for npo2 zsze
       [not found]   ` <CGME20220727162247eucas1p203fc14aa17ecbcb3e6215d5304bb0c85@eucas1p2.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav, Luis Chamberlain, Adam Manzanares

Adapt bdev_nr_zones and disk_zone_no function so that it can
also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c      | 13 +++++++++----
 include/linux/blkdev.h |  8 +++++++-
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index a264621d4905..dce9c95b4bcd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -111,17 +111,22 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
  * bdev_nr_zones - Get number of zones
  * @bdev:	Target device
  *
- * Return the total number of zones of a zoned block device.  For a block
- * device without zone capabilities, the number of zones is always 0.
+ * Return the total number of zones of a zoned block device, including the
+ * eventual small last zone if present. For a block device without zone
+ * capabilities, the number of zones is always 0.
  */
 unsigned int bdev_nr_zones(struct block_device *bdev)
 {
 	sector_t zone_sectors = bdev_zone_sectors(bdev);
+	sector_t capacity = bdev_nr_sectors(bdev);
 
 	if (!bdev_is_zoned(bdev))
 		return 0;
-	return (bdev_nr_sectors(bdev) + zone_sectors - 1) >>
-		ilog2(zone_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	return DIV_ROUND_UP_SECTOR_T(capacity, zone_sectors);
 }
 EXPORT_SYMBOL_GPL(bdev_nr_zones);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dccdf1551c62..85b832908f28 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -673,9 +673,15 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
 
 static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
+	sector_t zone_sectors = disk->queue->limits.chunk_sectors;
+
 	if (!blk_queue_is_zoned(disk->queue))
 		return 0;
-	return sector >> ilog2(disk->queue->limits.chunk_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return sector >> ilog2(zone_sectors);
+
+	return div64_u64(sector, zone_sectors);
 }
 
 static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 01/11] block: make bdev_nr_zones and disk_zone_no generic for npo2 zsze
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, Adam Manzanares, jaegeuk,
	matias.bjorling, Luis Chamberlain

Adapt bdev_nr_zones and disk_zone_no function so that it can
also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c      | 13 +++++++++----
 include/linux/blkdev.h |  8 +++++++-
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index a264621d4905..dce9c95b4bcd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -111,17 +111,22 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
  * bdev_nr_zones - Get number of zones
  * @bdev:	Target device
  *
- * Return the total number of zones of a zoned block device.  For a block
- * device without zone capabilities, the number of zones is always 0.
+ * Return the total number of zones of a zoned block device, including the
+ * eventual small last zone if present. For a block device without zone
+ * capabilities, the number of zones is always 0.
  */
 unsigned int bdev_nr_zones(struct block_device *bdev)
 {
 	sector_t zone_sectors = bdev_zone_sectors(bdev);
+	sector_t capacity = bdev_nr_sectors(bdev);
 
 	if (!bdev_is_zoned(bdev))
 		return 0;
-	return (bdev_nr_sectors(bdev) + zone_sectors - 1) >>
-		ilog2(zone_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	return DIV_ROUND_UP_SECTOR_T(capacity, zone_sectors);
 }
 EXPORT_SYMBOL_GPL(bdev_nr_zones);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dccdf1551c62..85b832908f28 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -673,9 +673,15 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
 
 static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
+	sector_t zone_sectors = disk->queue->limits.chunk_sectors;
+
 	if (!blk_queue_is_zoned(disk->queue))
 		return 0;
-	return sector >> ilog2(disk->queue->limits.chunk_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return sector >> ilog2(zone_sectors);
+
+	return div64_u64(sector, zone_sectors);
 }
 
 static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
       [not found]   ` <CGME20220727162248eucas1p2ff8c3c2b021bedcae3960024b4e269e9@eucas1p2.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav, Luis Chamberlain

Checking if a given sector is aligned to a zone is a common
operation that is performed for zoned devices. Add
bdev_is_zone_start helper to check for this instead of opencoding it
everywhere.

Convert the calculations on zone size to be generic instead of relying on
power_of_2 based logic in the block layer using the helpers wherever
possible.

The only hot path affected by this change for power_of_2 zoned devices
is in blk_check_zone_append() but bdev_is_zone_start() helper is
used to optimize the calculation for po2 zone sizes. Note that the append
path cannot be accessed by direct raw access to the block device but only
through a filesystem abstraction.

Finally, allow non power of 2 zoned devices provided that their zone
capacity and zone size are equal. The main motivation to allow non
power_of_2 zoned device is to remove the unmapped LBA between zcap and
zsze for devices that cannot have a power_of_2 zcap.

To make this work bdev_get_queue(), bdev_zone_sectors() and
bdev_is_zoned() are moved earlier without modifications.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-core.c       |  2 +-
 block/blk-zoned.c      | 24 +++++++++---
 include/linux/blkdev.h | 84 ++++++++++++++++++++++++++++++------------
 3 files changed, 79 insertions(+), 31 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3d286a256d3d..1f7e9a90e198 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -570,7 +570,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
+	if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
 	    !bio_zone_is_seq(bio))
 		return BLK_STS_IOERR;
 
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index dce9c95b4bcd..a01a231ad328 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
 		return -EINVAL;
 
 	/* Check alignment (handle eventual smaller last zone) */
-	if (sector & (zone_sectors - 1))
+	if (!bdev_is_zone_aligned(bdev, sector))
 		return -EINVAL;
 
-	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+	if (!bdev_is_zone_aligned(bdev, nr_sectors) && end_sector != capacity)
 		return -EINVAL;
 
 	/*
@@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	 * smaller last zone.
 	 */
 	if (zone->start == 0) {
-		if (zone->len == 0 || !is_power_of_2(zone->len)) {
-			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
-				disk->disk_name, zone->len);
+		if (zone->len == 0) {
+			pr_warn("%s: Invalid zone size", disk->disk_name);
+			return -ENODEV;
+		}
+
+		/*
+		 * Non power-of-2 zone size support was added to remove the
+		 * gap between zone capacity and zone size. Though it is technically
+		 * possible to have gaps in a non power-of-2 device, Linux requires
+		 * the zone size to be equal to zone capacity for non power-of-2
+		 * zoned devices.
+		 */
+		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
+				disk->disk_name);
 			return -ENODEV;
 		}
 
 		args->zone_sectors = zone->len;
-		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
 	} else if (zone->start + args->zone_sectors < capacity) {
 		if (zone->len != args->zone_sectors) {
 			pr_warn("%s: Invalid zoned device with non constant zone size\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 85b832908f28..1be805223026 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -634,6 +634,11 @@ static inline bool queue_is_mq(struct request_queue *q)
 	return q->mq_ops;
 }
 
+static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
+{
+	return bdev->bd_queue;	/* this is never NULL */
+}
+
 #ifdef CONFIG_PM
 static inline enum rpm_status queue_rpm_status(struct request_queue *q)
 {
@@ -665,6 +670,25 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
 	}
 }
 
+static inline bool bdev_is_zoned(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return blk_queue_is_zoned(q);
+
+	return false;
+}
+
+static inline sector_t bdev_zone_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!blk_queue_is_zoned(q))
+		return 0;
+	return q->limits.chunk_sectors;
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline unsigned int disk_nr_zones(struct gendisk *disk)
 {
@@ -684,6 +708,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 	return div64_u64(sector, zone_sectors);
 }
 
+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+						   sector_t sec)
+{
+	sector_t zone_sectors = bdev_zone_sectors(bdev);
+	u64 remainder = 0;
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	if (is_power_of_2(zone_sectors))
+		return sec & (zone_sectors - 1);
+
+	div64_u64_rem(sec, zone_sectors, &remainder);
+	return remainder;
+}
+
+static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
+{
+	if (!bdev_is_zoned(bdev))
+		return false;
+
+	return bdev_offset_from_zone_start(bdev, sec) == 0;
+}
+
 static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
 {
 	if (!blk_queue_is_zoned(disk->queue))
@@ -728,6 +776,18 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
 	return 0;
 }
+
+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+						   sector_t sec)
+{
+	return 0;
+}
+
+static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
+{
+	return false;
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	return 0;
@@ -891,11 +951,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
 int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
 			unsigned int flags);
 
-static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
-{
-	return bdev->bd_queue;	/* this is never NULL */
-}
-
 /* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
 const char *blk_zone_cond_str(enum blk_zone_cond zone_cond);
 
@@ -1295,25 +1350,6 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 	return BLK_ZONED_NONE;
 }
 
-static inline bool bdev_is_zoned(struct block_device *bdev)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-
-	if (q)
-		return blk_queue_is_zoned(q);
-
-	return false;
-}
-
-static inline sector_t bdev_zone_sectors(struct block_device *bdev)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-
-	if (!blk_queue_is_zoned(q))
-		return 0;
-	return q->limits.chunk_sectors;
-}
-
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

Checking if a given sector is aligned to a zone is a common
operation that is performed for zoned devices. Add
bdev_is_zone_start helper to check for this instead of opencoding it
everywhere.

Convert the calculations on zone size to be generic instead of relying on
power_of_2 based logic in the block layer using the helpers wherever
possible.

The only hot path affected by this change for power_of_2 zoned devices
is in blk_check_zone_append() but bdev_is_zone_start() helper is
used to optimize the calculation for po2 zone sizes. Note that the append
path cannot be accessed by direct raw access to the block device but only
through a filesystem abstraction.

Finally, allow non power of 2 zoned devices provided that their zone
capacity and zone size are equal. The main motivation to allow non
power_of_2 zoned device is to remove the unmapped LBA between zcap and
zsze for devices that cannot have a power_of_2 zcap.

To make this work bdev_get_queue(), bdev_zone_sectors() and
bdev_is_zoned() are moved earlier without modifications.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-core.c       |  2 +-
 block/blk-zoned.c      | 24 +++++++++---
 include/linux/blkdev.h | 84 ++++++++++++++++++++++++++++++------------
 3 files changed, 79 insertions(+), 31 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3d286a256d3d..1f7e9a90e198 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -570,7 +570,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
+	if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
 	    !bio_zone_is_seq(bio))
 		return BLK_STS_IOERR;
 
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index dce9c95b4bcd..a01a231ad328 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
 		return -EINVAL;
 
 	/* Check alignment (handle eventual smaller last zone) */
-	if (sector & (zone_sectors - 1))
+	if (!bdev_is_zone_aligned(bdev, sector))
 		return -EINVAL;
 
-	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+	if (!bdev_is_zone_aligned(bdev, nr_sectors) && end_sector != capacity)
 		return -EINVAL;
 
 	/*
@@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	 * smaller last zone.
 	 */
 	if (zone->start == 0) {
-		if (zone->len == 0 || !is_power_of_2(zone->len)) {
-			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
-				disk->disk_name, zone->len);
+		if (zone->len == 0) {
+			pr_warn("%s: Invalid zone size", disk->disk_name);
+			return -ENODEV;
+		}
+
+		/*
+		 * Non power-of-2 zone size support was added to remove the
+		 * gap between zone capacity and zone size. Though it is technically
+		 * possible to have gaps in a non power-of-2 device, Linux requires
+		 * the zone size to be equal to zone capacity for non power-of-2
+		 * zoned devices.
+		 */
+		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
+				disk->disk_name);
 			return -ENODEV;
 		}
 
 		args->zone_sectors = zone->len;
-		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
 	} else if (zone->start + args->zone_sectors < capacity) {
 		if (zone->len != args->zone_sectors) {
 			pr_warn("%s: Invalid zoned device with non constant zone size\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 85b832908f28..1be805223026 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -634,6 +634,11 @@ static inline bool queue_is_mq(struct request_queue *q)
 	return q->mq_ops;
 }
 
+static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
+{
+	return bdev->bd_queue;	/* this is never NULL */
+}
+
 #ifdef CONFIG_PM
 static inline enum rpm_status queue_rpm_status(struct request_queue *q)
 {
@@ -665,6 +670,25 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
 	}
 }
 
+static inline bool bdev_is_zoned(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return blk_queue_is_zoned(q);
+
+	return false;
+}
+
+static inline sector_t bdev_zone_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!blk_queue_is_zoned(q))
+		return 0;
+	return q->limits.chunk_sectors;
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline unsigned int disk_nr_zones(struct gendisk *disk)
 {
@@ -684,6 +708,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 	return div64_u64(sector, zone_sectors);
 }
 
+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+						   sector_t sec)
+{
+	sector_t zone_sectors = bdev_zone_sectors(bdev);
+	u64 remainder = 0;
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	if (is_power_of_2(zone_sectors))
+		return sec & (zone_sectors - 1);
+
+	div64_u64_rem(sec, zone_sectors, &remainder);
+	return remainder;
+}
+
+static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
+{
+	if (!bdev_is_zoned(bdev))
+		return false;
+
+	return bdev_offset_from_zone_start(bdev, sec) == 0;
+}
+
 static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
 {
 	if (!blk_queue_is_zoned(disk->queue))
@@ -728,6 +776,18 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
 	return 0;
 }
+
+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+						   sector_t sec)
+{
+	return 0;
+}
+
+static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
+{
+	return false;
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	return 0;
@@ -891,11 +951,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
 int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
 			unsigned int flags);
 
-static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
-{
-	return bdev->bd_queue;	/* this is never NULL */
-}
-
 /* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
 const char *blk_zone_cond_str(enum blk_zone_cond zone_cond);
 
@@ -1295,25 +1350,6 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 	return BLK_ZONED_NONE;
 }
 
-static inline bool bdev_is_zoned(struct block_device *bdev)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-
-	if (q)
-		return blk_queue_is_zoned(q);
-
-	return false;
-}
-
-static inline sector_t bdev_zone_sectors(struct block_device *bdev)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-
-	if (!blk_queue_is_zoned(q))
-		return 0;
-	return q->limits.chunk_sectors;
-}
-
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 03/11] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
       [not found]   ` <CGME20220727162249eucas1p28fa44c840e590f6f1b53e0cc12ee3771@eucas1p2.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav, Luis Chamberlain

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

As rounddown macro from math.h does not work for 32 bit architectures,
round down operation is open coded.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/zns.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 12316ab51bda..73e4ad495ae8 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	}
 
 	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
-	if (!is_power_of_2(ns->zsze)) {
-		dev_warn(ns->ctrl->device,
-			"invalid zone size:%llu for namespace:%u\n",
-			ns->zsze, ns->head->ns_id);
-		status = -ENODEV;
-		goto free_data;
-	}
 
 	disk_set_zoned(ns->disk, BLK_ZONED_HM);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 				   sizeof(struct nvme_zone_descriptor);
 
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+			 div64_u64(get_capacity(ns->disk), ns->zsze));
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -182,6 +175,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	int ret, zone_idx = 0;
 	unsigned int nz, i;
 	size_t buflen;
+	u64 remainder = 0;
 
 	if (ns->head->ids.csi != NVME_CSI_ZNS)
 		return -EINVAL;
@@ -197,7 +191,11 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	/*
+	 * Round down the sector value to the nearest zone start
+	 */
+	div64_u64_rem(sector, ns->zsze, &remainder);
+	sector -= remainder;
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 03/11] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

As rounddown macro from math.h does not work for 32 bit architectures,
round down operation is open coded.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/zns.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 12316ab51bda..73e4ad495ae8 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	}
 
 	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
-	if (!is_power_of_2(ns->zsze)) {
-		dev_warn(ns->ctrl->device,
-			"invalid zone size:%llu for namespace:%u\n",
-			ns->zsze, ns->head->ns_id);
-		status = -ENODEV;
-		goto free_data;
-	}
 
 	disk_set_zoned(ns->disk, BLK_ZONED_HM);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 				   sizeof(struct nvme_zone_descriptor);
 
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+			 div64_u64(get_capacity(ns->disk), ns->zsze));
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -182,6 +175,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	int ret, zone_idx = 0;
 	unsigned int nz, i;
 	size_t buflen;
+	u64 remainder = 0;
 
 	if (ns->head->ids.csi != NVME_CSI_ZNS)
 		return -EINVAL;
@@ -197,7 +191,11 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	/*
+	 * Round down the sector value to the nearest zone start
+	 */
+	div64_u64_rem(sector, ns->zsze, &remainder);
+	sector -= remainder;
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 04/11] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
       [not found]   ` <CGME20220727162250eucas1p133e8a814fee934f7161866122ef93273@eucas1p1.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav, Johannes Thumshirn, Luis Chamberlain

A generic bdev_zone_no() helper is added to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.

Use the helper bdev_zone_no() to calculate nr of zones. This let's us
make modifications to the math if needed in one place and adds now
support for npo2 zone devices.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/target/zns.c | 3 +--
 include/linux/blkdev.h    | 5 +++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index c7ef69f29fe4..662f1a92f39b 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -241,8 +241,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
 {
 	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
 
-	return bdev_nr_zones(req->ns->bdev) -
-		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
+	return bdev_nr_zones(req->ns->bdev) - bdev_zone_no(req->ns->bdev, sect);
 }
 
 static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1be805223026..d1ef9b9552ed 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1350,6 +1350,11 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 	return BLK_ZONED_NONE;
 }
 
+static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
+{
+	return disk_zone_no(bdev->bd_disk, sec);
+}
+
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 04/11] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, Johannes Thumshirn, jaegeuk,
	matias.bjorling, Luis Chamberlain

A generic bdev_zone_no() helper is added to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.

Use the helper bdev_zone_no() to calculate nr of zones. This let's us
make modifications to the math if needed in one place and adds now
support for npo2 zone devices.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/target/zns.c | 3 +--
 include/linux/blkdev.h    | 5 +++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index c7ef69f29fe4..662f1a92f39b 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -241,8 +241,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
 {
 	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
 
-	return bdev_nr_zones(req->ns->bdev) -
-		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
+	return bdev_nr_zones(req->ns->bdev) - bdev_zone_no(req->ns->bdev, sect);
 }
 
 static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1be805223026..d1ef9b9552ed 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1350,6 +1350,11 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 	return BLK_ZONED_NONE;
 }
 
+static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
+{
+	return disk_zone_no(bdev->bd_disk, sec);
+}
+
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices
       [not found]   ` <CGME20220727162251eucas1p12939ac3864fd8705ae139eb2d1087d8f@eucas1p1.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav, Luis Chamberlain

Convert the power of 2 based calculation with zone size to be generic in
null_zone_no with optimization for power of 2 based zone sizes.

The nr_zones calculation in null_init_zoned_dev has been replaced with a
division without special handling for power of 2 based zone sizes as
this function is called only during the initialization and will not
invoked in the hot path.

Performance Measurement:

Device:
zone size = 128M, blocksize=4k

FIO cmd:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
x-----------------x---------------------------------x---------------------------------x

Random read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
x-----------------x---------------------------------x---------------------------------x

Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c     |  5 ++---
 drivers/block/null_blk/null_blk.h |  6 ++++++
 drivers/block/null_blk/zoned.c    | 18 +++++++++++-------
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index c451c477978f..f1e0605dee94 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1976,9 +1976,8 @@ static int null_validate_conf(struct nullb_device *dev)
 	if (dev->queue_mode == NULL_Q_BIO)
 		dev->mbps = 0;
 
-	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	if (dev->zoned && !dev->zone_size) {
+		pr_err("Invalid zero zone size\n");
 		return -EINVAL;
 	}
 
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 94ff68052b1e..ece6dded9508 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -83,6 +83,12 @@ struct nullb_device {
 	unsigned int imp_close_zone_no;
 	struct nullb_zone *zones;
 	sector_t zone_size_sects;
+	/*
+	 * zone_size_sects_shift is only useful when the zone size is
+	 * power of 2. This variable is set to zero when zone size is non
+	 * power of 2.
+	 */
+	unsigned int zone_size_sects_shift;
 	bool need_zone_res_mgmt;
 	spinlock_t zone_res_lock;
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 55a69e48ef8b..015f6823706c 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -16,7 +16,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
 
 static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 {
-	return sect >> ilog2(dev->zone_size_sects);
+	if (dev->zone_size_sects_shift)
+		return sect >> dev->zone_size_sects_shift;
+
+	return div64_u64(sect, dev->zone_size_sects);
 }
 
 static inline void null_lock_zone_res(struct nullb_device *dev)
@@ -65,10 +68,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	sector_t sector = 0;
 	unsigned int i;
 
-	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("zone_size must be power-of-two\n");
-		return -EINVAL;
-	}
 	if (dev->zone_size > dev->size) {
 		pr_err("Zone size larger than device capacity\n");
 		return -EINVAL;
@@ -86,9 +85,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
-	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
-		>> ilog2(dev->zone_size_sects);
 
+	if (is_power_of_2(dev->zone_size_sects))
+		dev->zone_size_sects_shift = ilog2(dev->zone_size_sects);
+	else
+		dev->zone_size_sects_shift = 0;
+
+	dev->nr_zones =	DIV_ROUND_UP_SECTOR_T(dev_capacity_sects,
+					      dev->zone_size_sects);
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
 	if (!dev->zones)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

Convert the power of 2 based calculation with zone size to be generic in
null_zone_no with optimization for power of 2 based zone sizes.

The nr_zones calculation in null_init_zoned_dev has been replaced with a
division without special handling for power of 2 based zone sizes as
this function is called only during the initialization and will not
invoked in the hot path.

Performance Measurement:

Device:
zone size = 128M, blocksize=4k

FIO cmd:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
x-----------------x---------------------------------x---------------------------------x

Random read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
x-----------------x---------------------------------x---------------------------------x

Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c     |  5 ++---
 drivers/block/null_blk/null_blk.h |  6 ++++++
 drivers/block/null_blk/zoned.c    | 18 +++++++++++-------
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index c451c477978f..f1e0605dee94 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1976,9 +1976,8 @@ static int null_validate_conf(struct nullb_device *dev)
 	if (dev->queue_mode == NULL_Q_BIO)
 		dev->mbps = 0;
 
-	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	if (dev->zoned && !dev->zone_size) {
+		pr_err("Invalid zero zone size\n");
 		return -EINVAL;
 	}
 
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 94ff68052b1e..ece6dded9508 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -83,6 +83,12 @@ struct nullb_device {
 	unsigned int imp_close_zone_no;
 	struct nullb_zone *zones;
 	sector_t zone_size_sects;
+	/*
+	 * zone_size_sects_shift is only useful when the zone size is
+	 * power of 2. This variable is set to zero when zone size is non
+	 * power of 2.
+	 */
+	unsigned int zone_size_sects_shift;
 	bool need_zone_res_mgmt;
 	spinlock_t zone_res_lock;
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 55a69e48ef8b..015f6823706c 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -16,7 +16,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
 
 static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 {
-	return sect >> ilog2(dev->zone_size_sects);
+	if (dev->zone_size_sects_shift)
+		return sect >> dev->zone_size_sects_shift;
+
+	return div64_u64(sect, dev->zone_size_sects);
 }
 
 static inline void null_lock_zone_res(struct nullb_device *dev)
@@ -65,10 +68,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	sector_t sector = 0;
 	unsigned int i;
 
-	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("zone_size must be power-of-two\n");
-		return -EINVAL;
-	}
 	if (dev->zone_size > dev->size) {
 		pr_err("Zone size larger than device capacity\n");
 		return -EINVAL;
@@ -86,9 +85,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
-	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
-		>> ilog2(dev->zone_size_sects);
 
+	if (is_power_of_2(dev->zone_size_sects))
+		dev->zone_size_sects_shift = ilog2(dev->zone_size_sects);
+	else
+		dev->zone_size_sects_shift = 0;
+
+	dev->nr_zones =	DIV_ROUND_UP_SECTOR_T(dev_capacity_sects,
+					      dev->zone_size_sects);
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
 	if (!dev->zones)
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 06/11] zonefs: allow non power of 2 zoned devices
       [not found]   ` <CGME20220727162252eucas1p25be8b79231334fa0c759c2475859e93b@eucas1p2.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav, Luis Chamberlain

The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/zonefs/super.c  | 6 ++----
 fs/zonefs/zonefs.h | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 860f0b1032c6..e549ef16738c 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -476,10 +476,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
-	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	unsigned int noio_flag;
 	unsigned int nr_zones =
-		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
 	struct zonefs_ioerr_data err = {
 		.inode = inode,
 		.write = write,
@@ -1401,7 +1400,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	int ret = 0;
 
-	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
 	inode->i_mode = S_IFREG | sbi->s_perm;
 
 	zi->i_ztype = type;
@@ -1776,7 +1775,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	 * interface constraints.
 	 */
 	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
-	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
 	sbi->s_uid = GLOBAL_ROOT_UID;
 	sbi->s_gid = GLOBAL_ROOT_GID;
 	sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3233..39895195cda6 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -177,7 +177,6 @@ struct zonefs_sb_info {
 	kgid_t			s_gid;
 	umode_t			s_perm;
 	uuid_t			s_uuid;
-	unsigned int		s_zone_sectors_shift;
 
 	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 06/11] zonefs: allow non power of 2 zoned devices
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/zonefs/super.c  | 6 ++----
 fs/zonefs/zonefs.h | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 860f0b1032c6..e549ef16738c 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -476,10 +476,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
-	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	unsigned int noio_flag;
 	unsigned int nr_zones =
-		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
 	struct zonefs_ioerr_data err = {
 		.inode = inode,
 		.write = write,
@@ -1401,7 +1400,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	int ret = 0;
 
-	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
 	inode->i_mode = S_IFREG | sbi->s_perm;
 
 	zi->i_ztype = type;
@@ -1776,7 +1775,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	 * interface constraints.
 	 */
 	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
-	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
 	sbi->s_uid = GLOBAL_ROOT_UID;
 	sbi->s_gid = GLOBAL_ROOT_GID;
 	sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3233..39895195cda6 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -177,7 +177,6 @@ struct zonefs_sb_info {
 	kgid_t			s_gid;
 	umode_t			s_perm;
 	uuid_t			s_uuid;
-	unsigned int		s_zone_sectors_shift;
 
 	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
 
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed
       [not found]   ` <CGME20220727162253eucas1p1a5912e0494f6918504cc8ff15ad5d31f@eucas1p1.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

dm-zoned relies on the assumption that the zone size is a
power-of-2(po2) and the zone capacity is same as the zone size.

Ensure only po2 devices can be used as dm-zoned target until a native
non po2 support is added.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-zoned-target.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 95b132b52f33..16499b75c5ee 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
 				return -EINVAL;
 			}
 			zone_nr_sectors = bdev_zone_sectors(bdev);
+			if (!is_power_of_2(zone_nr_sectors)) {
+				ti->error = "Zone size is not power of 2";
+				return -EINVAL;
+			}
 			zoned_dev->zone_nr_sectors = zone_nr_sectors;
 			zoned_dev->nr_zones = bdev_nr_zones(bdev);
 		}
@@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
 			return -EINVAL;
 		}
 		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
+		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
+			ti->error = "Zone size is not power of 2";
+			return -EINVAL;
+		}
 		zoned_dev->nr_zones = bdev_nr_zones(bdev);
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

From: Luis Chamberlain <mcgrof@kernel.org>

dm-zoned relies on the assumption that the zone size is a
power-of-2(po2) and the zone capacity is same as the zone size.

Ensure only po2 devices can be used as dm-zoned target until a native
non po2 support is added.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-zoned-target.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 95b132b52f33..16499b75c5ee 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
 				return -EINVAL;
 			}
 			zone_nr_sectors = bdev_zone_sectors(bdev);
+			if (!is_power_of_2(zone_nr_sectors)) {
+				ti->error = "Zone size is not power of 2";
+				return -EINVAL;
+			}
 			zoned_dev->zone_nr_sectors = zone_nr_sectors;
 			zoned_dev->nr_zones = bdev_nr_zones(bdev);
 		}
@@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
 			return -EINVAL;
 		}
 		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
+		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
+			ti->error = "Zone size is not power of 2";
+			return -EINVAL;
+		}
 		zoned_dev->nr_zones = bdev_nr_zones(bdev);
 	}
 
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 08/11] dm-zone: use generic helpers to calculate offset from zone start
       [not found]   ` <CGME20220727162254eucas1p1fd990f746d9f9870b8d58ee0bd01fedd@eucas1p1.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav, Luis Chamberlain

Use the bdev_offset_from_zone_start() helper function to calculate
the offset from zone start instead of using power of 2 based
calculation.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/md/dm-zone.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 3dafc0e8b7a9..31c16aafdbfc 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -390,7 +390,9 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE:
 		/* Writes must be aligned to the zone write pointer */
-		if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
+		if ((bdev_offset_from_zone_start(md->disk->part0,
+						 clone->bi_iter.bi_sector)) != zwp_offset)
+
 			return false;
 		break;
 	case REQ_OP_ZONE_APPEND:
@@ -602,11 +604,8 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
 		 */
 		if (clone->bi_status == BLK_STS_OK &&
 		    bio_op(clone) == REQ_OP_ZONE_APPEND) {
-			sector_t mask =
-				(sector_t)bdev_zone_sectors(disk->part0) - 1;
-
 			orig_bio->bi_iter.bi_sector +=
-				clone->bi_iter.bi_sector & mask;
+				bdev_offset_from_zone_start(disk->part0, clone->bi_iter.bi_sector);
 		}
 
 		return;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 08/11] dm-zone: use generic helpers to calculate offset from zone start
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

Use the bdev_offset_from_zone_start() helper function to calculate
the offset from zone start instead of using power of 2 based
calculation.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/md/dm-zone.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 3dafc0e8b7a9..31c16aafdbfc 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -390,7 +390,9 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE:
 		/* Writes must be aligned to the zone write pointer */
-		if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
+		if ((bdev_offset_from_zone_start(md->disk->part0,
+						 clone->bi_iter.bi_sector)) != zwp_offset)
+
 			return false;
 		break;
 	case REQ_OP_ZONE_APPEND:
@@ -602,11 +604,8 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
 		 */
 		if (clone->bi_status == BLK_STS_OK &&
 		    bio_op(clone) == REQ_OP_ZONE_APPEND) {
-			sector_t mask =
-				(sector_t)bdev_zone_sectors(disk->part0) - 1;
-
 			orig_bio->bi_iter.bi_sector +=
-				clone->bi_iter.bi_sector & mask;
+				bdev_offset_from_zone_start(disk->part0, clone->bi_iter.bi_sector);
 		}
 
 		return;
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 09/11] dm-table: allow non po2 zoned devices
       [not found]   ` <CGME20220727162255eucas1p2945c6dca42b799bb3b4abf3edb83dde8@eucas1p2.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav

As the block layer now supports non po2 zoned devices, allow dm to
support non po2 zoned device.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-table.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 332f96b58252..534fddfc2b42 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -250,7 +250,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 	if (bdev_is_zoned(bdev)) {
 		unsigned int zone_sectors = bdev_zone_sectors(bdev);
 
-		if (start & (zone_sectors - 1)) {
+		if (!bdev_is_zone_aligned(bdev, start)) {
 			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
 			       dm_device_name(ti->table->md),
 			       (unsigned long long)start,
@@ -267,7 +267,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 		 * devices do not end up with a smaller zone in the middle of
 		 * the sector range.
 		 */
-		if (len & (zone_sectors - 1)) {
+		if (!bdev_is_zone_aligned(bdev, len)) {
 			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
 			       dm_device_name(ti->table->md),
 			       (unsigned long long)len,
@@ -1642,8 +1642,8 @@ static int validate_hardware_zoned_model(struct dm_table *t,
 		return -EINVAL;
 	}
 
-	/* Check zone size validity and compatibility */
-	if (!zone_sectors || !is_power_of_2(zone_sectors))
+	/* Check zone size validity */
+	if (!zone_sectors)
 		return -EINVAL;
 
 	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 09/11] dm-table: allow non po2 zoned devices
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling

As the block layer now supports non po2 zoned devices, allow dm to
support non po2 zoned device.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-table.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 332f96b58252..534fddfc2b42 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -250,7 +250,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 	if (bdev_is_zoned(bdev)) {
 		unsigned int zone_sectors = bdev_zone_sectors(bdev);
 
-		if (start & (zone_sectors - 1)) {
+		if (!bdev_is_zone_aligned(bdev, start)) {
 			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
 			       dm_device_name(ti->table->md),
 			       (unsigned long long)start,
@@ -267,7 +267,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 		 * devices do not end up with a smaller zone in the middle of
 		 * the sector range.
 		 */
-		if (len & (zone_sectors - 1)) {
+		if (!bdev_is_zone_aligned(bdev, len)) {
 			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
 			       dm_device_name(ti->table->md),
 			       (unsigned long long)len,
@@ -1642,8 +1642,8 @@ static int validate_hardware_zoned_model(struct dm_table *t,
 		return -EINVAL;
 	}
 
-	/* Check zone size validity and compatibility */
-	if (!zone_sectors || !is_power_of_2(zone_sectors))
+	/* Check zone size validity */
+	if (!zone_sectors)
 		return -EINVAL;
 
 	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 10/11] dm: call dm_zone_endio after the target endio callback for zoned devices
       [not found]   ` <CGME20220727162256eucas1p284a15532173cce3eca46eee0cee3acdd@eucas1p2.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav

dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
uses either native append or append emulation and it is called before the
endio of the target. But target endio can still update the clone bio
after dm_zone_endio is called, thereby, the orig bio does not contain
the updated information anymore. Call dm_zone_endio for zoned devices
after calling the target's endio function

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 03ac6143b8aa..bc410ee04004 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1123,10 +1123,6 @@ static void clone_endio(struct bio *bio)
 			disable_write_zeroes(md);
 	}
 
-	if (static_branch_unlikely(&zoned_enabled) &&
-	    unlikely(bdev_is_zoned(bio->bi_bdev)))
-		dm_zone_endio(io, bio);
-
 	if (endio) {
 		int r = endio(ti, bio, &error);
 		switch (r) {
@@ -1155,6 +1151,10 @@ static void clone_endio(struct bio *bio)
 		}
 	}
 
+	if (static_branch_unlikely(&zoned_enabled) &&
+	    unlikely(bdev_is_zoned(bio->bi_bdev)))
+		dm_zone_endio(io, bio);
+
 	if (static_branch_unlikely(&swap_bios_enabled) &&
 	    unlikely(swap_bios_limit(ti, bio)))
 		up(&md->swap_bios_semaphore);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 10/11] dm: call dm_zone_endio after the target endio callback for zoned devices
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling

dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
uses either native append or append emulation and it is called before the
endio of the target. But target endio can still update the clone bio
after dm_zone_endio is called, thereby, the orig bio does not contain
the updated information anymore. Call dm_zone_endio for zoned devices
after calling the target's endio function

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 03ac6143b8aa..bc410ee04004 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1123,10 +1123,6 @@ static void clone_endio(struct bio *bio)
 			disable_write_zeroes(md);
 	}
 
-	if (static_branch_unlikely(&zoned_enabled) &&
-	    unlikely(bdev_is_zoned(bio->bi_bdev)))
-		dm_zone_endio(io, bio);
-
 	if (endio) {
 		int r = endio(ti, bio, &error);
 		switch (r) {
@@ -1155,6 +1151,10 @@ static void clone_endio(struct bio *bio)
 		}
 	}
 
+	if (static_branch_unlikely(&zoned_enabled) &&
+	    unlikely(bdev_is_zoned(bio->bi_bdev)))
+		dm_zone_endio(io, bio);
+
 	if (static_branch_unlikely(&swap_bios_enabled) &&
 	    unlikely(swap_bios_limit(ti, bio)))
 		up(&md->swap_bios_semaphore);
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v8 11/11] dm: add power-of-2 zoned target for non-power-of-2 zoned devices
       [not found]   ` <CGME20220727162257eucas1p2848a75c4aa7e559abb5d9ae0fbd374c1@eucas1p2.samsung.com>
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Pankaj Raghav, Johannes Thumshirn, Damien Le Moal

Only power-of-2(po2) zoned devices were supported in linux but now non
power-of-2(npo2) zoned device support has been added to the block layer.

Filesystems such as F2FS and btrfs have support for zoned devices with
po2 zone size assumption. Before adding native support for npo2 zoned
devices, it was suggested to create a dm target for npo2 zoned device to
appear as a po2 target so that file systems can initially work without any
explicit changes by using this target.

The design of this target is very simple: introduce gaps between the
underlying device's zone size and the nearest po2 device zone size.
Device's actual zone size becomes the zone capacity of the target and
device's nearest po2 zone size becomes the target's zone size.

For e.g., a device with a zone size/capacity of 3M will have an equivalent
target layout as follows:

Device layout :-
zone capacity = 3M
zone size = 3M

|--------------|-------------|
0             3M            6M

Target layout :-
zone capacity=3M
zone size = 4M

|--------------|---|--------------|---|
0             3M  4M             7M  8M

All IOs will be remapped from target to the actual device location.

The area between target's zone capacity and zone size will be emulated
in the target.
The read IOs that fall in the emulated gap area will return 0 filled
bio and all the other IOs in that area will result in an error.
If a read IO span across the emulated area boundary, then the IOs are
split across them. All other IO operations that span across the emulated
area boundary will result in an error.

The target can be easily created as follows:
dmsetup create <label> --table '0 <size_sects> po2z /dev/nvme<id>'

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
Suggested-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/Kconfig            |   9 ++
 drivers/md/Makefile           |   2 +
 drivers/md/dm-po2z-target.c   | 261 ++++++++++++++++++++++++++++++++++
 drivers/md/dm-table.c         |  13 +-
 drivers/md/dm-zone.c          |   1 +
 include/linux/device-mapper.h |   9 ++
 6 files changed, 292 insertions(+), 3 deletions(-)
 create mode 100644 drivers/md/dm-po2z-target.c

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 998a5cfdbc4e..d58ccfee765b 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -518,6 +518,15 @@ config DM_FLAKEY
 	help
 	 A target that intermittently fails I/O for debugging purposes.
 
+config DM_PO2Z
+	tristate "Power-of-2 target for non power-of-2 zoned devices"
+	depends on BLK_DEV_DM
+	depends on BLK_DEV_ZONED
+	help
+	  A target that converts a zoned device with non power-of-2 zone size to
+	  be power of 2. This is done by introducing gaps in between the zone
+	  capacity and the power of 2 zone size.
+
 config DM_VERITY
 	tristate "Verity target support"
 	depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 270f694850ec..98af4ed98f73 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
 dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
 dm-verity-y	+= dm-verity-target.o
 dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
+dm-po2z-y       += dm-po2z-target.o
 
 md-mod-y	+= md.o md-bitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
 obj-$(CONFIG_DM_DUST)		+= dm-dust.o
 obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
+obj-$(CONFIG_DM_PO2Z)		+= dm-po2z.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
diff --git a/drivers/md/dm-po2z-target.c b/drivers/md/dm-po2z-target.c
new file mode 100644
index 000000000000..e9c5b7d00eda
--- /dev/null
+++ b/drivers/md/dm-po2z-target.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Samsung Electronics Co., Ltd.
+ */
+
+#include <linux/device-mapper.h>
+
+#define DM_MSG_PREFIX "po2z"
+
+struct dm_po2z_target {
+	struct dm_dev *dev;
+	sector_t zone_size; /* Actual zone size of the underlying dev*/
+	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
+	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
+	u32 nr_zones;
+};
+
+static inline u32 npo2_zone_no(struct dm_po2z_target *dmh, sector_t sect)
+{
+	return div64_u64(sect, dmh->zone_size);
+}
+
+static inline u32 po2_zone_no(struct dm_po2z_target *dmh, sector_t sect)
+{
+	return sect >> ilog2(dmh->zone_size_po2);
+}
+
+static inline sector_t target_to_device_sect(struct dm_po2z_target *dmh,
+					     sector_t sect)
+{
+	u32 zone_idx = po2_zone_no(dmh, sect);
+
+	return sect - (zone_idx * dmh->zone_size_diff);
+}
+
+static inline sector_t device_to_target_sect(struct dm_po2z_target *dmh,
+					     sector_t sect)
+{
+	u32 zone_idx = npo2_zone_no(dmh, sect);
+
+	return sect + (zone_idx * dmh->zone_size_diff);
+}
+
+/*
+ * This target works on the complete zoned device. Partial mapping is not
+ * supported.
+ * Construct a zoned po2 logical device: <dev-path>
+ */
+static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dm_po2z_target *dmh = NULL;
+	int ret;
+	sector_t zone_size;
+	sector_t dev_capacity;
+
+	if (argc < 1)
+		return -EINVAL;
+
+	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
+	if (!dmh)
+		return -ENOMEM;
+
+	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+			    &dmh->dev);
+
+	if (ret) {
+		ti->error = "Device lookup failed";
+		return ret;
+	}
+
+	zone_size = bdev_zone_sectors(dmh->dev->bdev);
+	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
+	if (ti->len != dev_capacity || ti->begin) {
+		DMERR("%pg Partial mapping of the target not supported",
+		      dmh->dev->bdev);
+		return -EINVAL;
+	}
+
+	if (is_power_of_2(zone_size)) {
+		DMERR("%pg: this target is not useful for power-of-2 zoned devices",
+		      dmh->dev->bdev);
+		return -EINVAL;
+	}
+
+	dmh->zone_size = zone_size;
+	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
+	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
+	ti->private = dmh;
+	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
+	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
+
+	return 0;
+}
+
+static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+				   void *data)
+{
+	struct dm_report_zones_args *args = data;
+	struct dm_po2z_target *dmh = args->tgt->private;
+
+	zone->start = device_to_target_sect(dmh, zone->start);
+	zone->wp = device_to_target_sect(dmh, zone->wp);
+	zone->len = dmh->zone_size_po2;
+	args->next_sector = zone->start + zone->len;
+
+	return args->orig_cb(zone, args->zone_idx++, args->orig_data);
+}
+
+static int dm_po2z_report_zones(struct dm_target *ti,
+				struct dm_report_zones_args *args,
+				unsigned int nr_zones)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	sector_t sect = po2_zone_no(dmh, args->next_sector) * dmh->zone_size;
+
+	return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
+				   dm_po2z_report_zones_cb, args);
+}
+
+static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
+			  blk_status_t *error)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
+		bio->bi_iter.bi_sector =
+			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
+
+	return DM_ENDIO_DONE;
+}
+
+static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	limits->chunk_sectors = dmh->zone_size_po2;
+}
+
+static bool bio_across_emulated_zone_area(struct dm_po2z_target *dmh,
+					  struct bio *bio)
+{
+	u32 zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
+	sector_t size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+
+	return (bio->bi_iter.bi_sector - (zone_idx * dmh->zone_size_po2) +
+		size) > dmh->zone_size;
+}
+
+static int dm_po2z_map_read(struct dm_po2z_target *dmh, struct bio *bio)
+{
+	sector_t start_sect, size, relative_sect_in_zone, split_io_pos;
+	u32 zone_idx;
+
+	/*
+	 * Read does not extend into the emulated zone area (area between
+	 * zone capacity and zone size)
+	 */
+	if (!bio_across_emulated_zone_area(dmh, bio)) {
+		bio->bi_iter.bi_sector =
+			target_to_device_sect(dmh, bio->bi_iter.bi_sector);
+		return DM_MAPIO_REMAPPED;
+	}
+
+	start_sect = bio->bi_iter.bi_sector;
+	size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+	zone_idx = po2_zone_no(dmh, start_sect);
+	relative_sect_in_zone = start_sect - (zone_idx * dmh->zone_size_po2);
+
+	/*
+	 * If the starting sector is in the emulated area then fill
+	 * all the bio with zeros. If bio is across emulated zone boundary,
+	 * split the bio across boundaries and fill zeros only for the
+	 * bio that is between the zone capacity and the zone size.
+	 */
+	if (relative_sect_in_zone < dmh->zone_size &&
+	    ((relative_sect_in_zone + size) > dmh->zone_size)) {
+		split_io_pos = (zone_idx * dmh->zone_size_po2) + dmh->zone_size;
+		dm_accept_partial_bio(bio, split_io_pos - start_sect);
+		bio->bi_iter.bi_sector = target_to_device_sect(dmh, start_sect);
+
+		return DM_MAPIO_REMAPPED;
+	} else if (relative_sect_in_zone >= dmh->zone_size &&
+		   ((relative_sect_in_zone + size) > dmh->zone_size_po2)) {
+		split_io_pos = (zone_idx + 1) * dmh->zone_size_po2;
+		dm_accept_partial_bio(bio, split_io_pos - start_sect);
+	}
+
+	zero_fill_bio(bio);
+	bio_endio(bio);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	bio_set_dev(bio, dmh->dev->bdev);
+	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
+		/*
+		 * Read operation on the emulated zone area (between zone capacity
+		 * and zone size) will fill the bio with zeroes. Any other operation
+		 * in the emulated area should return an error.
+		 */
+		if (bio_op(bio) == REQ_OP_READ)
+			return dm_po2z_map_read(dmh, bio);
+
+		if (!bio_across_emulated_zone_area(dmh, bio)) {
+			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
+								       bio->bi_iter.bi_sector);
+			return DM_MAPIO_REMAPPED;
+		}
+		return DM_MAPIO_KILL;
+	}
+	return DM_MAPIO_REMAPPED;
+}
+
+static int dm_po2z_iterate_devices(struct dm_target *ti,
+				   iterate_devices_callout_fn fn, void *data)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	sector_t len = dmh->nr_zones * dmh->zone_size;
+
+	return fn(ti, dmh->dev, 0, len, data);
+}
+
+static struct target_type dm_po2z_target = {
+	.name = "po2z",
+	.version = { 1, 0, 0 },
+	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
+	.map = dm_po2z_map,
+	.end_io = dm_po2z_end_io,
+	.report_zones = dm_po2z_report_zones,
+	.iterate_devices = dm_po2z_iterate_devices,
+	.module = THIS_MODULE,
+	.io_hints = dm_po2z_io_hints,
+	.ctr = dm_po2z_ctr,
+};
+
+static int __init dm_po2z_init(void)
+{
+	int r = dm_register_target(&dm_po2z_target);
+
+	if (r < 0)
+		DMERR("register failed %d", r);
+
+	return r;
+}
+
+static void __exit dm_po2z_exit(void)
+{
+	dm_unregister_target(&dm_po2z_target);
+}
+
+/* Module hooks */
+module_init(dm_po2z_init);
+module_exit(dm_po2z_exit);
+
+MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
+MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
+MODULE_LICENSE("GPL");
+
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 534fddfc2b42..d77feff124af 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1614,13 +1614,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
 	return true;
 }
 
-static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
+/*
+ * Callback function to check for device zone sector across devices. If the
+ * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
+ * should have the same zone sector as the underlying devices.
+ */
+static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
 					   sector_t start, sector_t len, void *data)
 {
 	unsigned int *zone_sectors = data;
 
-	if (!bdev_is_zoned(dev->bdev))
+	if (!bdev_is_zoned(dev->bdev) ||
+	    dm_target_supports_emulated_zones(ti->type))
 		return 0;
+
 	return bdev_zone_sectors(dev->bdev) != *zone_sectors;
 }
 
@@ -1646,7 +1653,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
 	if (!zone_sectors)
 		return -EINVAL;
 
-	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
+	if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
 		DMERR("%s: zone sectors is not consistent across all zoned devices",
 		      dm_device_name(t->md));
 		return -EINVAL;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 31c16aafdbfc..2b6b3883471f 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -302,6 +302,7 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 	if (dm_table_supports_zone_append(t)) {
 		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
 		dm_cleanup_zoned_dev(md);
+
 		return 0;
 	}
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 920085dd7f3b..7dbd28b8de01 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -294,6 +294,15 @@ struct target_type {
 #define dm_target_supports_mixed_zoned_model(type) (false)
 #endif
 
+#ifdef CONFIG_BLK_DEV_ZONED
+#define DM_TARGET_EMULATED_ZONES	0x00000400
+#define dm_target_supports_emulated_zones(type) \
+	((type)->features & DM_TARGET_EMULATED_ZONES)
+#else
+#define DM_TARGET_EMULATED_ZONES	0x00000000
+#define dm_target_supports_emulated_zones(type) (false)
+#endif
+
 struct dm_target {
 	struct dm_table *table;
 	struct target_type *type;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [dm-devel] [PATCH v8 11/11] dm: add power-of-2 zoned target for non-power-of-2 zoned devices
@ 2022-07-27 16:22       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-27 16:22 UTC (permalink / raw)
  To: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Pankaj Raghav, Damien Le Moal, bvanassche, pankydev8, gost.dev,
	linux-kernel, linux-nvme, linux-block, dm-devel,
	Johannes Thumshirn, jaegeuk, matias.bjorling

Only power-of-2(po2) zoned devices were supported in linux but now non
power-of-2(npo2) zoned device support has been added to the block layer.

Filesystems such as F2FS and btrfs have support for zoned devices with
po2 zone size assumption. Before adding native support for npo2 zoned
devices, it was suggested to create a dm target for npo2 zoned device to
appear as a po2 target so that file systems can initially work without any
explicit changes by using this target.

The design of this target is very simple: introduce gaps between the
underlying device's zone size and the nearest po2 device zone size.
Device's actual zone size becomes the zone capacity of the target and
device's nearest po2 zone size becomes the target's zone size.

For e.g., a device with a zone size/capacity of 3M will have an equivalent
target layout as follows:

Device layout :-
zone capacity = 3M
zone size = 3M

|--------------|-------------|
0             3M            6M

Target layout :-
zone capacity=3M
zone size = 4M

|--------------|---|--------------|---|
0             3M  4M             7M  8M

All IOs will be remapped from target to the actual device location.

The area between target's zone capacity and zone size will be emulated
in the target.
The read IOs that fall in the emulated gap area will return 0 filled
bio and all the other IOs in that area will result in an error.
If a read IO span across the emulated area boundary, then the IOs are
split across them. All other IO operations that span across the emulated
area boundary will result in an error.

The target can be easily created as follows:
dmsetup create <label> --table '0 <size_sects> po2z /dev/nvme<id>'

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
Suggested-by: Hannes Reinecke <hare@suse.de>
---
 drivers/md/Kconfig            |   9 ++
 drivers/md/Makefile           |   2 +
 drivers/md/dm-po2z-target.c   | 261 ++++++++++++++++++++++++++++++++++
 drivers/md/dm-table.c         |  13 +-
 drivers/md/dm-zone.c          |   1 +
 include/linux/device-mapper.h |   9 ++
 6 files changed, 292 insertions(+), 3 deletions(-)
 create mode 100644 drivers/md/dm-po2z-target.c

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 998a5cfdbc4e..d58ccfee765b 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -518,6 +518,15 @@ config DM_FLAKEY
 	help
 	 A target that intermittently fails I/O for debugging purposes.
 
+config DM_PO2Z
+	tristate "Power-of-2 target for non power-of-2 zoned devices"
+	depends on BLK_DEV_DM
+	depends on BLK_DEV_ZONED
+	help
+	  A target that converts a zoned device with non power-of-2 zone size to
+	  be power of 2. This is done by introducing gaps in between the zone
+	  capacity and the power of 2 zone size.
+
 config DM_VERITY
 	tristate "Verity target support"
 	depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 270f694850ec..98af4ed98f73 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
 dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
 dm-verity-y	+= dm-verity-target.o
 dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
+dm-po2z-y       += dm-po2z-target.o
 
 md-mod-y	+= md.o md-bitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
 obj-$(CONFIG_DM_DUST)		+= dm-dust.o
 obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
+obj-$(CONFIG_DM_PO2Z)		+= dm-po2z.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
diff --git a/drivers/md/dm-po2z-target.c b/drivers/md/dm-po2z-target.c
new file mode 100644
index 000000000000..e9c5b7d00eda
--- /dev/null
+++ b/drivers/md/dm-po2z-target.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Samsung Electronics Co., Ltd.
+ */
+
+#include <linux/device-mapper.h>
+
+#define DM_MSG_PREFIX "po2z"
+
+struct dm_po2z_target {
+	struct dm_dev *dev;
+	sector_t zone_size; /* Actual zone size of the underlying dev*/
+	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
+	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
+	u32 nr_zones;
+};
+
+static inline u32 npo2_zone_no(struct dm_po2z_target *dmh, sector_t sect)
+{
+	return div64_u64(sect, dmh->zone_size);
+}
+
+static inline u32 po2_zone_no(struct dm_po2z_target *dmh, sector_t sect)
+{
+	return sect >> ilog2(dmh->zone_size_po2);
+}
+
+static inline sector_t target_to_device_sect(struct dm_po2z_target *dmh,
+					     sector_t sect)
+{
+	u32 zone_idx = po2_zone_no(dmh, sect);
+
+	return sect - (zone_idx * dmh->zone_size_diff);
+}
+
+static inline sector_t device_to_target_sect(struct dm_po2z_target *dmh,
+					     sector_t sect)
+{
+	u32 zone_idx = npo2_zone_no(dmh, sect);
+
+	return sect + (zone_idx * dmh->zone_size_diff);
+}
+
+/*
+ * This target works on the complete zoned device. Partial mapping is not
+ * supported.
+ * Construct a zoned po2 logical device: <dev-path>
+ */
+static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dm_po2z_target *dmh = NULL;
+	int ret;
+	sector_t zone_size;
+	sector_t dev_capacity;
+
+	if (argc < 1)
+		return -EINVAL;
+
+	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
+	if (!dmh)
+		return -ENOMEM;
+
+	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+			    &dmh->dev);
+
+	if (ret) {
+		ti->error = "Device lookup failed";
+		return ret;
+	}
+
+	zone_size = bdev_zone_sectors(dmh->dev->bdev);
+	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
+	if (ti->len != dev_capacity || ti->begin) {
+		DMERR("%pg Partial mapping of the target not supported",
+		      dmh->dev->bdev);
+		return -EINVAL;
+	}
+
+	if (is_power_of_2(zone_size)) {
+		DMERR("%pg: this target is not useful for power-of-2 zoned devices",
+		      dmh->dev->bdev);
+		return -EINVAL;
+	}
+
+	dmh->zone_size = zone_size;
+	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
+	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
+	ti->private = dmh;
+	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
+	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
+
+	return 0;
+}
+
+static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+				   void *data)
+{
+	struct dm_report_zones_args *args = data;
+	struct dm_po2z_target *dmh = args->tgt->private;
+
+	zone->start = device_to_target_sect(dmh, zone->start);
+	zone->wp = device_to_target_sect(dmh, zone->wp);
+	zone->len = dmh->zone_size_po2;
+	args->next_sector = zone->start + zone->len;
+
+	return args->orig_cb(zone, args->zone_idx++, args->orig_data);
+}
+
+static int dm_po2z_report_zones(struct dm_target *ti,
+				struct dm_report_zones_args *args,
+				unsigned int nr_zones)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	sector_t sect = po2_zone_no(dmh, args->next_sector) * dmh->zone_size;
+
+	return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
+				   dm_po2z_report_zones_cb, args);
+}
+
+static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
+			  blk_status_t *error)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
+		bio->bi_iter.bi_sector =
+			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
+
+	return DM_ENDIO_DONE;
+}
+
+static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	limits->chunk_sectors = dmh->zone_size_po2;
+}
+
+static bool bio_across_emulated_zone_area(struct dm_po2z_target *dmh,
+					  struct bio *bio)
+{
+	u32 zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
+	sector_t size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+
+	return (bio->bi_iter.bi_sector - (zone_idx * dmh->zone_size_po2) +
+		size) > dmh->zone_size;
+}
+
+static int dm_po2z_map_read(struct dm_po2z_target *dmh, struct bio *bio)
+{
+	sector_t start_sect, size, relative_sect_in_zone, split_io_pos;
+	u32 zone_idx;
+
+	/*
+	 * Read does not extend into the emulated zone area (area between
+	 * zone capacity and zone size)
+	 */
+	if (!bio_across_emulated_zone_area(dmh, bio)) {
+		bio->bi_iter.bi_sector =
+			target_to_device_sect(dmh, bio->bi_iter.bi_sector);
+		return DM_MAPIO_REMAPPED;
+	}
+
+	start_sect = bio->bi_iter.bi_sector;
+	size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+	zone_idx = po2_zone_no(dmh, start_sect);
+	relative_sect_in_zone = start_sect - (zone_idx * dmh->zone_size_po2);
+
+	/*
+	 * If the starting sector is in the emulated area then fill
+	 * all the bio with zeros. If bio is across emulated zone boundary,
+	 * split the bio across boundaries and fill zeros only for the
+	 * bio that is between the zone capacity and the zone size.
+	 */
+	if (relative_sect_in_zone < dmh->zone_size &&
+	    ((relative_sect_in_zone + size) > dmh->zone_size)) {
+		split_io_pos = (zone_idx * dmh->zone_size_po2) + dmh->zone_size;
+		dm_accept_partial_bio(bio, split_io_pos - start_sect);
+		bio->bi_iter.bi_sector = target_to_device_sect(dmh, start_sect);
+
+		return DM_MAPIO_REMAPPED;
+	} else if (relative_sect_in_zone >= dmh->zone_size &&
+		   ((relative_sect_in_zone + size) > dmh->zone_size_po2)) {
+		split_io_pos = (zone_idx + 1) * dmh->zone_size_po2;
+		dm_accept_partial_bio(bio, split_io_pos - start_sect);
+	}
+
+	zero_fill_bio(bio);
+	bio_endio(bio);
+	return DM_MAPIO_SUBMITTED;
+}
+
+static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	bio_set_dev(bio, dmh->dev->bdev);
+	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
+		/*
+		 * Read operation on the emulated zone area (between zone capacity
+		 * and zone size) will fill the bio with zeroes. Any other operation
+		 * in the emulated area should return an error.
+		 */
+		if (bio_op(bio) == REQ_OP_READ)
+			return dm_po2z_map_read(dmh, bio);
+
+		if (!bio_across_emulated_zone_area(dmh, bio)) {
+			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
+								       bio->bi_iter.bi_sector);
+			return DM_MAPIO_REMAPPED;
+		}
+		return DM_MAPIO_KILL;
+	}
+	return DM_MAPIO_REMAPPED;
+}
+
+static int dm_po2z_iterate_devices(struct dm_target *ti,
+				   iterate_devices_callout_fn fn, void *data)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	sector_t len = dmh->nr_zones * dmh->zone_size;
+
+	return fn(ti, dmh->dev, 0, len, data);
+}
+
+static struct target_type dm_po2z_target = {
+	.name = "po2z",
+	.version = { 1, 0, 0 },
+	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
+	.map = dm_po2z_map,
+	.end_io = dm_po2z_end_io,
+	.report_zones = dm_po2z_report_zones,
+	.iterate_devices = dm_po2z_iterate_devices,
+	.module = THIS_MODULE,
+	.io_hints = dm_po2z_io_hints,
+	.ctr = dm_po2z_ctr,
+};
+
+static int __init dm_po2z_init(void)
+{
+	int r = dm_register_target(&dm_po2z_target);
+
+	if (r < 0)
+		DMERR("register failed %d", r);
+
+	return r;
+}
+
+static void __exit dm_po2z_exit(void)
+{
+	dm_unregister_target(&dm_po2z_target);
+}
+
+/* Module hooks */
+module_init(dm_po2z_init);
+module_exit(dm_po2z_exit);
+
+MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
+MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
+MODULE_LICENSE("GPL");
+
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 534fddfc2b42..d77feff124af 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1614,13 +1614,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
 	return true;
 }
 
-static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
+/*
+ * Callback function to check for device zone sector across devices. If the
+ * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
+ * should have the same zone sector as the underlying devices.
+ */
+static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
 					   sector_t start, sector_t len, void *data)
 {
 	unsigned int *zone_sectors = data;
 
-	if (!bdev_is_zoned(dev->bdev))
+	if (!bdev_is_zoned(dev->bdev) ||
+	    dm_target_supports_emulated_zones(ti->type))
 		return 0;
+
 	return bdev_zone_sectors(dev->bdev) != *zone_sectors;
 }
 
@@ -1646,7 +1653,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
 	if (!zone_sectors)
 		return -EINVAL;
 
-	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
+	if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
 		DMERR("%s: zone sectors is not consistent across all zoned devices",
 		      dm_device_name(t->md));
 		return -EINVAL;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 31c16aafdbfc..2b6b3883471f 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -302,6 +302,7 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
 	if (dm_table_supports_zone_append(t)) {
 		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
 		dm_cleanup_zoned_dev(md);
+
 		return 0;
 	}
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 920085dd7f3b..7dbd28b8de01 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -294,6 +294,15 @@ struct target_type {
 #define dm_target_supports_mixed_zoned_model(type) (false)
 #endif
 
+#ifdef CONFIG_BLK_DEV_ZONED
+#define DM_TARGET_EMULATED_ZONES	0x00000400
+#define dm_target_supports_emulated_zones(type) \
+	((type)->features & DM_TARGET_EMULATED_ZONES)
+#else
+#define DM_TARGET_EMULATED_ZONES	0x00000000
+#define dm_target_supports_emulated_zones(type) (false)
+#endif
+
 struct dm_target {
 	struct dm_table *table;
 	struct target_type *type;
-- 
2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-27 21:59         ` Chaitanya Kulkarni
  -1 siblings, 0 replies; 80+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-27 21:59 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: snitzer, matias.bjorling, Johannes.Thumshirn, gost.dev,
	linux-kernel, hare, hch, axboe, damien.lemoal, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain


> Performance Measurement:
> 
> Device:
> zone size = 128M, blocksize=4k
> 
> FIO cmd:
> 
> fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
> --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
> --loops=4
> 
> The following results are an average of 4 runs on AMD Ryzen 5 5600X with
> 32GB of RAM:
> 
> Sequential Write:
> 
> x-----------------x---------------------------------x---------------------------------x
> |     IOdepth     |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Sequential read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Random read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Minor variations are noticed in Sequential write with io depth 8 and
> in random read with io depth 16. But overall no noticeable differences
> were noticed

minor variations in with aspect of the performance ?
are these documented somewhere ?

move the large table of performance numbers to the cover letter it looks 
ugly in the git log...



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices
@ 2022-07-27 21:59         ` Chaitanya Kulkarni
  0 siblings, 0 replies; 80+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-27 21:59 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, bvanassche, pankydev8, Johannes.Thumshirn, damien.lemoal,
	snitzer, linux-kernel, linux-nvme, linux-block, dm-devel,
	gost.dev, jaegeuk, hch, matias.bjorling, Luis Chamberlain


> Performance Measurement:
> 
> Device:
> zone size = 128M, blocksize=4k
> 
> FIO cmd:
> 
> fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
> --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
> --loops=4
> 
> The following results are an average of 4 runs on AMD Ryzen 5 5600X with
> 32GB of RAM:
> 
> Sequential Write:
> 
> x-----------------x---------------------------------x---------------------------------x
> |     IOdepth     |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Sequential read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Random read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Minor variations are noticed in Sequential write with io depth 8 and
> in random read with io depth 16. But overall no noticeable differences
> were noticed

minor variations in with aspect of the performance ?
are these documented somewhere ?

move the large table of performance numbers to the cover letter it looks 
ugly in the git log...


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-27 23:16         ` Bart Van Assche
  -1 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-27 23:16 UTC (permalink / raw)
  To: Pankaj Raghav, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme, Luis Chamberlain

On 7/27/22 09:22, Pankaj Raghav wrote:
> Checking if a given sector is aligned to a zone is a common
> operation that is performed for zoned devices. Add
> bdev_is_zone_start helper to check for this instead of opencoding it
> everywhere.

I can't find the bdev_is_zone_start() function in this patch?

> To make this work bdev_get_queue(), bdev_zone_sectors() and
> bdev_is_zoned() are moved earlier without modifications.

Can that change perhaps be isolated into a separate patch?

> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3d286a256d3d..1f7e9a90e198 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -570,7 +570,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>   		return BLK_STS_NOTSUPP;
>   
>   	/* The bio sector must point to the start of a sequential zone */
> -	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
> +	if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
>   	    !bio_zone_is_seq(bio))
>   		return BLK_STS_IOERR;

The bdev_is_zone_start() name seems more clear to me than 
bdev_is_zone_aligned(). Has there already been a discussion about which 
name to use for this function?

> +		/*
> +		 * Non power-of-2 zone size support was added to remove the
> +		 * gap between zone capacity and zone size. Though it is technically
> +		 * possible to have gaps in a non power-of-2 device, Linux requires
> +		 * the zone size to be equal to zone capacity for non power-of-2
> +		 * zoned devices.
> +		 */
> +		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> +			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
> +				disk->disk_name);

Given the severity of this error, shouldn't the zone capacity and length 
be reported in the error message?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
@ 2022-07-27 23:16         ` Bart Van Assche
  0 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-27 23:16 UTC (permalink / raw)
  To: Pankaj Raghav, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling, Luis Chamberlain

On 7/27/22 09:22, Pankaj Raghav wrote:
> Checking if a given sector is aligned to a zone is a common
> operation that is performed for zoned devices. Add
> bdev_is_zone_start helper to check for this instead of opencoding it
> everywhere.

I can't find the bdev_is_zone_start() function in this patch?

> To make this work bdev_get_queue(), bdev_zone_sectors() and
> bdev_is_zoned() are moved earlier without modifications.

Can that change perhaps be isolated into a separate patch?

> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3d286a256d3d..1f7e9a90e198 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -570,7 +570,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>   		return BLK_STS_NOTSUPP;
>   
>   	/* The bio sector must point to the start of a sequential zone */
> -	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
> +	if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
>   	    !bio_zone_is_seq(bio))
>   		return BLK_STS_IOERR;

The bdev_is_zone_start() name seems more clear to me than 
bdev_is_zone_aligned(). Has there already been a discussion about which 
name to use for this function?

> +		/*
> +		 * Non power-of-2 zone size support was added to remove the
> +		 * gap between zone capacity and zone size. Though it is technically
> +		 * possible to have gaps in a non power-of-2 device, Linux requires
> +		 * the zone size to be equal to zone capacity for non power-of-2
> +		 * zoned devices.
> +		 */
> +		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> +			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
> +				disk->disk_name);

Given the severity of this error, shouldn't the zone capacity and length 
be reported in the error message?

Thanks,

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 00/11] support non power of 2 zoned device
  2022-07-27 16:22   ` [dm-devel] " Pankaj Raghav
@ 2022-07-27 23:19     ` Bart Van Assche
  -1 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-27 23:19 UTC (permalink / raw)
  To: Pankaj Raghav, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme

On 7/27/22 09:22, Pankaj Raghav wrote:
> This series adds support to npo2 zoned devices in the block and nvme
> layer and a new **dm target** is added: dm-po2z-target. This new
> target will be initially used for filesystems such as btrfs and
> f2fs that does not have native npo2 zone support.

Should any SCSI changes be included in this patch series? From sd_zbc.c:

	if (!is_power_of_2(zone_blocks)) {
		sd_printk(KERN_ERR, sdkp,
			  "Zone size %llu is not a power of two.\n",
			  zone_blocks);
		return -EINVAL;
	}

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
@ 2022-07-27 23:19     ` Bart Van Assche
  0 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-27 23:19 UTC (permalink / raw)
  To: Pankaj Raghav, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling

On 7/27/22 09:22, Pankaj Raghav wrote:
> This series adds support to npo2 zoned devices in the block and nvme
> layer and a new **dm target** is added: dm-po2z-target. This new
> target will be initially used for filesystems such as btrfs and
> f2fs that does not have native npo2 zone support.

Should any SCSI changes be included in this patch series? From sd_zbc.c:

	if (!is_power_of_2(zone_blocks)) {
		sd_printk(KERN_ERR, sdkp,
			  "Zone size %llu is not a power of two.\n",
			  zone_blocks);
		return -EINVAL;
	}

Thanks,

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 00/11] support non power of 2 zoned device
  2022-07-27 23:19     ` [dm-devel] " Bart Van Assche
@ 2022-07-28  1:52       ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  1:52 UTC (permalink / raw)
  To: Bart Van Assche, Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme

On 7/28/22 08:19, Bart Van Assche wrote:
> On 7/27/22 09:22, Pankaj Raghav wrote:
>> This series adds support to npo2 zoned devices in the block and nvme
>> layer and a new **dm target** is added: dm-po2z-target. This new
>> target will be initially used for filesystems such as btrfs and
>> f2fs that does not have native npo2 zone support.
> 
> Should any SCSI changes be included in this patch series? From sd_zbc.c:
> 
> 	if (!is_power_of_2(zone_blocks)) {
> 		sd_printk(KERN_ERR, sdkp,
> 			  "Zone size %llu is not a power of two.\n",
> 			  zone_blocks);
> 		return -EINVAL;
> 	}

There are no non-power of 2 SMR drives on the market and no plans to have
any as far as I know. Users want power of 2 zone size. So I think it is
better to leave sd_zbc & scsi_debug as is for now.

> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
@ 2022-07-28  1:52       ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  1:52 UTC (permalink / raw)
  To: Bart Van Assche, Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling

On 7/28/22 08:19, Bart Van Assche wrote:
> On 7/27/22 09:22, Pankaj Raghav wrote:
>> This series adds support to npo2 zoned devices in the block and nvme
>> layer and a new **dm target** is added: dm-po2z-target. This new
>> target will be initially used for filesystems such as btrfs and
>> f2fs that does not have native npo2 zone support.
> 
> Should any SCSI changes be included in this patch series? From sd_zbc.c:
> 
> 	if (!is_power_of_2(zone_blocks)) {
> 		sd_printk(KERN_ERR, sdkp,
> 			  "Zone size %llu is not a power of two.\n",
> 			  zone_blocks);
> 		return -EINVAL;
> 	}

There are no non-power of 2 SMR drives on the market and no plans to have
any as far as I know. Users want power of 2 zone size. So I think it is
better to leave sd_zbc & scsi_debug as is for now.

> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 01/11] block: make bdev_nr_zones and disk_zone_no generic for npo2 zsze
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  2:54         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  2:54 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain, Adam Manzanares

On 7/28/22 01:22, Pankaj Raghav wrote:
> Adapt bdev_nr_zones and disk_zone_no function so that it can

s/function/functions
s/it/they

> also work for non-power-of-2 zone sizes.
> 
> As the existing deployments of zoned devices had power-of-2

had power-of-2 assumption -> assume that a device zone size is a power of
2 number of sectors

Existing deployments still exist, so do not use the past tense please.

> assumption, power-of-2 optimized calculation is kept for those devices.
> 
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-zoned.c      | 13 +++++++++----
>  include/linux/blkdev.h |  8 +++++++-
>  2 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index a264621d4905..dce9c95b4bcd 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -111,17 +111,22 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
>   * bdev_nr_zones - Get number of zones
>   * @bdev:	Target device
>   *
> - * Return the total number of zones of a zoned block device.  For a block
> - * device without zone capabilities, the number of zones is always 0.
> + * Return the total number of zones of a zoned block device, including the
> + * eventual small last zone if present. For a block device without zone
> + * capabilities, the number of zones is always 0.
>   */
>  unsigned int bdev_nr_zones(struct block_device *bdev)
>  {
>  	sector_t zone_sectors = bdev_zone_sectors(bdev);
> +	sector_t capacity = bdev_nr_sectors(bdev);
>  
>  	if (!bdev_is_zoned(bdev))
>  		return 0;
> -	return (bdev_nr_sectors(bdev) + zone_sectors - 1) >>
> -		ilog2(zone_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
> +
> +	return DIV_ROUND_UP_SECTOR_T(capacity, zone_sectors);
>  }
>  EXPORT_SYMBOL_GPL(bdev_nr_zones);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index dccdf1551c62..85b832908f28 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -673,9 +673,15 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
>  
>  static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>  {
> +	sector_t zone_sectors = disk->queue->limits.chunk_sectors;
> +
>  	if (!blk_queue_is_zoned(disk->queue))
>  		return 0;
> -	return sector >> ilog2(disk->queue->limits.chunk_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return sector >> ilog2(zone_sectors);
> +
> +	return div64_u64(sector, zone_sectors);
>  }
>  
>  static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 01/11] block: make bdev_nr_zones and disk_zone_no generic for npo2 zsze
@ 2022-07-28  2:54         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  2:54 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, Adam Manzanares, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> Adapt bdev_nr_zones and disk_zone_no function so that it can

s/function/functions
s/it/they

> also work for non-power-of-2 zone sizes.
> 
> As the existing deployments of zoned devices had power-of-2

had power-of-2 assumption -> assume that a device zone size is a power of
2 number of sectors

Existing deployments still exist, so do not use the past tense please.

> assumption, power-of-2 optimized calculation is kept for those devices.
> 
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-zoned.c      | 13 +++++++++----
>  include/linux/blkdev.h |  8 +++++++-
>  2 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index a264621d4905..dce9c95b4bcd 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -111,17 +111,22 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
>   * bdev_nr_zones - Get number of zones
>   * @bdev:	Target device
>   *
> - * Return the total number of zones of a zoned block device.  For a block
> - * device without zone capabilities, the number of zones is always 0.
> + * Return the total number of zones of a zoned block device, including the
> + * eventual small last zone if present. For a block device without zone
> + * capabilities, the number of zones is always 0.
>   */
>  unsigned int bdev_nr_zones(struct block_device *bdev)
>  {
>  	sector_t zone_sectors = bdev_zone_sectors(bdev);
> +	sector_t capacity = bdev_nr_sectors(bdev);
>  
>  	if (!bdev_is_zoned(bdev))
>  		return 0;
> -	return (bdev_nr_sectors(bdev) + zone_sectors - 1) >>
> -		ilog2(zone_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
> +
> +	return DIV_ROUND_UP_SECTOR_T(capacity, zone_sectors);
>  }
>  EXPORT_SYMBOL_GPL(bdev_nr_zones);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index dccdf1551c62..85b832908f28 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -673,9 +673,15 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
>  
>  static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>  {
> +	sector_t zone_sectors = disk->queue->limits.chunk_sectors;
> +
>  	if (!blk_queue_is_zoned(disk->queue))
>  		return 0;
> -	return sector >> ilog2(disk->queue->limits.chunk_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return sector >> ilog2(zone_sectors);
> +
> +	return div64_u64(sector, zone_sectors);
>  }
>  
>  static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 00/11] support non power of 2 zoned device
  2022-07-28  1:52       ` [dm-devel] " Damien Le Moal
@ 2022-07-28  2:58         ` Bart Van Assche
  -1 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-28  2:58 UTC (permalink / raw)
  To: Damien Le Moal, Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme

On 7/27/22 18:52, Damien Le Moal wrote:
> On 7/28/22 08:19, Bart Van Assche wrote:
>> On 7/27/22 09:22, Pankaj Raghav wrote:
>>> This series adds support to npo2 zoned devices in the block and nvme
>>> layer and a new **dm target** is added: dm-po2z-target. This new
>>> target will be initially used for filesystems such as btrfs and
>>> f2fs that does not have native npo2 zone support.
>>
>> Should any SCSI changes be included in this patch series? From sd_zbc.c:
>>
>> 	if (!is_power_of_2(zone_blocks)) {
>> 		sd_printk(KERN_ERR, sdkp,
>> 			  "Zone size %llu is not a power of two.\n",
>> 			  zone_blocks);
>> 		return -EINVAL;
>> 	}
> 
> There are no non-power of 2 SMR drives on the market and no plans to have
> any as far as I know. Users want power of 2 zone size. So I think it is
> better to leave sd_zbc & scsi_debug as is for now.

Zoned UFS devices will support ZBC and may have a zone size that is not 
a power of two.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
@ 2022-07-28  2:58         ` Bart Van Assche
  0 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-28  2:58 UTC (permalink / raw)
  To: Damien Le Moal, Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling

On 7/27/22 18:52, Damien Le Moal wrote:
> On 7/28/22 08:19, Bart Van Assche wrote:
>> On 7/27/22 09:22, Pankaj Raghav wrote:
>>> This series adds support to npo2 zoned devices in the block and nvme
>>> layer and a new **dm target** is added: dm-po2z-target. This new
>>> target will be initially used for filesystems such as btrfs and
>>> f2fs that does not have native npo2 zone support.
>>
>> Should any SCSI changes be included in this patch series? From sd_zbc.c:
>>
>> 	if (!is_power_of_2(zone_blocks)) {
>> 		sd_printk(KERN_ERR, sdkp,
>> 			  "Zone size %llu is not a power of two.\n",
>> 			  zone_blocks);
>> 		return -EINVAL;
>> 	}
> 
> There are no non-power of 2 SMR drives on the market and no plans to have
> any as far as I know. Users want power of 2 zone size. So I think it is
> better to leave sd_zbc & scsi_debug as is for now.

Zoned UFS devices will support ZBC and may have a zone size that is not 
a power of two.

Thanks,

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  3:07         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:07 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> Checking if a given sector is aligned to a zone is a common
> operation that is performed for zoned devices. Add
> bdev_is_zone_start helper to check for this instead of opencoding it

The patch actually introduces bdev_is_zone_aligned(). I agree with Bart
that bdev_is_zone_start() is a better name.

> everywhere.
> 
> Convert the calculations on zone size to be generic instead of relying on
> power_of_2 based logic in the block layer using the helpers wherever

s/based logic/arithmetics

> possible.
> 
> The only hot path affected by this change for power_of_2 zoned devices
> is in blk_check_zone_append() but bdev_is_zone_start() helper is
> used to optimize the calculation for po2 zone sizes. Note that the append
> path cannot be accessed by direct raw access to the block device but only
> through a filesystem abstraction.

And so what ? What is the point here ?

> 
> Finally, allow non power of 2 zoned devices provided that their zone

Please spell things out clearly: ...allow zoned devices with a zone size
that is not a power of 2 number of sectors...

> capacity and zone size are equal. The main motivation to allow non
> power_of_2 zoned device is to remove the unmapped LBA between zcap and
> zsze for devices that cannot have a power_of_2 zcap.

zcap, zsze are nvme field names. Please phrase these in plain english to
clarify.

> 
> To make this work bdev_get_queue(), bdev_zone_sectors() and
> bdev_is_zoned() are moved earlier without modifications.

"moved earlier" -> declared earlier in xxx.h ?

> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  block/blk-core.c       |  2 +-
>  block/blk-zoned.c      | 24 +++++++++---
>  include/linux/blkdev.h | 84 ++++++++++++++++++++++++++++++------------
>  3 files changed, 79 insertions(+), 31 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3d286a256d3d..1f7e9a90e198 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -570,7 +570,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>  		return BLK_STS_NOTSUPP;
>  
>  	/* The bio sector must point to the start of a sequential zone */
> -	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
> +	if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
>  	    !bio_zone_is_seq(bio))
>  		return BLK_STS_IOERR;
>  
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index dce9c95b4bcd..a01a231ad328 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
>  		return -EINVAL;
>  
>  	/* Check alignment (handle eventual smaller last zone) */
> -	if (sector & (zone_sectors - 1))
> +	if (!bdev_is_zone_aligned(bdev, sector))
>  		return -EINVAL;
>  
> -	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
> +	if (!bdev_is_zone_aligned(bdev, nr_sectors) && end_sector != capacity)
>  		return -EINVAL;
>  
>  	/*
> @@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  	 * smaller last zone.
>  	 */
>  	if (zone->start == 0) {
> -		if (zone->len == 0 || !is_power_of_2(zone->len)) {
> -			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
> -				disk->disk_name, zone->len);
> +		if (zone->len == 0) {
> +			pr_warn("%s: Invalid zone size", disk->disk_name);

You removed the zone size value print, so please update the message to
something like:

pr_warn("%s: Invalid zero zone size", disk->disk_name);

> +			return -ENODEV;
> +		}
> +
> +		/*
> +		 * Non power-of-2 zone size support was added to remove the
> +		 * gap between zone capacity and zone size. Though it is technically
> +		 * possible to have gaps in a non power-of-2 device, Linux requires
> +		 * the zone size to be equal to zone capacity for non power-of-2
> +		 * zoned devices.
> +		 */
> +		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> +			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
> +				disk->disk_name);

As Bart suggested, please print the zone capacity and zone size values.

>  			return -ENODEV;
>  		}
>  
>  		args->zone_sectors = zone->len;
> -		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
> +		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);

		args->nr_zones = disk_zone_no(disk, capacity);

>  	} else if (zone->start + args->zone_sectors < capacity) {
>  		if (zone->len != args->zone_sectors) {
>  			pr_warn("%s: Invalid zoned device with non constant zone size\n",
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 85b832908f28..1be805223026 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -634,6 +634,11 @@ static inline bool queue_is_mq(struct request_queue *q)
>  	return q->mq_ops;
>  }
>  
> +static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
> +{
> +	return bdev->bd_queue;	/* this is never NULL */
> +}
> +
>  #ifdef CONFIG_PM
>  static inline enum rpm_status queue_rpm_status(struct request_queue *q)
>  {
> @@ -665,6 +670,25 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
>  	}
>  }
>  
> +static inline bool bdev_is_zoned(struct block_device *bdev)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (q)
> +		return blk_queue_is_zoned(q);
> +
> +	return false;
> +}
> +
> +static inline sector_t bdev_zone_sectors(struct block_device *bdev)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (!blk_queue_is_zoned(q))
> +		return 0;
> +	return q->limits.chunk_sectors;
> +}
> +
>  #ifdef CONFIG_BLK_DEV_ZONED
>  static inline unsigned int disk_nr_zones(struct gendisk *disk)
>  {
> @@ -684,6 +708,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>  	return div64_u64(sector, zone_sectors);
>  }
>  
> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
> +						   sector_t sec)
> +{
> +	sector_t zone_sectors = bdev_zone_sectors(bdev);
> +	u64 remainder = 0;
> +
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	if (is_power_of_2(zone_sectors))
> +		return sec & (zone_sectors - 1);
> +
> +	div64_u64_rem(sec, zone_sectors, &remainder);
> +	return remainder;
> +}
> +
> +static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
> +{
> +	if (!bdev_is_zoned(bdev))
> +		return false;

This is checked in bdev_offset_from_zone_start(). No need to add it again
here.

> +
> +	return bdev_offset_from_zone_start(bdev, sec) == 0;
> +}
> +
>  static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
>  {
>  	if (!blk_queue_is_zoned(disk->queue))
> @@ -728,6 +776,18 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>  {
>  	return 0;
>  }
> +
> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
> +						   sector_t sec)
> +{
> +	return 0;
> +}

This one is not used when CONFIG_BLK_DEV_ZONED is not set. No need to
define it.

> +
> +static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
> +{
> +	return false;
> +}
> +
>  static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
>  {
>  	return 0;
> @@ -891,11 +951,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
>  int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
>  			unsigned int flags);
>  
> -static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
> -{
> -	return bdev->bd_queue;	/* this is never NULL */
> -}
> -
>  /* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
>  const char *blk_zone_cond_str(enum blk_zone_cond zone_cond);
>  
> @@ -1295,25 +1350,6 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
>  	return BLK_ZONED_NONE;
>  }
>  
> -static inline bool bdev_is_zoned(struct block_device *bdev)
> -{
> -	struct request_queue *q = bdev_get_queue(bdev);
> -
> -	if (q)
> -		return blk_queue_is_zoned(q);
> -
> -	return false;
> -}
> -
> -static inline sector_t bdev_zone_sectors(struct block_device *bdev)
> -{
> -	struct request_queue *q = bdev_get_queue(bdev);
> -
> -	if (!blk_queue_is_zoned(q))
> -		return 0;
> -	return q->limits.chunk_sectors;
> -}
> -
>  static inline int queue_dma_alignment(const struct request_queue *q)
>  {
>  	return q ? q->dma_alignment : 511;


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
@ 2022-07-28  3:07         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:07 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> Checking if a given sector is aligned to a zone is a common
> operation that is performed for zoned devices. Add
> bdev_is_zone_start helper to check for this instead of opencoding it

The patch actually introduces bdev_is_zone_aligned(). I agree with Bart
that bdev_is_zone_start() is a better name.

> everywhere.
> 
> Convert the calculations on zone size to be generic instead of relying on
> power_of_2 based logic in the block layer using the helpers wherever

s/based logic/arithmetics

> possible.
> 
> The only hot path affected by this change for power_of_2 zoned devices
> is in blk_check_zone_append() but bdev_is_zone_start() helper is
> used to optimize the calculation for po2 zone sizes. Note that the append
> path cannot be accessed by direct raw access to the block device but only
> through a filesystem abstraction.

And so what ? What is the point here ?

> 
> Finally, allow non power of 2 zoned devices provided that their zone

Please spell things out clearly: ...allow zoned devices with a zone size
that is not a power of 2 number of sectors...

> capacity and zone size are equal. The main motivation to allow non
> power_of_2 zoned device is to remove the unmapped LBA between zcap and
> zsze for devices that cannot have a power_of_2 zcap.

zcap, zsze are nvme field names. Please phrase these in plain english to
clarify.

> 
> To make this work bdev_get_queue(), bdev_zone_sectors() and
> bdev_is_zoned() are moved earlier without modifications.

"moved earlier" -> declared earlier in xxx.h ?

> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  block/blk-core.c       |  2 +-
>  block/blk-zoned.c      | 24 +++++++++---
>  include/linux/blkdev.h | 84 ++++++++++++++++++++++++++++++------------
>  3 files changed, 79 insertions(+), 31 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3d286a256d3d..1f7e9a90e198 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -570,7 +570,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>  		return BLK_STS_NOTSUPP;
>  
>  	/* The bio sector must point to the start of a sequential zone */
> -	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
> +	if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
>  	    !bio_zone_is_seq(bio))
>  		return BLK_STS_IOERR;
>  
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index dce9c95b4bcd..a01a231ad328 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
>  		return -EINVAL;
>  
>  	/* Check alignment (handle eventual smaller last zone) */
> -	if (sector & (zone_sectors - 1))
> +	if (!bdev_is_zone_aligned(bdev, sector))
>  		return -EINVAL;
>  
> -	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
> +	if (!bdev_is_zone_aligned(bdev, nr_sectors) && end_sector != capacity)
>  		return -EINVAL;
>  
>  	/*
> @@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  	 * smaller last zone.
>  	 */
>  	if (zone->start == 0) {
> -		if (zone->len == 0 || !is_power_of_2(zone->len)) {
> -			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
> -				disk->disk_name, zone->len);
> +		if (zone->len == 0) {
> +			pr_warn("%s: Invalid zone size", disk->disk_name);

You removed the zone size value print, so please update the message to
something like:

pr_warn("%s: Invalid zero zone size", disk->disk_name);

> +			return -ENODEV;
> +		}
> +
> +		/*
> +		 * Non power-of-2 zone size support was added to remove the
> +		 * gap between zone capacity and zone size. Though it is technically
> +		 * possible to have gaps in a non power-of-2 device, Linux requires
> +		 * the zone size to be equal to zone capacity for non power-of-2
> +		 * zoned devices.
> +		 */
> +		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> +			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
> +				disk->disk_name);

As Bart suggested, please print the zone capacity and zone size values.

>  			return -ENODEV;
>  		}
>  
>  		args->zone_sectors = zone->len;
> -		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
> +		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);

		args->nr_zones = disk_zone_no(disk, capacity);

>  	} else if (zone->start + args->zone_sectors < capacity) {
>  		if (zone->len != args->zone_sectors) {
>  			pr_warn("%s: Invalid zoned device with non constant zone size\n",
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 85b832908f28..1be805223026 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -634,6 +634,11 @@ static inline bool queue_is_mq(struct request_queue *q)
>  	return q->mq_ops;
>  }
>  
> +static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
> +{
> +	return bdev->bd_queue;	/* this is never NULL */
> +}
> +
>  #ifdef CONFIG_PM
>  static inline enum rpm_status queue_rpm_status(struct request_queue *q)
>  {
> @@ -665,6 +670,25 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
>  	}
>  }
>  
> +static inline bool bdev_is_zoned(struct block_device *bdev)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (q)
> +		return blk_queue_is_zoned(q);
> +
> +	return false;
> +}
> +
> +static inline sector_t bdev_zone_sectors(struct block_device *bdev)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (!blk_queue_is_zoned(q))
> +		return 0;
> +	return q->limits.chunk_sectors;
> +}
> +
>  #ifdef CONFIG_BLK_DEV_ZONED
>  static inline unsigned int disk_nr_zones(struct gendisk *disk)
>  {
> @@ -684,6 +708,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>  	return div64_u64(sector, zone_sectors);
>  }
>  
> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
> +						   sector_t sec)
> +{
> +	sector_t zone_sectors = bdev_zone_sectors(bdev);
> +	u64 remainder = 0;
> +
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	if (is_power_of_2(zone_sectors))
> +		return sec & (zone_sectors - 1);
> +
> +	div64_u64_rem(sec, zone_sectors, &remainder);
> +	return remainder;
> +}
> +
> +static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
> +{
> +	if (!bdev_is_zoned(bdev))
> +		return false;

This is checked in bdev_offset_from_zone_start(). No need to add it again
here.

> +
> +	return bdev_offset_from_zone_start(bdev, sec) == 0;
> +}
> +
>  static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
>  {
>  	if (!blk_queue_is_zoned(disk->queue))
> @@ -728,6 +776,18 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>  {
>  	return 0;
>  }
> +
> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
> +						   sector_t sec)
> +{
> +	return 0;
> +}

This one is not used when CONFIG_BLK_DEV_ZONED is not set. No need to
define it.

> +
> +static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
> +{
> +	return false;
> +}
> +
>  static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
>  {
>  	return 0;
> @@ -891,11 +951,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
>  int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
>  			unsigned int flags);
>  
> -static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
> -{
> -	return bdev->bd_queue;	/* this is never NULL */
> -}
> -
>  /* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
>  const char *blk_zone_cond_str(enum blk_zone_cond zone_cond);
>  
> @@ -1295,25 +1350,6 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
>  	return BLK_ZONED_NONE;
>  }
>  
> -static inline bool bdev_is_zoned(struct block_device *bdev)
> -{
> -	struct request_queue *q = bdev_get_queue(bdev);
> -
> -	if (q)
> -		return blk_queue_is_zoned(q);
> -
> -	return false;
> -}
> -
> -static inline sector_t bdev_zone_sectors(struct block_device *bdev)
> -{
> -	struct request_queue *q = bdev_get_queue(bdev);
> -
> -	if (!blk_queue_is_zoned(q))
> -		return 0;
> -	return q->limits.chunk_sectors;
> -}
> -
>  static inline int queue_dma_alignment(const struct request_queue *q)
>  {
>  	return q ? q->dma_alignment : 511;


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 04/11] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  3:09         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:09 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> A generic bdev_zone_no() helper is added to calculate zone number for a
> given sector in a block device. This helper internally uses disk_zone_no()
> to find the zone number.
> 
> Use the helper bdev_zone_no() to calculate nr of zones. This let's us
> make modifications to the math if needed in one place and adds now
> support for npo2 zone devices.
> 
> Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/target/zns.c | 3 +--
>  include/linux/blkdev.h    | 5 +++++
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
> index c7ef69f29fe4..662f1a92f39b 100644
> --- a/drivers/nvme/target/zns.c
> +++ b/drivers/nvme/target/zns.c
> @@ -241,8 +241,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
>  {
>  	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
>  
> -	return bdev_nr_zones(req->ns->bdev) -
> -		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
> +	return bdev_nr_zones(req->ns->bdev) - bdev_zone_no(req->ns->bdev, sect);
>  }

This change should go into patch 3. Otherwise, adding patch 3 only will
break the nvme target zone code.

>  
>  static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1be805223026..d1ef9b9552ed 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1350,6 +1350,11 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
>  	return BLK_ZONED_NONE;
>  }
>  
> +static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
> +{
> +	return disk_zone_no(bdev->bd_disk, sec);
> +}
> +

This should go into a prep patch before patch 3.

>  static inline int queue_dma_alignment(const struct request_queue *q)
>  {
>  	return q ? q->dma_alignment : 511;


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 04/11] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
@ 2022-07-28  3:09         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:09 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> A generic bdev_zone_no() helper is added to calculate zone number for a
> given sector in a block device. This helper internally uses disk_zone_no()
> to find the zone number.
> 
> Use the helper bdev_zone_no() to calculate nr of zones. This let's us
> make modifications to the math if needed in one place and adds now
> support for npo2 zone devices.
> 
> Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/target/zns.c | 3 +--
>  include/linux/blkdev.h    | 5 +++++
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
> index c7ef69f29fe4..662f1a92f39b 100644
> --- a/drivers/nvme/target/zns.c
> +++ b/drivers/nvme/target/zns.c
> @@ -241,8 +241,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
>  {
>  	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
>  
> -	return bdev_nr_zones(req->ns->bdev) -
> -		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
> +	return bdev_nr_zones(req->ns->bdev) - bdev_zone_no(req->ns->bdev, sect);
>  }

This change should go into patch 3. Otherwise, adding patch 3 only will
break the nvme target zone code.

>  
>  static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1be805223026..d1ef9b9552ed 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1350,6 +1350,11 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
>  	return BLK_ZONED_NONE;
>  }
>  
> +static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
> +{
> +	return disk_zone_no(bdev->bd_disk, sec);
> +}
> +

This should go into a prep patch before patch 3.

>  static inline int queue_dma_alignment(const struct request_queue *q)
>  {
>  	return q ? q->dma_alignment : 511;


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  3:14         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:14 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> Convert the power of 2 based calculation with zone size to be generic in
> null_zone_no with optimization for power of 2 based zone sizes.
> 
> The nr_zones calculation in null_init_zoned_dev has been replaced with a
> division without special handling for power of 2 based zone sizes as
> this function is called only during the initialization and will not
> invoked in the hot path.
> 
> Performance Measurement:
> 
> Device:
> zone size = 128M, blocksize=4k
> 
> FIO cmd:
> 
> fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
> --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
> --loops=4
> 
> The following results are an average of 4 runs on AMD Ryzen 5 5600X with
> 32GB of RAM:
> 
> Sequential Write:
> 
> x-----------------x---------------------------------x---------------------------------x
> |     IOdepth     |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Sequential read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Random read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Minor variations are noticed in Sequential write with io depth 8 and
> in random read with io depth 16. But overall no noticeable differences
> were noticed
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/block/null_blk/main.c     |  5 ++---
>  drivers/block/null_blk/null_blk.h |  6 ++++++
>  drivers/block/null_blk/zoned.c    | 18 +++++++++++-------
>  3 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index c451c477978f..f1e0605dee94 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -1976,9 +1976,8 @@ static int null_validate_conf(struct nullb_device *dev)
>  	if (dev->queue_mode == NULL_Q_BIO)
>  		dev->mbps = 0;
>  
> -	if (dev->zoned &&
> -	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
> -		pr_err("zone_size must be power-of-two\n");
> +	if (dev->zoned && !dev->zone_size) {
> +		pr_err("Invalid zero zone size\n");
>  		return -EINVAL;
>  	}
>  
> diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
> index 94ff68052b1e..ece6dded9508 100644
> --- a/drivers/block/null_blk/null_blk.h
> +++ b/drivers/block/null_blk/null_blk.h
> @@ -83,6 +83,12 @@ struct nullb_device {
>  	unsigned int imp_close_zone_no;
>  	struct nullb_zone *zones;
>  	sector_t zone_size_sects;
> +	/*
> +	 * zone_size_sects_shift is only useful when the zone size is
> +	 * power of 2. This variable is set to zero when zone size is non
> +	 * power of 2.
> +	 */

Simplify:

	/*
	 * zone_size_sects_shift is used only when the zone size is a
	 * power of 2 number of sectors.
	 */

But personally, I would simply drop this comment. The name is obvious and
the code very clear.

> +	unsigned int zone_size_sects_shift;
>  	bool need_zone_res_mgmt;
>  	spinlock_t zone_res_lock;
>  
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index 55a69e48ef8b..015f6823706c 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -16,7 +16,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
>  
>  static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
>  {
> -	return sect >> ilog2(dev->zone_size_sects);
> +	if (dev->zone_size_sects_shift)
> +		return sect >> dev->zone_size_sects_shift;
> +
> +	return div64_u64(sect, dev->zone_size_sects);
>  }
>  
>  static inline void null_lock_zone_res(struct nullb_device *dev)
> @@ -65,10 +68,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	sector_t sector = 0;
>  	unsigned int i;
>  
> -	if (!is_power_of_2(dev->zone_size)) {
> -		pr_err("zone_size must be power-of-two\n");
> -		return -EINVAL;
> -	}
>  	if (dev->zone_size > dev->size) {
>  		pr_err("Zone size larger than device capacity\n");
>  		return -EINVAL;
> @@ -86,9 +85,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
>  	dev_capacity_sects = mb_to_sects(dev->size);
>  	dev->zone_size_sects = mb_to_sects(dev->zone_size);
> -	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
> -		>> ilog2(dev->zone_size_sects);
>  
> +	if (is_power_of_2(dev->zone_size_sects))
> +		dev->zone_size_sects_shift = ilog2(dev->zone_size_sects);
> +	else
> +		dev->zone_size_sects_shift = 0;
> +
> +	dev->nr_zones =	DIV_ROUND_UP_SECTOR_T(dev_capacity_sects,
> +					      dev->zone_size_sects);
>  	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
>  				    GFP_KERNEL | __GFP_ZERO);
>  	if (!dev->zones)


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices
@ 2022-07-28  3:14         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:14 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> Convert the power of 2 based calculation with zone size to be generic in
> null_zone_no with optimization for power of 2 based zone sizes.
> 
> The nr_zones calculation in null_init_zoned_dev has been replaced with a
> division without special handling for power of 2 based zone sizes as
> this function is called only during the initialization and will not
> invoked in the hot path.
> 
> Performance Measurement:
> 
> Device:
> zone size = 128M, blocksize=4k
> 
> FIO cmd:
> 
> fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
> --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
> --loops=4
> 
> The following results are an average of 4 runs on AMD Ryzen 5 5600X with
> 32GB of RAM:
> 
> Sequential Write:
> 
> x-----------------x---------------------------------x---------------------------------x
> |     IOdepth     |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Sequential read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Random read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Minor variations are noticed in Sequential write with io depth 8 and
> in random read with io depth 16. But overall no noticeable differences
> were noticed
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/block/null_blk/main.c     |  5 ++---
>  drivers/block/null_blk/null_blk.h |  6 ++++++
>  drivers/block/null_blk/zoned.c    | 18 +++++++++++-------
>  3 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index c451c477978f..f1e0605dee94 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -1976,9 +1976,8 @@ static int null_validate_conf(struct nullb_device *dev)
>  	if (dev->queue_mode == NULL_Q_BIO)
>  		dev->mbps = 0;
>  
> -	if (dev->zoned &&
> -	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
> -		pr_err("zone_size must be power-of-two\n");
> +	if (dev->zoned && !dev->zone_size) {
> +		pr_err("Invalid zero zone size\n");
>  		return -EINVAL;
>  	}
>  
> diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
> index 94ff68052b1e..ece6dded9508 100644
> --- a/drivers/block/null_blk/null_blk.h
> +++ b/drivers/block/null_blk/null_blk.h
> @@ -83,6 +83,12 @@ struct nullb_device {
>  	unsigned int imp_close_zone_no;
>  	struct nullb_zone *zones;
>  	sector_t zone_size_sects;
> +	/*
> +	 * zone_size_sects_shift is only useful when the zone size is
> +	 * power of 2. This variable is set to zero when zone size is non
> +	 * power of 2.
> +	 */

Simplify:

	/*
	 * zone_size_sects_shift is used only when the zone size is a
	 * power of 2 number of sectors.
	 */

But personally, I would simply drop this comment. The name is obvious and
the code very clear.

> +	unsigned int zone_size_sects_shift;
>  	bool need_zone_res_mgmt;
>  	spinlock_t zone_res_lock;
>  
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index 55a69e48ef8b..015f6823706c 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -16,7 +16,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
>  
>  static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
>  {
> -	return sect >> ilog2(dev->zone_size_sects);
> +	if (dev->zone_size_sects_shift)
> +		return sect >> dev->zone_size_sects_shift;
> +
> +	return div64_u64(sect, dev->zone_size_sects);
>  }
>  
>  static inline void null_lock_zone_res(struct nullb_device *dev)
> @@ -65,10 +68,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	sector_t sector = 0;
>  	unsigned int i;
>  
> -	if (!is_power_of_2(dev->zone_size)) {
> -		pr_err("zone_size must be power-of-two\n");
> -		return -EINVAL;
> -	}
>  	if (dev->zone_size > dev->size) {
>  		pr_err("Zone size larger than device capacity\n");
>  		return -EINVAL;
> @@ -86,9 +85,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
>  	dev_capacity_sects = mb_to_sects(dev->size);
>  	dev->zone_size_sects = mb_to_sects(dev->zone_size);
> -	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
> -		>> ilog2(dev->zone_size_sects);
>  
> +	if (is_power_of_2(dev->zone_size_sects))
> +		dev->zone_size_sects_shift = ilog2(dev->zone_size_sects);
> +	else
> +		dev->zone_size_sects_shift = 0;
> +
> +	dev->nr_zones =	DIV_ROUND_UP_SECTOR_T(dev_capacity_sects,
> +					      dev->zone_size_sects);
>  	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
>  				    GFP_KERNEL | __GFP_ZERO);
>  	if (!dev->zones)


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 06/11] zonefs: allow non power of 2 zoned devices
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  3:16         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:16 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> The zone size shift variable is useful only if the zone sizes are known
> to be power of 2. Remove that variable and use generic helpers from
> block layer to calculate zone index in zonefs.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  fs/zonefs/super.c  | 6 ++----
>  fs/zonefs/zonefs.h | 1 -
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 860f0b1032c6..e549ef16738c 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -476,10 +476,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
>  {
>  	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>  	struct super_block *sb = inode->i_sb;
> -	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>  	unsigned int noio_flag;
>  	unsigned int nr_zones =
> -		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
> +		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
>  	struct zonefs_ioerr_data err = {
>  		.inode = inode,
>  		.write = write,
> @@ -1401,7 +1400,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
>  	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>  	int ret = 0;
>  
> -	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
> +	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
>  	inode->i_mode = S_IFREG | sbi->s_perm;
>  
>  	zi->i_ztype = type;
> @@ -1776,7 +1775,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
>  	 * interface constraints.
>  	 */
>  	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
> -	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
>  	sbi->s_uid = GLOBAL_ROOT_UID;
>  	sbi->s_gid = GLOBAL_ROOT_GID;
>  	sbi->s_perm = 0640;
> diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
> index 4b3de66c3233..39895195cda6 100644
> --- a/fs/zonefs/zonefs.h
> +++ b/fs/zonefs/zonefs.h
> @@ -177,7 +177,6 @@ struct zonefs_sb_info {
>  	kgid_t			s_gid;
>  	umode_t			s_perm;
>  	uuid_t			s_uuid;
> -	unsigned int		s_zone_sectors_shift;
>  
>  	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
>  


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 06/11] zonefs: allow non power of 2 zoned devices
@ 2022-07-28  3:16         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:16 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> The zone size shift variable is useful only if the zone sizes are known
> to be power of 2. Remove that variable and use generic helpers from
> block layer to calculate zone index in zonefs.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  fs/zonefs/super.c  | 6 ++----
>  fs/zonefs/zonefs.h | 1 -
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 860f0b1032c6..e549ef16738c 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -476,10 +476,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
>  {
>  	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>  	struct super_block *sb = inode->i_sb;
> -	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>  	unsigned int noio_flag;
>  	unsigned int nr_zones =
> -		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
> +		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
>  	struct zonefs_ioerr_data err = {
>  		.inode = inode,
>  		.write = write,
> @@ -1401,7 +1400,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
>  	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>  	int ret = 0;
>  
> -	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
> +	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
>  	inode->i_mode = S_IFREG | sbi->s_perm;
>  
>  	zi->i_ztype = type;
> @@ -1776,7 +1775,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
>  	 * interface constraints.
>  	 */
>  	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
> -	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
>  	sbi->s_uid = GLOBAL_ROOT_UID;
>  	sbi->s_gid = GLOBAL_ROOT_GID;
>  	sbi->s_perm = 0640;
> diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
> index 4b3de66c3233..39895195cda6 100644
> --- a/fs/zonefs/zonefs.h
> +++ b/fs/zonefs/zonefs.h
> @@ -177,7 +177,6 @@ struct zonefs_sb_info {
>  	kgid_t			s_gid;
>  	umode_t			s_perm;
>  	uuid_t			s_uuid;
> -	unsigned int		s_zone_sectors_shift;
>  
>  	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
>  


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  3:18         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:18 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> dm-zoned relies on the assumption that the zone size is a
> power-of-2(po2) and the zone capacity is same as the zone size.
> 
> Ensure only po2 devices can be used as dm-zoned target until a native
> non po2 support is added.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm-zoned-target.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
> index 95b132b52f33..16499b75c5ee 100644
> --- a/drivers/md/dm-zoned-target.c
> +++ b/drivers/md/dm-zoned-target.c
> @@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>  				return -EINVAL;
>  			}
>  			zone_nr_sectors = bdev_zone_sectors(bdev);
> +			if (!is_power_of_2(zone_nr_sectors)) {
> +				ti->error = "Zone size is not power of 2";

Let's clarify:

	ti->error = "Zone size is not a power of 2 number of sectors";

> +				return -EINVAL;
> +			}
>  			zoned_dev->zone_nr_sectors = zone_nr_sectors;
>  			zoned_dev->nr_zones = bdev_nr_zones(bdev);
>  		}
> @@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>  			return -EINVAL;
>  		}
>  		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
> +		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
> +			ti->error = "Zone size is not power of 2";

Same comment.

> +			return -EINVAL;
> +		}
>  		zoned_dev->nr_zones = bdev_nr_zones(bdev);
>  	}
>  


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed
@ 2022-07-28  3:18         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:18 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> dm-zoned relies on the assumption that the zone size is a
> power-of-2(po2) and the zone capacity is same as the zone size.
> 
> Ensure only po2 devices can be used as dm-zoned target until a native
> non po2 support is added.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm-zoned-target.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
> index 95b132b52f33..16499b75c5ee 100644
> --- a/drivers/md/dm-zoned-target.c
> +++ b/drivers/md/dm-zoned-target.c
> @@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>  				return -EINVAL;
>  			}
>  			zone_nr_sectors = bdev_zone_sectors(bdev);
> +			if (!is_power_of_2(zone_nr_sectors)) {
> +				ti->error = "Zone size is not power of 2";

Let's clarify:

	ti->error = "Zone size is not a power of 2 number of sectors";

> +				return -EINVAL;
> +			}
>  			zoned_dev->zone_nr_sectors = zone_nr_sectors;
>  			zoned_dev->nr_zones = bdev_nr_zones(bdev);
>  		}
> @@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>  			return -EINVAL;
>  		}
>  		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
> +		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
> +			ti->error = "Zone size is not power of 2";

Same comment.

> +			return -EINVAL;
> +		}
>  		zoned_dev->nr_zones = bdev_nr_zones(bdev);
>  	}
>  


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 08/11] dm-zone: use generic helpers to calculate offset from zone start
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  3:20         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:20 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> Use the bdev_offset_from_zone_start() helper function to calculate
> the offset from zone start instead of using power of 2 based
> calculation.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  drivers/md/dm-zone.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 3dafc0e8b7a9..31c16aafdbfc 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -390,7 +390,9 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  	case REQ_OP_WRITE_ZEROES:
>  	case REQ_OP_WRITE:
>  		/* Writes must be aligned to the zone write pointer */
> -		if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
> +		if ((bdev_offset_from_zone_start(md->disk->part0,
> +						 clone->bi_iter.bi_sector)) != zwp_offset)
> +
>  			return false;
>  		break;
>  	case REQ_OP_ZONE_APPEND:
> @@ -602,11 +604,8 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
>  		 */
>  		if (clone->bi_status == BLK_STS_OK &&
>  		    bio_op(clone) == REQ_OP_ZONE_APPEND) {
> -			sector_t mask =
> -				(sector_t)bdev_zone_sectors(disk->part0) - 1;
> -
>  			orig_bio->bi_iter.bi_sector +=
> -				clone->bi_iter.bi_sector & mask;
> +				bdev_offset_from_zone_start(disk->part0, clone->bi_iter.bi_sector);
>  		}
>  
>  		return;


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 08/11] dm-zone: use generic helpers to calculate offset from zone start
@ 2022-07-28  3:20         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:20 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 7/28/22 01:22, Pankaj Raghav wrote:
> Use the bdev_offset_from_zone_start() helper function to calculate
> the offset from zone start instead of using power of 2 based
> calculation.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> ---
>  drivers/md/dm-zone.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 3dafc0e8b7a9..31c16aafdbfc 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -390,7 +390,9 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  	case REQ_OP_WRITE_ZEROES:
>  	case REQ_OP_WRITE:
>  		/* Writes must be aligned to the zone write pointer */
> -		if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
> +		if ((bdev_offset_from_zone_start(md->disk->part0,
> +						 clone->bi_iter.bi_sector)) != zwp_offset)
> +
>  			return false;
>  		break;
>  	case REQ_OP_ZONE_APPEND:
> @@ -602,11 +604,8 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
>  		 */
>  		if (clone->bi_status == BLK_STS_OK &&
>  		    bio_op(clone) == REQ_OP_ZONE_APPEND) {
> -			sector_t mask =
> -				(sector_t)bdev_zone_sectors(disk->part0) - 1;
> -
>  			orig_bio->bi_iter.bi_sector +=
> -				clone->bi_iter.bi_sector & mask;
> +				bdev_offset_from_zone_start(disk->part0, clone->bi_iter.bi_sector);
>  		}
>  
>  		return;


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 09/11] dm-table: allow non po2 zoned devices
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  3:22         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:22 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme

On 7/28/22 01:22, Pankaj Raghav wrote:
> As the block layer now supports non po2 zoned devices, allow dm to
> support non po2 zoned device.

Please rephrase "non po2 zoned devices" here and in the title to correctly
refer to the zone size of zoned devices. Because "non po2 zoned devices"
means absolutely nothing. Let's be clear please.

> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm-table.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 332f96b58252..534fddfc2b42 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -250,7 +250,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>  	if (bdev_is_zoned(bdev)) {
>  		unsigned int zone_sectors = bdev_zone_sectors(bdev);
>  
> -		if (start & (zone_sectors - 1)) {
> +		if (!bdev_is_zone_aligned(bdev, start)) {
>  			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
>  			       dm_device_name(ti->table->md),
>  			       (unsigned long long)start,
> @@ -267,7 +267,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>  		 * devices do not end up with a smaller zone in the middle of
>  		 * the sector range.
>  		 */
> -		if (len & (zone_sectors - 1)) {
> +		if (!bdev_is_zone_aligned(bdev, len)) {
>  			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
>  			       dm_device_name(ti->table->md),
>  			       (unsigned long long)len,
> @@ -1642,8 +1642,8 @@ static int validate_hardware_zoned_model(struct dm_table *t,
>  		return -EINVAL;
>  	}
>  
> -	/* Check zone size validity and compatibility */
> -	if (!zone_sectors || !is_power_of_2(zone_sectors))
> +	/* Check zone size validity */

The comment is not super useful now given the trivial test.

> +	if (!zone_sectors)
>  		return -EINVAL;
>  
>  	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 09/11] dm-table: allow non po2 zoned devices
@ 2022-07-28  3:22         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:22 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling

On 7/28/22 01:22, Pankaj Raghav wrote:
> As the block layer now supports non po2 zoned devices, allow dm to
> support non po2 zoned device.

Please rephrase "non po2 zoned devices" here and in the title to correctly
refer to the zone size of zoned devices. Because "non po2 zoned devices"
means absolutely nothing. Let's be clear please.

> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm-table.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 332f96b58252..534fddfc2b42 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -250,7 +250,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>  	if (bdev_is_zoned(bdev)) {
>  		unsigned int zone_sectors = bdev_zone_sectors(bdev);
>  
> -		if (start & (zone_sectors - 1)) {
> +		if (!bdev_is_zone_aligned(bdev, start)) {
>  			DMWARN("%s: start=%llu not aligned to h/w zone size %u of %pg",
>  			       dm_device_name(ti->table->md),
>  			       (unsigned long long)start,
> @@ -267,7 +267,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
>  		 * devices do not end up with a smaller zone in the middle of
>  		 * the sector range.
>  		 */
> -		if (len & (zone_sectors - 1)) {
> +		if (!bdev_is_zone_aligned(bdev, len)) {
>  			DMWARN("%s: len=%llu not aligned to h/w zone size %u of %pg",
>  			       dm_device_name(ti->table->md),
>  			       (unsigned long long)len,
> @@ -1642,8 +1642,8 @@ static int validate_hardware_zoned_model(struct dm_table *t,
>  		return -EINVAL;
>  	}
>  
> -	/* Check zone size validity and compatibility */
> -	if (!zone_sectors || !is_power_of_2(zone_sectors))
> +	/* Check zone size validity */

The comment is not super useful now given the trivial test.

> +	if (!zone_sectors)
>  		return -EINVAL;
>  
>  	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 10/11] dm: call dm_zone_endio after the target endio callback for zoned devices
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  3:25         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:25 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme

On 7/28/22 01:22, Pankaj Raghav wrote:
> dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
> uses either native append or append emulation and it is called before the
> endio of the target. But target endio can still update the clone bio
> after dm_zone_endio is called, thereby, the orig bio does not contain
> the updated information anymore. Call dm_zone_endio for zoned devices
> after calling the target's endio function
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 03ac6143b8aa..bc410ee04004 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1123,10 +1123,6 @@ static void clone_endio(struct bio *bio)
>  			disable_write_zeroes(md);
>  	}
>  
> -	if (static_branch_unlikely(&zoned_enabled) &&
> -	    unlikely(bdev_is_zoned(bio->bi_bdev)))
> -		dm_zone_endio(io, bio);
> -
>  	if (endio) {
>  		int r = endio(ti, bio, &error);
>  		switch (r) {
> @@ -1155,6 +1151,10 @@ static void clone_endio(struct bio *bio)
>  		}
>  	}
>  
> +	if (static_branch_unlikely(&zoned_enabled) &&
> +	    unlikely(bdev_is_zoned(bio->bi_bdev)))
> +		dm_zone_endio(io, bio);
> +
>  	if (static_branch_unlikely(&swap_bios_enabled) &&
>  	    unlikely(swap_bios_limit(ti, bio)))
>  		up(&md->swap_bios_semaphore);

This patch seems completely unrelated to the series topic. Is that a bug
fix ? How do you trigger it ? Our tests do not show any issues here...
If this triggers only with non power of 2 zone size devices, then this
should be squashed in patch 8. And patch 9 could also be squashed with
patch 8 too.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 10/11] dm: call dm_zone_endio after the target endio callback for zoned devices
@ 2022-07-28  3:25         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:25 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling

On 7/28/22 01:22, Pankaj Raghav wrote:
> dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
> uses either native append or append emulation and it is called before the
> endio of the target. But target endio can still update the clone bio
> after dm_zone_endio is called, thereby, the orig bio does not contain
> the updated information anymore. Call dm_zone_endio for zoned devices
> after calling the target's endio function
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 03ac6143b8aa..bc410ee04004 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1123,10 +1123,6 @@ static void clone_endio(struct bio *bio)
>  			disable_write_zeroes(md);
>  	}
>  
> -	if (static_branch_unlikely(&zoned_enabled) &&
> -	    unlikely(bdev_is_zoned(bio->bi_bdev)))
> -		dm_zone_endio(io, bio);
> -
>  	if (endio) {
>  		int r = endio(ti, bio, &error);
>  		switch (r) {
> @@ -1155,6 +1151,10 @@ static void clone_endio(struct bio *bio)
>  		}
>  	}
>  
> +	if (static_branch_unlikely(&zoned_enabled) &&
> +	    unlikely(bdev_is_zoned(bio->bi_bdev)))
> +		dm_zone_endio(io, bio);
> +
>  	if (static_branch_unlikely(&swap_bios_enabled) &&
>  	    unlikely(swap_bios_limit(ti, bio)))
>  		up(&md->swap_bios_semaphore);

This patch seems completely unrelated to the series topic. Is that a bug
fix ? How do you trigger it ? Our tests do not show any issues here...
If this triggers only with non power of 2 zone size devices, then this
should be squashed in patch 8. And patch 9 could also be squashed with
patch 8 too.

-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 00/11] support non power of 2 zoned device
  2022-07-28  2:58         ` [dm-devel] " Bart Van Assche
@ 2022-07-28  3:28           ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:28 UTC (permalink / raw)
  To: Bart Van Assche, Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme

On 7/28/22 11:58, Bart Van Assche wrote:
> On 7/27/22 18:52, Damien Le Moal wrote:
>> On 7/28/22 08:19, Bart Van Assche wrote:
>>> On 7/27/22 09:22, Pankaj Raghav wrote:
>>>> This series adds support to npo2 zoned devices in the block and nvme
>>>> layer and a new **dm target** is added: dm-po2z-target. This new
>>>> target will be initially used for filesystems such as btrfs and
>>>> f2fs that does not have native npo2 zone support.
>>>
>>> Should any SCSI changes be included in this patch series? From sd_zbc.c:
>>>
>>> 	if (!is_power_of_2(zone_blocks)) {
>>> 		sd_printk(KERN_ERR, sdkp,
>>> 			  "Zone size %llu is not a power of two.\n",
>>> 			  zone_blocks);
>>> 		return -EINVAL;
>>> 	}
>>
>> There are no non-power of 2 SMR drives on the market and no plans to have
>> any as far as I know. Users want power of 2 zone size. So I think it is
>> better to leave sd_zbc & scsi_debug as is for now.
> 
> Zoned UFS devices will support ZBC and may have a zone size that is not 
> a power of two.

OK. So the check needs to be removed then and the entire zone append
emulation checked carefully. The divisions for zone no etc on non power of
2 zone size devices in zone append emulation hot path are really not
welcome though.


> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
@ 2022-07-28  3:28           ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  3:28 UTC (permalink / raw)
  To: Bart Van Assche, Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling

On 7/28/22 11:58, Bart Van Assche wrote:
> On 7/27/22 18:52, Damien Le Moal wrote:
>> On 7/28/22 08:19, Bart Van Assche wrote:
>>> On 7/27/22 09:22, Pankaj Raghav wrote:
>>>> This series adds support to npo2 zoned devices in the block and nvme
>>>> layer and a new **dm target** is added: dm-po2z-target. This new
>>>> target will be initially used for filesystems such as btrfs and
>>>> f2fs that does not have native npo2 zone support.
>>>
>>> Should any SCSI changes be included in this patch series? From sd_zbc.c:
>>>
>>> 	if (!is_power_of_2(zone_blocks)) {
>>> 		sd_printk(KERN_ERR, sdkp,
>>> 			  "Zone size %llu is not a power of two.\n",
>>> 			  zone_blocks);
>>> 		return -EINVAL;
>>> 	}
>>
>> There are no non-power of 2 SMR drives on the market and no plans to have
>> any as far as I know. Users want power of 2 zone size. So I think it is
>> better to leave sd_zbc & scsi_debug as is for now.
> 
> Zoned UFS devices will support ZBC and may have a zone size that is not 
> a power of two.

OK. So the check needs to be removed then and the entire zone append
emulation checked carefully. The divisions for zone no etc on non power of
2 zone size devices in zone append emulation hot path are really not
welcome though.


> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 11/11] dm: add power-of-2 zoned target for non-power-of-2 zoned devices
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28  4:33         ` Damien Le Moal
  -1 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  4:33 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Damien Le Moal

On 7/28/22 01:22, Pankaj Raghav wrote:
> Only power-of-2(po2) zoned devices were supported in linux but now non
> power-of-2(npo2) zoned device support has been added to the block layer.

Please rephrase this spelling out "zone size" in the text. As mentioned
before, "power-of-2(po2) zoned devices" is very unclear to people not
familiar with zoned block devices.

> 
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zoned
> devices, it was suggested to create a dm target for npo2 zoned device to
> appear as a po2 target so that file systems can initially work without any
> explicit changes by using this target.
> 
> The design of this target is very simple: introduce gaps between the
> underlying device's zone size and the nearest po2 device zone size.
> Device's actual zone size becomes the zone capacity of the target and
> device's nearest po2 zone size becomes the target's zone size.

The design of this target is very simple: remap the device zone size to
the zone capacity and change the zone size to be the nearest power of 2
numner of sectors.

> 
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
> 
> Device layout :-
> zone capacity = 3M
> zone size = 3M
> 
> |--------------|-------------|
> 0             3M            6M
> 
> Target layout :-
> zone capacity=3M
> zone size = 4M
> 
> |--------------|---|--------------|---|
> 0             3M  4M             7M  8M
> 
> All IOs will be remapped from target to the actual device location.

That is what DM does... Instead of this obvious statement, explain how the
remapping is done.

> 
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
> 
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2z /dev/nvme<id>'

0 -> <start offset> ? Or are you allowing only entire devices ?

> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
> Suggested-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/md/Kconfig            |   9 ++
>  drivers/md/Makefile           |   2 +
>  drivers/md/dm-po2z-target.c   | 261 ++++++++++++++++++++++++++++++++++
>  drivers/md/dm-table.c         |  13 +-
>  drivers/md/dm-zone.c          |   1 +
>  include/linux/device-mapper.h |   9 ++
>  6 files changed, 292 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/md/dm-po2z-target.c
> 
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 998a5cfdbc4e..d58ccfee765b 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -518,6 +518,15 @@ config DM_FLAKEY
>  	help
>  	 A target that intermittently fails I/O for debugging purposes.
>  
> +config DM_PO2Z

Maybe DM_PO2_ZONE ?

or DM_ZONE_PO2SIZE ?

To be clearer...

> +	tristate "Power-of-2 target for non power-of-2 zoned devices"

...for zoned block devices with a zone size that is not a power of 2
number of sectors.

> +	depends on BLK_DEV_DM
> +	depends on BLK_DEV_ZONED
> +	help
> +	  A target that converts a zoned device with non power-of-2 zone size to
> +	  be power of 2. This is done by introducing gaps in between the zone

...to have zones with size equal to a power of 2 n umber of sectors. And
drop the "This done...".
How the target works and how to use it should be documented under
Documentation/admin-guide/device-mapper/

> +	  capacity and the power of 2 zone size.
> +
>  config DM_VERITY
>  	tristate "Verity target support"
>  	depends on BLK_DEV_DM
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 270f694850ec..98af4ed98f73 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
>  dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
>  dm-verity-y	+= dm-verity-target.o
>  dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
> +dm-po2z-y       += dm-po2z-target.o

dm-po2zone ?

Spelling out "zone" makes things clearer I think.

>  
>  md-mod-y	+= md.o md-bitmap.o
>  raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
> @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
>  obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
>  obj-$(CONFIG_DM_DUST)		+= dm-dust.o
>  obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
> +obj-$(CONFIG_DM_PO2Z)		+= dm-po2z.o
>  obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
>  obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
>  obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
> diff --git a/drivers/md/dm-po2z-target.c b/drivers/md/dm-po2z-target.c
> new file mode 100644
> index 000000000000..e9c5b7d00eda
> --- /dev/null
> +++ b/drivers/md/dm-po2z-target.c
> @@ -0,0 +1,261 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Samsung Electronics Co., Ltd.
> + */
> +
> +#include <linux/device-mapper.h>
> +
> +#define DM_MSG_PREFIX "po2z"

s/po2z/po2zone ?

> +
> +struct dm_po2z_target {
> +	struct dm_dev *dev;
> +	sector_t zone_size; /* Actual zone size of the underlying dev*/
> +	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
> +	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
> +	u32 nr_zones;

s/u32/unsigned int. This is not a hw driver.

> +};
> +
> +static inline u32 npo2_zone_no(struct dm_po2z_target *dmh, sector_t sect)
> +{
> +	return div64_u64(sect, dmh->zone_size);
> +}
> +
> +static inline u32 po2_zone_no(struct dm_po2z_target *dmh, sector_t sect)
> +{
> +	return sect >> ilog2(dmh->zone_size_po2);

You could saves the shift as "zone_size_po2_shift" in struct dm_po2z_target.

> +}
> +
> +static inline sector_t target_to_device_sect(struct dm_po2z_target *dmh,
> +					     sector_t sect)
> +{
> +	u32 zone_idx = po2_zone_no(dmh, sect);
> +
> +	return sect - (zone_idx * dmh->zone_size_diff);

The idx variable seems unnecessary. And use unsgined int, not u32.

> +}
> +
> +static inline sector_t device_to_target_sect(struct dm_po2z_target *dmh,
> +					     sector_t sect)
> +{
> +	u32 zone_idx = npo2_zone_no(dmh, sect);
> +
> +	return sect + (zone_idx * dmh->zone_size_diff);

same.

> +}
> +
> +/*
> + * This target works on the complete zoned device. Partial mapping is not
> + * supported.
> + * Construct a zoned po2 logical device: <dev-path>
> + */
> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +	struct dm_po2z_target *dmh = NULL;
> +	int ret;
> +	sector_t zone_size;
> +	sector_t dev_capacity;
> +
> +	if (argc < 1)

	if (argc != 1) ?

> +		return -EINVAL;
> +
> +	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
> +	if (!dmh)
> +		return -ENOMEM;
> +
> +	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +			    &dmh->dev);
> +
> +	if (ret) {
> +		ti->error = "Device lookup failed";
> +		return ret;

You are leaking dmh.

> +	}
> +
> +	zone_size = bdev_zone_sectors(dmh->dev->bdev);
> +	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
> +	if (ti->len != dev_capacity || ti->begin) {
> +		DMERR("%pg Partial mapping of the target not supported",
> +		      dmh->dev->bdev);

dmh is leaked.

> +		return -EINVAL;
> +	}
> +
> +	if (is_power_of_2(zone_size)) {
> +		DMERR("%pg: this target is not useful for power-of-2 zoned devices",
> +		      dmh->dev->bdev);
> +		return -EINVAL;

Same here.

And as suggested before, we could simply warn here but let the target be
created. If used with a power of 2 zone size device, it should still work.

> +	}
> +
> +	dmh->zone_size = zone_size;
> +	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
> +	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
> +	ti->private = dmh;
> +	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
> +	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
> +
> +	return 0;
> +}
> +
> +static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
> +				   void *data)
> +{
> +	struct dm_report_zones_args *args = data;
> +	struct dm_po2z_target *dmh = args->tgt->private;
> +
> +	zone->start = device_to_target_sect(dmh, zone->start);
> +	zone->wp = device_to_target_sect(dmh, zone->wp);

You could simplify this to a single call to device_to_target_sect() by
saving the diff between device zone start and remapped zone start and add
that diff to the wp field.

> +	zone->len = dmh->zone_size_po2;
> +	args->next_sector = zone->start + zone->len;
> +
> +	return args->orig_cb(zone, args->zone_idx++, args->orig_data);
> +}
> +
> +static int dm_po2z_report_zones(struct dm_target *ti,
> +				struct dm_report_zones_args *args,
> +				unsigned int nr_zones)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +	sector_t sect = po2_zone_no(dmh, args->next_sector) * dmh->zone_size;
> +
> +	return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
> +				   dm_po2z_report_zones_cb, args);
> +}
> +
> +static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
> +			  blk_status_t *error)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +
> +	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		bio->bi_iter.bi_sector =
> +			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
> +
> +	return DM_ENDIO_DONE;
> +}
> +
> +static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +
> +	limits->chunk_sectors = dmh->zone_size_po2;
> +}
> +
> +static bool bio_across_emulated_zone_area(struct dm_po2z_target *dmh,
> +					  struct bio *bio)
> +{
> +	u32 zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
> +	sector_t size = bio->bi_iter.bi_size >> SECTOR_SHIFT;

Rename that to nr_sectors to be clear about the unit.

> +
> +	return (bio->bi_iter.bi_sector - (zone_idx * dmh->zone_size_po2) +
> +		size) > dmh->zone_size;

This is hard to read and actually seems very wrong. Shouldn't this be simply:

	return bio->bi_iter.bi_sector + nr_sectors >
		zone_idx * dmh->zone_size_po2 + dmh->zone_size;

Thatis, check that the BIO goes beyond the zone capacity of the target.
?

> +}
> +
> +static int dm_po2z_map_read(struct dm_po2z_target *dmh, struct bio *bio)
> +{
> +	sector_t start_sect, size, relative_sect_in_zone, split_io_pos;
> +	u32 zone_idx;
> +
> +	/*
> +	 * Read does not extend into the emulated zone area (area between
> +	 * zone capacity and zone size)
> +	 */
> +	if (!bio_across_emulated_zone_area(dmh, bio)) {

hu... why not use dm_accept_partial_bio() in dm_po2z_map() to never have
to deal with this here ? That is, dm_po2z_map_read() is called for any
read BIO fully within the zone capacity. And create a
dm_po2z_map_read_zero() helper for any read BIO that is after the zone
capacity. Way simpler I think.

> +		bio->bi_iter.bi_sector =
> +			target_to_device_sect(dmh, bio->bi_iter.bi_sector);
> +		return DM_MAPIO_REMAPPED;
> +	}
> +
> +	start_sect = bio->bi_iter.bi_sector;
> +	size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> +	zone_idx = po2_zone_no(dmh, start_sect);
> +	relative_sect_in_zone = start_sect - (zone_idx * dmh->zone_size_po2);
> +
> +	/*
> +	 * If the starting sector is in the emulated area then fill
> +	 * all the bio with zeros. If bio is across emulated zone boundary,
> +	 * split the bio across boundaries and fill zeros only for the
> +	 * bio that is between the zone capacity and the zone size.
> +	 */
> +	if (relative_sect_in_zone < dmh->zone_size &&
> +	    ((relative_sect_in_zone + size) > dmh->zone_size)) {
> +		split_io_pos = (zone_idx * dmh->zone_size_po2) + dmh->zone_size;
> +		dm_accept_partial_bio(bio, split_io_pos - start_sect);

See above. Do that in dm_po2z_map(). This will simplify this function.

> +		bio->bi_iter.bi_sector = target_to_device_sect(dmh, start_sect);
> +
> +		return DM_MAPIO_REMAPPED;
> +	} else if (relative_sect_in_zone >= dmh->zone_size &&
> +		   ((relative_sect_in_zone + size) > dmh->zone_size_po2)) {

No need for the else after return. And this can NEVER happen since BIOs
can never cross zone boundaries.

> +		split_io_pos = (zone_idx + 1) * dmh->zone_size_po2;
> +		dm_accept_partial_bio(bio, split_io_pos - start_sect);
> +	}
> +
> +	zero_fill_bio(bio);
> +	bio_endio(bio);
> +	return DM_MAPIO_SUBMITTED;
> +}
> +
> +static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +
> +	bio_set_dev(bio, dmh->dev->bdev);
> +	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
> +		/*
> +		 * Read operation on the emulated zone area (between zone capacity
> +		 * and zone size) will fill the bio with zeroes. Any other operation
> +		 * in the emulated area should return an error.
> +		 */
> +		if (bio_op(bio) == REQ_OP_READ)
> +			return dm_po2z_map_read(dmh, bio);
> +
> +		if (!bio_across_emulated_zone_area(dmh, bio)) {
> +			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
> +								       bio->bi_iter.bi_sector);
> +			return DM_MAPIO_REMAPPED;
> +		}
> +		return DM_MAPIO_KILL;
> +	}
> +	return DM_MAPIO_REMAPPED;
> +}
> +
> +static int dm_po2z_iterate_devices(struct dm_target *ti,
> +				   iterate_devices_callout_fn fn, void *data)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +	sector_t len = dmh->nr_zones * dmh->zone_size;
> +
> +	return fn(ti, dmh->dev, 0, len, data);
> +}
> +
> +static struct target_type dm_po2z_target = {
> +	.name = "po2z",
> +	.version = { 1, 0, 0 },
> +	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
> +	.map = dm_po2z_map,
> +	.end_io = dm_po2z_end_io,
> +	.report_zones = dm_po2z_report_zones,
> +	.iterate_devices = dm_po2z_iterate_devices,
> +	.module = THIS_MODULE,
> +	.io_hints = dm_po2z_io_hints,
> +	.ctr = dm_po2z_ctr,
> +};
> +
> +static int __init dm_po2z_init(void)
> +{
> +	int r = dm_register_target(&dm_po2z_target);
> +
> +	if (r < 0)
> +		DMERR("register failed %d", r);
> +
> +	return r;

Simplify:

	return dm_register_target(&dm_po2z_target);

> +}
> +
> +static void __exit dm_po2z_exit(void)
> +{
> +	dm_unregister_target(&dm_po2z_target);
> +}
> +
> +/* Module hooks */
> +module_init(dm_po2z_init);
> +module_exit(dm_po2z_exit);
> +
> +MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
> +MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
> +MODULE_LICENSE("GPL");
> +

the changes below should be a different prep patch.

> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 534fddfc2b42..d77feff124af 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1614,13 +1614,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
>  	return true;
>  }
>  
> -static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
> +/*
> + * Callback function to check for device zone sector across devices. If the
> + * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
> + * should have the same zone sector as the underlying devices.
> + */
> +static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
>  					   sector_t start, sector_t len, void *data)
>  {
>  	unsigned int *zone_sectors = data;
>  
> -	if (!bdev_is_zoned(dev->bdev))
> +	if (!bdev_is_zoned(dev->bdev) ||
> +	    dm_target_supports_emulated_zones(ti->type))
>  		return 0;
> +
>  	return bdev_zone_sectors(dev->bdev) != *zone_sectors;
>  }
>  
> @@ -1646,7 +1653,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
>  	if (!zone_sectors)
>  		return -EINVAL;
>  
> -	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
> +	if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
>  		DMERR("%s: zone sectors is not consistent across all zoned devices",
>  		      dm_device_name(t->md));
>  		return -EINVAL;
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 31c16aafdbfc..2b6b3883471f 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -302,6 +302,7 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
>  	if (dm_table_supports_zone_append(t)) {
>  		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
>  		dm_cleanup_zoned_dev(md);
> +
>  		return 0;
>  	}
>  
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 920085dd7f3b..7dbd28b8de01 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -294,6 +294,15 @@ struct target_type {
>  #define dm_target_supports_mixed_zoned_model(type) (false)
>  #endif
>  
> +#ifdef CONFIG_BLK_DEV_ZONED
> +#define DM_TARGET_EMULATED_ZONES	0x00000400
> +#define dm_target_supports_emulated_zones(type) \
> +	((type)->features & DM_TARGET_EMULATED_ZONES)
> +#else
> +#define DM_TARGET_EMULATED_ZONES	0x00000000
> +#define dm_target_supports_emulated_zones(type) (false)
> +#endif
> +
>  struct dm_target {
>  	struct dm_table *table;
>  	struct target_type *type;


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 11/11] dm: add power-of-2 zoned target for non-power-of-2 zoned devices
@ 2022-07-28  4:33         ` Damien Le Moal
  0 siblings, 0 replies; 80+ messages in thread
From: Damien Le Moal @ 2022-07-28  4:33 UTC (permalink / raw)
  To: Pankaj Raghav, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Damien Le Moal, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling

On 7/28/22 01:22, Pankaj Raghav wrote:
> Only power-of-2(po2) zoned devices were supported in linux but now non
> power-of-2(npo2) zoned device support has been added to the block layer.

Please rephrase this spelling out "zone size" in the text. As mentioned
before, "power-of-2(po2) zoned devices" is very unclear to people not
familiar with zoned block devices.

> 
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zoned
> devices, it was suggested to create a dm target for npo2 zoned device to
> appear as a po2 target so that file systems can initially work without any
> explicit changes by using this target.
> 
> The design of this target is very simple: introduce gaps between the
> underlying device's zone size and the nearest po2 device zone size.
> Device's actual zone size becomes the zone capacity of the target and
> device's nearest po2 zone size becomes the target's zone size.

The design of this target is very simple: remap the device zone size to
the zone capacity and change the zone size to be the nearest power of 2
numner of sectors.

> 
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
> 
> Device layout :-
> zone capacity = 3M
> zone size = 3M
> 
> |--------------|-------------|
> 0             3M            6M
> 
> Target layout :-
> zone capacity=3M
> zone size = 4M
> 
> |--------------|---|--------------|---|
> 0             3M  4M             7M  8M
> 
> All IOs will be remapped from target to the actual device location.

That is what DM does... Instead of this obvious statement, explain how the
remapping is done.

> 
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
> 
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2z /dev/nvme<id>'

0 -> <start offset> ? Or are you allowing only entire devices ?

> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
> Suggested-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/md/Kconfig            |   9 ++
>  drivers/md/Makefile           |   2 +
>  drivers/md/dm-po2z-target.c   | 261 ++++++++++++++++++++++++++++++++++
>  drivers/md/dm-table.c         |  13 +-
>  drivers/md/dm-zone.c          |   1 +
>  include/linux/device-mapper.h |   9 ++
>  6 files changed, 292 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/md/dm-po2z-target.c
> 
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 998a5cfdbc4e..d58ccfee765b 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -518,6 +518,15 @@ config DM_FLAKEY
>  	help
>  	 A target that intermittently fails I/O for debugging purposes.
>  
> +config DM_PO2Z

Maybe DM_PO2_ZONE ?

or DM_ZONE_PO2SIZE ?

To be clearer...

> +	tristate "Power-of-2 target for non power-of-2 zoned devices"

...for zoned block devices with a zone size that is not a power of 2
number of sectors.

> +	depends on BLK_DEV_DM
> +	depends on BLK_DEV_ZONED
> +	help
> +	  A target that converts a zoned device with non power-of-2 zone size to
> +	  be power of 2. This is done by introducing gaps in between the zone

...to have zones with size equal to a power of 2 n umber of sectors. And
drop the "This done...".
How the target works and how to use it should be documented under
Documentation/admin-guide/device-mapper/

> +	  capacity and the power of 2 zone size.
> +
>  config DM_VERITY
>  	tristate "Verity target support"
>  	depends on BLK_DEV_DM
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 270f694850ec..98af4ed98f73 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
>  dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
>  dm-verity-y	+= dm-verity-target.o
>  dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
> +dm-po2z-y       += dm-po2z-target.o

dm-po2zone ?

Spelling out "zone" makes things clearer I think.

>  
>  md-mod-y	+= md.o md-bitmap.o
>  raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
> @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
>  obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
>  obj-$(CONFIG_DM_DUST)		+= dm-dust.o
>  obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
> +obj-$(CONFIG_DM_PO2Z)		+= dm-po2z.o
>  obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
>  obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
>  obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
> diff --git a/drivers/md/dm-po2z-target.c b/drivers/md/dm-po2z-target.c
> new file mode 100644
> index 000000000000..e9c5b7d00eda
> --- /dev/null
> +++ b/drivers/md/dm-po2z-target.c
> @@ -0,0 +1,261 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Samsung Electronics Co., Ltd.
> + */
> +
> +#include <linux/device-mapper.h>
> +
> +#define DM_MSG_PREFIX "po2z"

s/po2z/po2zone ?

> +
> +struct dm_po2z_target {
> +	struct dm_dev *dev;
> +	sector_t zone_size; /* Actual zone size of the underlying dev*/
> +	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
> +	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
> +	u32 nr_zones;

s/u32/unsigned int. This is not a hw driver.

> +};
> +
> +static inline u32 npo2_zone_no(struct dm_po2z_target *dmh, sector_t sect)
> +{
> +	return div64_u64(sect, dmh->zone_size);
> +}
> +
> +static inline u32 po2_zone_no(struct dm_po2z_target *dmh, sector_t sect)
> +{
> +	return sect >> ilog2(dmh->zone_size_po2);

You could saves the shift as "zone_size_po2_shift" in struct dm_po2z_target.

> +}
> +
> +static inline sector_t target_to_device_sect(struct dm_po2z_target *dmh,
> +					     sector_t sect)
> +{
> +	u32 zone_idx = po2_zone_no(dmh, sect);
> +
> +	return sect - (zone_idx * dmh->zone_size_diff);

The idx variable seems unnecessary. And use unsgined int, not u32.

> +}
> +
> +static inline sector_t device_to_target_sect(struct dm_po2z_target *dmh,
> +					     sector_t sect)
> +{
> +	u32 zone_idx = npo2_zone_no(dmh, sect);
> +
> +	return sect + (zone_idx * dmh->zone_size_diff);

same.

> +}
> +
> +/*
> + * This target works on the complete zoned device. Partial mapping is not
> + * supported.
> + * Construct a zoned po2 logical device: <dev-path>
> + */
> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +	struct dm_po2z_target *dmh = NULL;
> +	int ret;
> +	sector_t zone_size;
> +	sector_t dev_capacity;
> +
> +	if (argc < 1)

	if (argc != 1) ?

> +		return -EINVAL;
> +
> +	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
> +	if (!dmh)
> +		return -ENOMEM;
> +
> +	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +			    &dmh->dev);
> +
> +	if (ret) {
> +		ti->error = "Device lookup failed";
> +		return ret;

You are leaking dmh.

> +	}
> +
> +	zone_size = bdev_zone_sectors(dmh->dev->bdev);
> +	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
> +	if (ti->len != dev_capacity || ti->begin) {
> +		DMERR("%pg Partial mapping of the target not supported",
> +		      dmh->dev->bdev);

dmh is leaked.

> +		return -EINVAL;
> +	}
> +
> +	if (is_power_of_2(zone_size)) {
> +		DMERR("%pg: this target is not useful for power-of-2 zoned devices",
> +		      dmh->dev->bdev);
> +		return -EINVAL;

Same here.

And as suggested before, we could simply warn here but let the target be
created. If used with a power of 2 zone size device, it should still work.

> +	}
> +
> +	dmh->zone_size = zone_size;
> +	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
> +	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
> +	ti->private = dmh;
> +	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
> +	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
> +
> +	return 0;
> +}
> +
> +static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
> +				   void *data)
> +{
> +	struct dm_report_zones_args *args = data;
> +	struct dm_po2z_target *dmh = args->tgt->private;
> +
> +	zone->start = device_to_target_sect(dmh, zone->start);
> +	zone->wp = device_to_target_sect(dmh, zone->wp);

You could simplify this to a single call to device_to_target_sect() by
saving the diff between device zone start and remapped zone start and add
that diff to the wp field.

> +	zone->len = dmh->zone_size_po2;
> +	args->next_sector = zone->start + zone->len;
> +
> +	return args->orig_cb(zone, args->zone_idx++, args->orig_data);
> +}
> +
> +static int dm_po2z_report_zones(struct dm_target *ti,
> +				struct dm_report_zones_args *args,
> +				unsigned int nr_zones)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +	sector_t sect = po2_zone_no(dmh, args->next_sector) * dmh->zone_size;
> +
> +	return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
> +				   dm_po2z_report_zones_cb, args);
> +}
> +
> +static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
> +			  blk_status_t *error)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +
> +	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
> +		bio->bi_iter.bi_sector =
> +			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
> +
> +	return DM_ENDIO_DONE;
> +}
> +
> +static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +
> +	limits->chunk_sectors = dmh->zone_size_po2;
> +}
> +
> +static bool bio_across_emulated_zone_area(struct dm_po2z_target *dmh,
> +					  struct bio *bio)
> +{
> +	u32 zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
> +	sector_t size = bio->bi_iter.bi_size >> SECTOR_SHIFT;

Rename that to nr_sectors to be clear about the unit.

> +
> +	return (bio->bi_iter.bi_sector - (zone_idx * dmh->zone_size_po2) +
> +		size) > dmh->zone_size;

This is hard to read and actually seems very wrong. Shouldn't this be simply:

	return bio->bi_iter.bi_sector + nr_sectors >
		zone_idx * dmh->zone_size_po2 + dmh->zone_size;

Thatis, check that the BIO goes beyond the zone capacity of the target.
?

> +}
> +
> +static int dm_po2z_map_read(struct dm_po2z_target *dmh, struct bio *bio)
> +{
> +	sector_t start_sect, size, relative_sect_in_zone, split_io_pos;
> +	u32 zone_idx;
> +
> +	/*
> +	 * Read does not extend into the emulated zone area (area between
> +	 * zone capacity and zone size)
> +	 */
> +	if (!bio_across_emulated_zone_area(dmh, bio)) {

hu... why not use dm_accept_partial_bio() in dm_po2z_map() to never have
to deal with this here ? That is, dm_po2z_map_read() is called for any
read BIO fully within the zone capacity. And create a
dm_po2z_map_read_zero() helper for any read BIO that is after the zone
capacity. Way simpler I think.

> +		bio->bi_iter.bi_sector =
> +			target_to_device_sect(dmh, bio->bi_iter.bi_sector);
> +		return DM_MAPIO_REMAPPED;
> +	}
> +
> +	start_sect = bio->bi_iter.bi_sector;
> +	size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> +	zone_idx = po2_zone_no(dmh, start_sect);
> +	relative_sect_in_zone = start_sect - (zone_idx * dmh->zone_size_po2);
> +
> +	/*
> +	 * If the starting sector is in the emulated area then fill
> +	 * all the bio with zeros. If bio is across emulated zone boundary,
> +	 * split the bio across boundaries and fill zeros only for the
> +	 * bio that is between the zone capacity and the zone size.
> +	 */
> +	if (relative_sect_in_zone < dmh->zone_size &&
> +	    ((relative_sect_in_zone + size) > dmh->zone_size)) {
> +		split_io_pos = (zone_idx * dmh->zone_size_po2) + dmh->zone_size;
> +		dm_accept_partial_bio(bio, split_io_pos - start_sect);

See above. Do that in dm_po2z_map(). This will simplify this function.

> +		bio->bi_iter.bi_sector = target_to_device_sect(dmh, start_sect);
> +
> +		return DM_MAPIO_REMAPPED;
> +	} else if (relative_sect_in_zone >= dmh->zone_size &&
> +		   ((relative_sect_in_zone + size) > dmh->zone_size_po2)) {

No need for the else after return. And this can NEVER happen since BIOs
can never cross zone boundaries.

> +		split_io_pos = (zone_idx + 1) * dmh->zone_size_po2;
> +		dm_accept_partial_bio(bio, split_io_pos - start_sect);
> +	}
> +
> +	zero_fill_bio(bio);
> +	bio_endio(bio);
> +	return DM_MAPIO_SUBMITTED;
> +}
> +
> +static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +
> +	bio_set_dev(bio, dmh->dev->bdev);
> +	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
> +		/*
> +		 * Read operation on the emulated zone area (between zone capacity
> +		 * and zone size) will fill the bio with zeroes. Any other operation
> +		 * in the emulated area should return an error.
> +		 */
> +		if (bio_op(bio) == REQ_OP_READ)
> +			return dm_po2z_map_read(dmh, bio);
> +
> +		if (!bio_across_emulated_zone_area(dmh, bio)) {
> +			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
> +								       bio->bi_iter.bi_sector);
> +			return DM_MAPIO_REMAPPED;
> +		}
> +		return DM_MAPIO_KILL;
> +	}
> +	return DM_MAPIO_REMAPPED;
> +}
> +
> +static int dm_po2z_iterate_devices(struct dm_target *ti,
> +				   iterate_devices_callout_fn fn, void *data)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +	sector_t len = dmh->nr_zones * dmh->zone_size;
> +
> +	return fn(ti, dmh->dev, 0, len, data);
> +}
> +
> +static struct target_type dm_po2z_target = {
> +	.name = "po2z",
> +	.version = { 1, 0, 0 },
> +	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
> +	.map = dm_po2z_map,
> +	.end_io = dm_po2z_end_io,
> +	.report_zones = dm_po2z_report_zones,
> +	.iterate_devices = dm_po2z_iterate_devices,
> +	.module = THIS_MODULE,
> +	.io_hints = dm_po2z_io_hints,
> +	.ctr = dm_po2z_ctr,
> +};
> +
> +static int __init dm_po2z_init(void)
> +{
> +	int r = dm_register_target(&dm_po2z_target);
> +
> +	if (r < 0)
> +		DMERR("register failed %d", r);
> +
> +	return r;

Simplify:

	return dm_register_target(&dm_po2z_target);

> +}
> +
> +static void __exit dm_po2z_exit(void)
> +{
> +	dm_unregister_target(&dm_po2z_target);
> +}
> +
> +/* Module hooks */
> +module_init(dm_po2z_init);
> +module_exit(dm_po2z_exit);
> +
> +MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
> +MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
> +MODULE_LICENSE("GPL");
> +

the changes below should be a different prep patch.

> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 534fddfc2b42..d77feff124af 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1614,13 +1614,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
>  	return true;
>  }
>  
> -static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
> +/*
> + * Callback function to check for device zone sector across devices. If the
> + * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
> + * should have the same zone sector as the underlying devices.
> + */
> +static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
>  					   sector_t start, sector_t len, void *data)
>  {
>  	unsigned int *zone_sectors = data;
>  
> -	if (!bdev_is_zoned(dev->bdev))
> +	if (!bdev_is_zoned(dev->bdev) ||
> +	    dm_target_supports_emulated_zones(ti->type))
>  		return 0;
> +
>  	return bdev_zone_sectors(dev->bdev) != *zone_sectors;
>  }
>  
> @@ -1646,7 +1653,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
>  	if (!zone_sectors)
>  		return -EINVAL;
>  
> -	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
> +	if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
>  		DMERR("%s: zone sectors is not consistent across all zoned devices",
>  		      dm_device_name(t->md));
>  		return -EINVAL;
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 31c16aafdbfc..2b6b3883471f 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -302,6 +302,7 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
>  	if (dm_table_supports_zone_append(t)) {
>  		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
>  		dm_cleanup_zoned_dev(md);
> +
>  		return 0;
>  	}
>  
> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 920085dd7f3b..7dbd28b8de01 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -294,6 +294,15 @@ struct target_type {
>  #define dm_target_supports_mixed_zoned_model(type) (false)
>  #endif
>  
> +#ifdef CONFIG_BLK_DEV_ZONED
> +#define DM_TARGET_EMULATED_ZONES	0x00000400
> +#define dm_target_supports_emulated_zones(type) \
> +	((type)->features & DM_TARGET_EMULATED_ZONES)
> +#else
> +#define DM_TARGET_EMULATED_ZONES	0x00000000
> +#define dm_target_supports_emulated_zones(type) (false)
> +#endif
> +
>  struct dm_target {
>  	struct dm_table *table;
>  	struct target_type *type;


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 00/11] support non power of 2 zoned device
  2022-07-27 23:19     ` [dm-devel] " Bart Van Assche
@ 2022-07-28 11:57       ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-28 11:57 UTC (permalink / raw)
  To: Bart Van Assche, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme

Hi Bart,

On 2022-07-28 01:19, Bart Van Assche wrote:
> On 7/27/22 09:22, Pankaj Raghav wrote:
>> This series adds support to npo2 zoned devices in the block and nvme
>> layer and a new **dm target** is added: dm-po2z-target. This new
>> target will be initially used for filesystems such as btrfs and
>> f2fs that does not have native npo2 zone support.
> 
> Should any SCSI changes be included in this patch series? From sd_zbc.c:
> 
>     if (!is_power_of_2(zone_blocks)) {
>         sd_printk(KERN_ERR, sdkp,
>               "Zone size %llu is not a power of two.\n",
>               zone_blocks);
>         return -EINVAL;
>     }
> 
I would keep these changes out of the current patch series because it
will also increase the test scope. I think once the block layer
constraint is removed as a part of this series, we can work on the SCSI
changes in the next cycle.
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
@ 2022-07-28 11:57       ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-28 11:57 UTC (permalink / raw)
  To: Bart Van Assche, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling

Hi Bart,

On 2022-07-28 01:19, Bart Van Assche wrote:
> On 7/27/22 09:22, Pankaj Raghav wrote:
>> This series adds support to npo2 zoned devices in the block and nvme
>> layer and a new **dm target** is added: dm-po2z-target. This new
>> target will be initially used for filesystems such as btrfs and
>> f2fs that does not have native npo2 zone support.
> 
> Should any SCSI changes be included in this patch series? From sd_zbc.c:
> 
>     if (!is_power_of_2(zone_blocks)) {
>         sd_printk(KERN_ERR, sdkp,
>               "Zone size %llu is not a power of two.\n",
>               zone_blocks);
>         return -EINVAL;
>     }
> 
I would keep these changes out of the current patch series because it
will also increase the test scope. I think once the block layer
constraint is removed as a part of this series, we can work on the SCSI
changes in the next cycle.
> Thanks,
> 
> Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-07-27 23:16         ` [dm-devel] " Bart Van Assche
@ 2022-07-28 12:11           ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-28 12:11 UTC (permalink / raw)
  To: Bart Van Assche, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme, Luis Chamberlain

On 2022-07-28 01:16, Bart Van Assche wrote:
> On 7/27/22 09:22, Pankaj Raghav wrote:
>> Checking if a given sector is aligned to a zone is a common
>> operation that is performed for zoned devices. Add
>> bdev_is_zone_start helper to check for this instead of opencoding it
>> everywhere.
> 
> I can't find the bdev_is_zone_start() function in this patch?
> 
I made the name change from bdev_is_zone_start to bdev_is_zone_aligned
last moment and missed changing it in the commit log.

>> To make this work bdev_get_queue(), bdev_zone_sectors() and
>> bdev_is_zoned() are moved earlier without modifications.
> 
> Can that change perhaps be isolated into a separate patch?
> 
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 3d286a256d3d..1f7e9a90e198 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -570,7 +570,7 @@ static inline blk_status_t
>> blk_check_zone_append(struct request_queue *q,
>>           return BLK_STS_NOTSUPP;
>>         /* The bio sector must point to the start of a sequential zone */
>> -    if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) -
>> 1) ||
>> +    if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
>>           !bio_zone_is_seq(bio))
>>           return BLK_STS_IOERR;
> 
> The bdev_is_zone_start() name seems more clear to me than
> bdev_is_zone_aligned(). Has there already been a discussion about which
> name to use for this function?
> 
The reason I did s/bdev_is_zone_start/bdev_is_zone_aligned is that this
name makes more sense for also checking if a given size is a multiple of
zone sectors for e.g., used in PATCH 9:

-		if (len & (zone_sectors - 1)) {
+		if (!bdev_is_zone_aligned(bdev, len)) {

I felt `bdev_is_zone_aligned` fits the use case of checking if the
sector starts at the start of a zone and also check if a given length of
sectors also align with the zone sectors. bdev_is_zone_start does not
make the intention clear for the latter use case IMO.

But I am fine with going back to bdev_is_zone_start if you and Damien
feel strongly otherwise.
>> +        /*
>> +         * Non power-of-2 zone size support was added to remove the
>> +         * gap between zone capacity and zone size. Though it is
>> technically
>> +         * possible to have gaps in a non power-of-2 device, Linux
>> requires
>> +         * the zone size to be equal to zone capacity for non power-of-2
>> +         * zoned devices.
>> +         */
>> +        if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
>> +            pr_warn("%s: Invalid zone capacity for non power of 2
>> zone size",
>> +                disk->disk_name);
> 
> Given the severity of this error, shouldn't the zone capacity and length
> be reported in the error message?
> 
Ok.
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
@ 2022-07-28 12:11           ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-28 12:11 UTC (permalink / raw)
  To: Bart Van Assche, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling, Luis Chamberlain

On 2022-07-28 01:16, Bart Van Assche wrote:
> On 7/27/22 09:22, Pankaj Raghav wrote:
>> Checking if a given sector is aligned to a zone is a common
>> operation that is performed for zoned devices. Add
>> bdev_is_zone_start helper to check for this instead of opencoding it
>> everywhere.
> 
> I can't find the bdev_is_zone_start() function in this patch?
> 
I made the name change from bdev_is_zone_start to bdev_is_zone_aligned
last moment and missed changing it in the commit log.

>> To make this work bdev_get_queue(), bdev_zone_sectors() and
>> bdev_is_zoned() are moved earlier without modifications.
> 
> Can that change perhaps be isolated into a separate patch?
> 
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 3d286a256d3d..1f7e9a90e198 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -570,7 +570,7 @@ static inline blk_status_t
>> blk_check_zone_append(struct request_queue *q,
>>           return BLK_STS_NOTSUPP;
>>         /* The bio sector must point to the start of a sequential zone */
>> -    if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) -
>> 1) ||
>> +    if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
>>           !bio_zone_is_seq(bio))
>>           return BLK_STS_IOERR;
> 
> The bdev_is_zone_start() name seems more clear to me than
> bdev_is_zone_aligned(). Has there already been a discussion about which
> name to use for this function?
> 
The reason I did s/bdev_is_zone_start/bdev_is_zone_aligned is that this
name makes more sense for also checking if a given size is a multiple of
zone sectors for e.g., used in PATCH 9:

-		if (len & (zone_sectors - 1)) {
+		if (!bdev_is_zone_aligned(bdev, len)) {

I felt `bdev_is_zone_aligned` fits the use case of checking if the
sector starts at the start of a zone and also check if a given length of
sectors also align with the zone sectors. bdev_is_zone_start does not
make the intention clear for the latter use case IMO.

But I am fine with going back to bdev_is_zone_start if you and Damien
feel strongly otherwise.
>> +        /*
>> +         * Non power-of-2 zone size support was added to remove the
>> +         * gap between zone capacity and zone size. Though it is
>> technically
>> +         * possible to have gaps in a non power-of-2 device, Linux
>> requires
>> +         * the zone size to be equal to zone capacity for non power-of-2
>> +         * zoned devices.
>> +         */
>> +        if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
>> +            pr_warn("%s: Invalid zone capacity for non power of 2
>> zone size",
>> +                disk->disk_name);
> 
> Given the severity of this error, shouldn't the zone capacity and length
> be reported in the error message?
> 
Ok.
> Thanks,
> 
> Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed
  2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28 12:15         ` David Sterba
  -1 siblings, 0 replies; 80+ messages in thread
From: David Sterba @ 2022-07-28 12:15 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn,
	matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On Wed, Jul 27, 2022 at 06:22:41PM +0200, Pankaj Raghav wrote:
> --- a/drivers/md/dm-zoned-target.c
> +++ b/drivers/md/dm-zoned-target.c
> @@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>  				return -EINVAL;
>  			}
>  			zone_nr_sectors = bdev_zone_sectors(bdev);
> +			if (!is_power_of_2(zone_nr_sectors)) {
> +				ti->error = "Zone size is not power of 2";

This could print what's the value of zone_nr_sectors

> +				return -EINVAL;
> +			}
>  			zoned_dev->zone_nr_sectors = zone_nr_sectors;
>  			zoned_dev->nr_zones = bdev_nr_zones(bdev);
>  		}
> @@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>  			return -EINVAL;
>  		}
>  		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
> +		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
> +			ti->error = "Zone size is not power of 2";

Same

> +			return -EINVAL;
> +		}
>  		zoned_dev->nr_zones = bdev_nr_zones(bdev);
>  	}
>  
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed
@ 2022-07-28 12:15         ` David Sterba
  0 siblings, 0 replies; 80+ messages in thread
From: David Sterba @ 2022-07-28 12:15 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, bvanassche, pankydev8, Johannes.Thumshirn, damien.lemoal,
	snitzer, linux-kernel, linux-nvme, linux-block, dm-devel,
	gost.dev, jaegeuk, hch, matias.bjorling, Luis Chamberlain

On Wed, Jul 27, 2022 at 06:22:41PM +0200, Pankaj Raghav wrote:
> --- a/drivers/md/dm-zoned-target.c
> +++ b/drivers/md/dm-zoned-target.c
> @@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>  				return -EINVAL;
>  			}
>  			zone_nr_sectors = bdev_zone_sectors(bdev);
> +			if (!is_power_of_2(zone_nr_sectors)) {
> +				ti->error = "Zone size is not power of 2";

This could print what's the value of zone_nr_sectors

> +				return -EINVAL;
> +			}
>  			zoned_dev->zone_nr_sectors = zone_nr_sectors;
>  			zoned_dev->nr_zones = bdev_nr_zones(bdev);
>  		}
> @@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>  			return -EINVAL;
>  		}
>  		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
> +		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
> +			ti->error = "Zone size is not power of 2";

Same

> +			return -EINVAL;
> +		}
>  		zoned_dev->nr_zones = bdev_nr_zones(bdev);
>  	}
>  
> -- 
> 2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-07-28  3:07         ` [dm-devel] " Damien Le Moal
@ 2022-07-28 12:24           ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-28 12:24 UTC (permalink / raw)
  To: Damien Le Moal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 2022-07-28 05:07, Damien Le Moal wrote:
> On 7/28/22 01:22, Pankaj Raghav wrote:
>> Checking if a given sector is aligned to a zone is a common
>> operation that is performed for zoned devices. Add
>> bdev_is_zone_start helper to check for this instead of opencoding it
> 
> The patch actually introduces bdev_is_zone_aligned(). I agree with Bart
> that bdev_is_zone_start() is a better name.
I have posted my rationale behind this change in my reply to Bart. Let
me know what you think.
>

<snip>
>>  		args->zone_sectors = zone->len;
>> -		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
>> +		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
> 
> 		args->nr_zones = disk_zone_no(disk, capacity);
> 
We are doing a round up with a division here mainly to take into account
the last unequal zone if present. disk_zone_no does just a division so
it won't account for the unequal last zone.

>>  	} else if (zone->start + args->zone_sectors < capacity) {
>>  		if (zone->len != args->zone_sectors) {
>>  			pr_warn("%s: Invalid zoned device with non constant zone size\n",
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 85b832908f28..1be805223026 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -634,6 +634,11 @@ static inline bool queue_is_mq(struct request_queue *q)
>>  	return q->mq_ops;
>>  }
>>  
>> +static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>> +{
>> +	return bdev->bd_queue;	/* this is never NULL */
>> +}
>> +
>>  #ifdef CONFIG_PM
>>  static inline enum rpm_status queue_rpm_status(struct request_queue *q)
>>  {
>> @@ -665,6 +670,25 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
>>  	}
>>  }
>>  
>> +static inline bool bdev_is_zoned(struct block_device *bdev)
>> +{
>> +	struct request_queue *q = bdev_get_queue(bdev);
>> +
>> +	if (q)
>> +		return blk_queue_is_zoned(q);
>> +
>> +	return false;
>> +}
>> +
>> +static inline sector_t bdev_zone_sectors(struct block_device *bdev)
>> +{
>> +	struct request_queue *q = bdev_get_queue(bdev);
>> +
>> +	if (!blk_queue_is_zoned(q))
>> +		return 0;
>> +	return q->limits.chunk_sectors;
>> +}
>> +
>>  #ifdef CONFIG_BLK_DEV_ZONED
>>  static inline unsigned int disk_nr_zones(struct gendisk *disk)
>>  {
>> @@ -684,6 +708,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>>  	return div64_u64(sector, zone_sectors);
>>  }
>>  
>> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
>> +						   sector_t sec)
>> +{
>> +	sector_t zone_sectors = bdev_zone_sectors(bdev);
>> +	u64 remainder = 0;
>> +
>> +	if (!bdev_is_zoned(bdev))
>> +		return 0;
>> +
>> +	if (is_power_of_2(zone_sectors))
>> +		return sec & (zone_sectors - 1);
>> +
>> +	div64_u64_rem(sec, zone_sectors, &remainder);
>> +	return remainder;
>> +}
>> +
>> +static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
>> +{
>> +	if (!bdev_is_zoned(bdev))
>> +		return false;
> 
> This is checked in bdev_offset_from_zone_start(). No need to add it again
> here.
> 
bdev_offset_from_zone_start returns 0 if the device is not zoned, and
the below check will then return `true`. That is why I explicitly return
a false if the device is not zoned.
>> +
>> +	return bdev_offset_from_zone_start(bdev, sec) == 0;
>> +}
>> +
>>  static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
>>  {
>>  	if (!blk_queue_is_zoned(disk->queue))
>> @@ -728,6 +776,18 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>>  {
>>  	return 0;
>>  }
>> +
>> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
>> +						   sector_t sec)
>> +{
>> +	return 0;
>> +}
> 
> This one is not used when CONFIG_BLK_DEV_ZONED is not set. No need to
> define it.
> 
Ok. I will remove it if it is not required.
>> +
>> +static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
>> +{

> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
@ 2022-07-28 12:24           ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-28 12:24 UTC (permalink / raw)
  To: Damien Le Moal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 2022-07-28 05:07, Damien Le Moal wrote:
> On 7/28/22 01:22, Pankaj Raghav wrote:
>> Checking if a given sector is aligned to a zone is a common
>> operation that is performed for zoned devices. Add
>> bdev_is_zone_start helper to check for this instead of opencoding it
> 
> The patch actually introduces bdev_is_zone_aligned(). I agree with Bart
> that bdev_is_zone_start() is a better name.
I have posted my rationale behind this change in my reply to Bart. Let
me know what you think.
>

<snip>
>>  		args->zone_sectors = zone->len;
>> -		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
>> +		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
> 
> 		args->nr_zones = disk_zone_no(disk, capacity);
> 
We are doing a round up with a division here mainly to take into account
the last unequal zone if present. disk_zone_no does just a division so
it won't account for the unequal last zone.

>>  	} else if (zone->start + args->zone_sectors < capacity) {
>>  		if (zone->len != args->zone_sectors) {
>>  			pr_warn("%s: Invalid zoned device with non constant zone size\n",
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 85b832908f28..1be805223026 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -634,6 +634,11 @@ static inline bool queue_is_mq(struct request_queue *q)
>>  	return q->mq_ops;
>>  }
>>  
>> +static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>> +{
>> +	return bdev->bd_queue;	/* this is never NULL */
>> +}
>> +
>>  #ifdef CONFIG_PM
>>  static inline enum rpm_status queue_rpm_status(struct request_queue *q)
>>  {
>> @@ -665,6 +670,25 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
>>  	}
>>  }
>>  
>> +static inline bool bdev_is_zoned(struct block_device *bdev)
>> +{
>> +	struct request_queue *q = bdev_get_queue(bdev);
>> +
>> +	if (q)
>> +		return blk_queue_is_zoned(q);
>> +
>> +	return false;
>> +}
>> +
>> +static inline sector_t bdev_zone_sectors(struct block_device *bdev)
>> +{
>> +	struct request_queue *q = bdev_get_queue(bdev);
>> +
>> +	if (!blk_queue_is_zoned(q))
>> +		return 0;
>> +	return q->limits.chunk_sectors;
>> +}
>> +
>>  #ifdef CONFIG_BLK_DEV_ZONED
>>  static inline unsigned int disk_nr_zones(struct gendisk *disk)
>>  {
>> @@ -684,6 +708,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>>  	return div64_u64(sector, zone_sectors);
>>  }
>>  
>> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
>> +						   sector_t sec)
>> +{
>> +	sector_t zone_sectors = bdev_zone_sectors(bdev);
>> +	u64 remainder = 0;
>> +
>> +	if (!bdev_is_zoned(bdev))
>> +		return 0;
>> +
>> +	if (is_power_of_2(zone_sectors))
>> +		return sec & (zone_sectors - 1);
>> +
>> +	div64_u64_rem(sec, zone_sectors, &remainder);
>> +	return remainder;
>> +}
>> +
>> +static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
>> +{
>> +	if (!bdev_is_zoned(bdev))
>> +		return false;
> 
> This is checked in bdev_offset_from_zone_start(). No need to add it again
> here.
> 
bdev_offset_from_zone_start returns 0 if the device is not zoned, and
the below check will then return `true`. That is why I explicitly return
a false if the device is not zoned.
>> +
>> +	return bdev_offset_from_zone_start(bdev, sec) == 0;
>> +}
>> +
>>  static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
>>  {
>>  	if (!blk_queue_is_zoned(disk->queue))
>> @@ -728,6 +776,18 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
>>  {
>>  	return 0;
>>  }
>> +
>> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
>> +						   sector_t sec)
>> +{
>> +	return 0;
>> +}
> 
> This one is not used when CONFIG_BLK_DEV_ZONED is not set. No need to
> define it.
> 
Ok. I will remove it if it is not required.
>> +
>> +static inline bool bdev_is_zone_aligned(struct block_device *bdev, sector_t sec)
>> +{

> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-07-28 12:11           ` [dm-devel] " Pankaj Raghav
@ 2022-07-28 13:29             ` Bart Van Assche
  -1 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-28 13:29 UTC (permalink / raw)
  To: Pankaj Raghav, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme, Luis Chamberlain

On 7/28/22 05:11, Pankaj Raghav wrote:
> On 2022-07-28 01:16, Bart Van Assche wrote:
>> The bdev_is_zone_start() name seems more clear to me than
>> bdev_is_zone_aligned(). Has there already been a discussion about which
>> name to use for this function?
>>
> The reason I did s/bdev_is_zone_start/bdev_is_zone_aligned is that this
> name makes more sense for also checking if a given size is a multiple of
> zone sectors for e.g., used in PATCH 9:
> 
> -		if (len & (zone_sectors - 1)) {
> +		if (!bdev_is_zone_aligned(bdev, len)) {
> 
> I felt `bdev_is_zone_aligned` fits the use case of checking if the
> sector starts at the start of a zone and also check if a given length of
> sectors also align with the zone sectors. bdev_is_zone_start does not
> make the intention clear for the latter use case IMO.
> 
> But I am fine with going back to bdev_is_zone_start if you and Damien
> feel strongly otherwise.
The "zone start LBA" terminology occurs in ZBC-1, ZBC-2 and ZNS but 
"zone aligned" not. I prefer "zone start" because it is clear, 
unambiguous and because it has the same meaning as in the corresponding 
standards documents. I propose to proceed as follows for checking 
whether a number of LBAs is a multiple of the zone length:
* Either use bdev_is_zone_start() directly.
* Or introduce a synonym for bdev_is_zone_start() with an appropriate 
name, e.g. bdev_is_zone_len_multiple().

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
@ 2022-07-28 13:29             ` Bart Van Assche
  0 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-28 13:29 UTC (permalink / raw)
  To: Pankaj Raghav, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling, Luis Chamberlain

On 7/28/22 05:11, Pankaj Raghav wrote:
> On 2022-07-28 01:16, Bart Van Assche wrote:
>> The bdev_is_zone_start() name seems more clear to me than
>> bdev_is_zone_aligned(). Has there already been a discussion about which
>> name to use for this function?
>>
> The reason I did s/bdev_is_zone_start/bdev_is_zone_aligned is that this
> name makes more sense for also checking if a given size is a multiple of
> zone sectors for e.g., used in PATCH 9:
> 
> -		if (len & (zone_sectors - 1)) {
> +		if (!bdev_is_zone_aligned(bdev, len)) {
> 
> I felt `bdev_is_zone_aligned` fits the use case of checking if the
> sector starts at the start of a zone and also check if a given length of
> sectors also align with the zone sectors. bdev_is_zone_start does not
> make the intention clear for the latter use case IMO.
> 
> But I am fine with going back to bdev_is_zone_start if you and Damien
> feel strongly otherwise.
The "zone start LBA" terminology occurs in ZBC-1, ZBC-2 and ZNS but 
"zone aligned" not. I prefer "zone start" because it is clear, 
unambiguous and because it has the same meaning as in the corresponding 
standards documents. I propose to proceed as follows for checking 
whether a number of LBAs is a multiple of the zone length:
* Either use bdev_is_zone_start() directly.
* Or introduce a synonym for bdev_is_zone_start() with an appropriate 
name, e.g. bdev_is_zone_len_multiple().

Thanks,

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 00/11] support non power of 2 zoned device
  2022-07-28 11:57       ` [dm-devel] " Pankaj Raghav
@ 2022-07-28 13:30         ` Bart Van Assche
  -1 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-28 13:30 UTC (permalink / raw)
  To: Pankaj Raghav, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme

On 7/28/22 04:57, Pankaj Raghav wrote:
> Hi Bart,
> 
> On 2022-07-28 01:19, Bart Van Assche wrote:
>> On 7/27/22 09:22, Pankaj Raghav wrote:
>>> This series adds support to npo2 zoned devices in the block and nvme
>>> layer and a new **dm target** is added: dm-po2z-target. This new
>>> target will be initially used for filesystems such as btrfs and
>>> f2fs that does not have native npo2 zone support.
>>
>> Should any SCSI changes be included in this patch series? From sd_zbc.c:
>>
>>      if (!is_power_of_2(zone_blocks)) {
>>          sd_printk(KERN_ERR, sdkp,
>>                "Zone size %llu is not a power of two.\n",
>>                zone_blocks);
>>          return -EINVAL;
>>      }
>>
> I would keep these changes out of the current patch series because it
> will also increase the test scope. I think once the block layer
> constraint is removed as a part of this series, we can work on the SCSI
> changes in the next cycle.

That's fine with me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
@ 2022-07-28 13:30         ` Bart Van Assche
  0 siblings, 0 replies; 80+ messages in thread
From: Bart Van Assche @ 2022-07-28 13:30 UTC (permalink / raw)
  To: Pankaj Raghav, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling

On 7/28/22 04:57, Pankaj Raghav wrote:
> Hi Bart,
> 
> On 2022-07-28 01:19, Bart Van Assche wrote:
>> On 7/27/22 09:22, Pankaj Raghav wrote:
>>> This series adds support to npo2 zoned devices in the block and nvme
>>> layer and a new **dm target** is added: dm-po2z-target. This new
>>> target will be initially used for filesystems such as btrfs and
>>> f2fs that does not have native npo2 zone support.
>>
>> Should any SCSI changes be included in this patch series? From sd_zbc.c:
>>
>>      if (!is_power_of_2(zone_blocks)) {
>>          sd_printk(KERN_ERR, sdkp,
>>                "Zone size %llu is not a power of two.\n",
>>                zone_blocks);
>>          return -EINVAL;
>>      }
>>
> I would keep these changes out of the current patch series because it
> will also increase the test scope. I think once the block layer
> constraint is removed as a part of this series, we can work on the SCSI
> changes in the next cycle.

That's fine with me.

Thanks,

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 04/11] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  2022-07-28  3:09         ` [dm-devel] " Damien Le Moal
@ 2022-07-29  7:45           ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  7:45 UTC (permalink / raw)
  To: Damien Le Moal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling,
	Luis Chamberlain

On 2022-07-28 05:09, Damien Le Moal wrote:

>>  }
> 
> This change should go into patch 3. Otherwise, adding patch 3 only will
> break the nvme target zone code.
> 
Ok.
>>  
>>  static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 1be805223026..d1ef9b9552ed 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1350,6 +1350,11 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
>>  	return BLK_ZONED_NONE;
>>  }
>>  
>> +static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
>> +{
>> +	return disk_zone_no(bdev->bd_disk, sec);
>> +}
>> +
> 
> This should go into a prep patch before patch 3.
> 
Bart commented in the earlier versions to see the new helpers along with
its usage instead of having it separately in a prep patch.

>>  static inline int queue_dma_alignment(const struct request_queue *q)
>>  {
>>  	return q ? q->dma_alignment : 511;
> 
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 04/11] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
@ 2022-07-29  7:45           ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  7:45 UTC (permalink / raw)
  To: Damien Le Moal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 2022-07-28 05:09, Damien Le Moal wrote:

>>  }
> 
> This change should go into patch 3. Otherwise, adding patch 3 only will
> break the nvme target zone code.
> 
Ok.
>>  
>>  static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 1be805223026..d1ef9b9552ed 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1350,6 +1350,11 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
>>  	return BLK_ZONED_NONE;
>>  }
>>  
>> +static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
>> +{
>> +	return disk_zone_no(bdev->bd_disk, sec);
>> +}
>> +
> 
> This should go into a prep patch before patch 3.
> 
Bart commented in the earlier versions to see the new helpers along with
its usage instead of having it separately in a prep patch.

>>  static inline int queue_dma_alignment(const struct request_queue *q)
>>  {
>>  	return q ? q->dma_alignment : 511;
> 
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices
  2022-07-27 21:59         ` [dm-devel] " Chaitanya Kulkarni
@ 2022-07-29  7:51           ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  7:51 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: snitzer, matias.bjorling, Johannes.Thumshirn, gost.dev,
	linux-kernel, hare, hch, axboe, damien.lemoal, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 2022-07-27 23:59, Chaitanya Kulkarni wrote:
>> Sequential Write:
>>
>> x-----------------x---------------------------------x---------------------------------x
>> |     IOdepth     |            8                    |            16                   |
>> x-----------------x---------------------------------x---------------------------------x
>> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
>> x-----------------x---------------------------------x---------------------------------x
>> | Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
>> x-----------------x---------------------------------x---------------------------------x
>> |  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
>> x-----------------x---------------------------------x---------------------------------x
>>
>> Sequential read:
>>
>> x-----------------x---------------------------------x---------------------------------x
>> | IOdepth         |            8                    |            16                   |
>> x-----------------x---------------------------------x---------------------------------x
>> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
>> x-----------------x---------------------------------x---------------------------------x
>> | Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
>> x-----------------x---------------------------------x---------------------------------x
>> |  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
>> x-----------------x---------------------------------x---------------------------------x
>>
>> Random read:
>>
>> x-----------------x---------------------------------x---------------------------------x
>> | IOdepth         |            8                    |            16                   |
>> x-----------------x---------------------------------x---------------------------------x
>> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
>> x-----------------x---------------------------------x---------------------------------x
>> | Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
>> x-----------------x---------------------------------x---------------------------------x
>> |  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
>> x-----------------x---------------------------------x---------------------------------x
>>
>> Minor variations are noticed in Sequential write with io depth 8 and
>> in random read with io depth 16. But overall no noticeable differences
>> were noticed
> 
> minor variations in with aspect of the performance ?
> are these documented somewhere ?
> 
The table above documents those minor variations in performance.
> move the large table of performance numbers to the cover letter it looks 
> ugly in the git log...
> 
I could do that.
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices
@ 2022-07-29  7:51           ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  7:51 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: axboe, bvanassche, pankydev8, Johannes.Thumshirn, damien.lemoal,
	snitzer, linux-kernel, linux-nvme, linux-block, dm-devel,
	gost.dev, jaegeuk, hch, matias.bjorling, Luis Chamberlain

On 2022-07-27 23:59, Chaitanya Kulkarni wrote:
>> Sequential Write:
>>
>> x-----------------x---------------------------------x---------------------------------x
>> |     IOdepth     |            8                    |            16                   |
>> x-----------------x---------------------------------x---------------------------------x
>> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
>> x-----------------x---------------------------------x---------------------------------x
>> | Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
>> x-----------------x---------------------------------x---------------------------------x
>> |  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
>> x-----------------x---------------------------------x---------------------------------x
>>
>> Sequential read:
>>
>> x-----------------x---------------------------------x---------------------------------x
>> | IOdepth         |            8                    |            16                   |
>> x-----------------x---------------------------------x---------------------------------x
>> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
>> x-----------------x---------------------------------x---------------------------------x
>> | Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
>> x-----------------x---------------------------------x---------------------------------x
>> |  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
>> x-----------------x---------------------------------x---------------------------------x
>>
>> Random read:
>>
>> x-----------------x---------------------------------x---------------------------------x
>> | IOdepth         |            8                    |            16                   |
>> x-----------------x---------------------------------x---------------------------------x
>> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
>> x-----------------x---------------------------------x---------------------------------x
>> | Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
>> x-----------------x---------------------------------x---------------------------------x
>> |  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
>> x-----------------x---------------------------------x---------------------------------x
>>
>> Minor variations are noticed in Sequential write with io depth 8 and
>> in random read with io depth 16. But overall no noticeable differences
>> were noticed
> 
> minor variations in with aspect of the performance ?
> are these documented somewhere ?
> 
The table above documents those minor variations in performance.
> move the large table of performance numbers to the cover letter it looks 
> ugly in the git log...
> 
I could do that.
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed
  2022-07-28 12:15         ` [dm-devel] " David Sterba
@ 2022-07-29  8:00           ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  8:00 UTC (permalink / raw)
  To: dsterba, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn,
	matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 2022-07-28 14:15, David Sterba wrote:
> On Wed, Jul 27, 2022 at 06:22:41PM +0200, Pankaj Raghav wrote:
>> --- a/drivers/md/dm-zoned-target.c
>> +++ b/drivers/md/dm-zoned-target.c
>> @@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>>  				return -EINVAL;
>>  			}
>>  			zone_nr_sectors = bdev_zone_sectors(bdev);
>> +			if (!is_power_of_2(zone_nr_sectors)) {
>> +				ti->error = "Zone size is not power of 2";
> 
> This could print what's the value of zone_nr_sectors
Ok. I will rephrase based on Damien's comment and add the
zone_nr_sectors to be included. Thanks.
> 
>> +				return -EINVAL;
>> +			}
>>  			zoned_dev->zone_nr_sectors = zone_nr_sectors;
>>  			zoned_dev->nr_zones = bdev_nr_zones(bdev);
>>  		}
>> @@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>>  			return -EINVAL;
>>  		}
>>  		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
>> +		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
>> +			ti->error = "Zone size is not power of 2";
> 
> Same
> 
>> +			return -EINVAL;
>> +		}
>>  		zoned_dev->nr_zones = bdev_nr_zones(bdev);
>>  	}
>>  
>> -- 
>> 2.25.1

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed
@ 2022-07-29  8:00           ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  8:00 UTC (permalink / raw)
  To: dsterba, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn,
	matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Luis Chamberlain

On 2022-07-28 14:15, David Sterba wrote:
> On Wed, Jul 27, 2022 at 06:22:41PM +0200, Pankaj Raghav wrote:
>> --- a/drivers/md/dm-zoned-target.c
>> +++ b/drivers/md/dm-zoned-target.c
>> @@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>>  				return -EINVAL;
>>  			}
>>  			zone_nr_sectors = bdev_zone_sectors(bdev);
>> +			if (!is_power_of_2(zone_nr_sectors)) {
>> +				ti->error = "Zone size is not power of 2";
> 
> This could print what's the value of zone_nr_sectors
Ok. I will rephrase based on Damien's comment and add the
zone_nr_sectors to be included. Thanks.
> 
>> +				return -EINVAL;
>> +			}
>>  			zoned_dev->zone_nr_sectors = zone_nr_sectors;
>>  			zoned_dev->nr_zones = bdev_nr_zones(bdev);
>>  		}
>> @@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
>>  			return -EINVAL;
>>  		}
>>  		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
>> +		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
>> +			ti->error = "Zone size is not power of 2";
> 
> Same
> 
>> +			return -EINVAL;
>> +		}
>>  		zoned_dev->nr_zones = bdev_nr_zones(bdev);
>>  	}
>>  
>> -- 
>> 2.25.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 10/11] dm: call dm_zone_endio after the target endio callback for zoned devices
  2022-07-28  3:25         ` [dm-devel] " Damien Le Moal
@ 2022-07-29  8:48           ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  8:48 UTC (permalink / raw)
  To: Damien Le Moal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme

>>  	if (endio) {
>>  		int r = endio(ti, bio, &error);
>>  		switch (r) {
>> @@ -1155,6 +1151,10 @@ static void clone_endio(struct bio *bio)
>>  		}
>>  	}
>>  
>> +	if (static_branch_unlikely(&zoned_enabled) &&
>> +	    unlikely(bdev_is_zoned(bio->bi_bdev)))
>> +		dm_zone_endio(io, bio);
>> +
>>  	if (static_branch_unlikely(&swap_bios_enabled) &&
>>  	    unlikely(swap_bios_limit(ti, bio)))
>>  		up(&md->swap_bios_semaphore);
> 
> This patch seems completely unrelated to the series topic. Is that a bug
> fix ? How do you trigger it ? Our tests do not show any issues here...
> If this triggers only with non power of 2 zone size devices, then this
> should be squashed in patch 8. And patch 9 could also be squashed with
> patch 8 too.
> 
The targets that support zoned devices such as dm-zoned, dm-linear, and
dm-crypt do not have an endio function, and even if they do (such as
dm-flakey), they don't modify the bio->bi_iter.bi_sector of the cloned
bio that is used to update the orig_bio's bi_sector in dm_zone_endio
function.

This path is triggered only for the new target dm-po2zone and not for
npo2 zone size devices in general. I will mention this is a prep patch
for the new target because I wouldn't call it a bug per se.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 10/11] dm: call dm_zone_endio after the target endio callback for zoned devices
@ 2022-07-29  8:48           ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  8:48 UTC (permalink / raw)
  To: Damien Le Moal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, jaegeuk, matias.bjorling

>>  	if (endio) {
>>  		int r = endio(ti, bio, &error);
>>  		switch (r) {
>> @@ -1155,6 +1151,10 @@ static void clone_endio(struct bio *bio)
>>  		}
>>  	}
>>  
>> +	if (static_branch_unlikely(&zoned_enabled) &&
>> +	    unlikely(bdev_is_zoned(bio->bi_bdev)))
>> +		dm_zone_endio(io, bio);
>> +
>>  	if (static_branch_unlikely(&swap_bios_enabled) &&
>>  	    unlikely(swap_bios_limit(ti, bio)))
>>  		up(&md->swap_bios_semaphore);
> 
> This patch seems completely unrelated to the series topic. Is that a bug
> fix ? How do you trigger it ? Our tests do not show any issues here...
> If this triggers only with non power of 2 zone size devices, then this
> should be squashed in patch 8. And patch 9 could also be squashed with
> patch 8 too.
> 
The targets that support zoned devices such as dm-zoned, dm-linear, and
dm-crypt do not have an endio function, and even if they do (such as
dm-flakey), they don't modify the bio->bi_iter.bi_sector of the cloned
bio that is used to update the orig_bio's bi_sector in dm_zone_endio
function.

This path is triggered only for the new target dm-po2zone and not for
npo2 zone size devices in general. I will mention this is a prep patch
for the new target because I wouldn't call it a bug per se.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-07-28 13:29             ` [dm-devel] " Bart Van Assche
@ 2022-07-29  9:09               ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  9:09 UTC (permalink / raw)
  To: Bart Van Assche, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, jaegeuk, dm-devel, linux-nvme, Luis Chamberlain

On 2022-07-28 15:29, Bart Van Assche wrote:

>> But I am fine with going back to bdev_is_zone_start if you and Damien
>> feel strongly otherwise.
> The "zone start LBA" terminology occurs in ZBC-1, ZBC-2 and ZNS but
> "zone aligned" not. I prefer "zone start" because it is clear,
> unambiguous and because it has the same meaning as in the corresponding
> standards documents. I propose to proceed as follows for checking
> whether a number of LBAs is a multiple of the zone length:
> * Either use bdev_is_zone_start() directly.
> * Or introduce a synonym for bdev_is_zone_start() with an appropriate
> name, e.g. bdev_is_zone_len_multiple().
> 
Thanks for the clarification Bart. I will go with bdev_is_zone_start()
as it is also a commonly used terminology.
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
@ 2022-07-29  9:09               ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-07-29  9:09 UTC (permalink / raw)
  To: Bart Van Assche, damien.lemoal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: pankydev8, gost.dev, linux-kernel, linux-nvme, linux-block,
	dm-devel, jaegeuk, matias.bjorling, Luis Chamberlain

On 2022-07-28 15:29, Bart Van Assche wrote:

>> But I am fine with going back to bdev_is_zone_start if you and Damien
>> feel strongly otherwise.
> The "zone start LBA" terminology occurs in ZBC-1, ZBC-2 and ZNS but
> "zone aligned" not. I prefer "zone start" because it is clear,
> unambiguous and because it has the same meaning as in the corresponding
> standards documents. I propose to proceed as follows for checking
> whether a number of LBAs is a multiple of the zone length:
> * Either use bdev_is_zone_start() directly.
> * Or introduce a synonym for bdev_is_zone_start() with an appropriate
> name, e.g. bdev_is_zone_len_multiple().
> 
Thanks for the clarification Bart. I will go with bdev_is_zone_start()
as it is also a commonly used terminology.
> Thanks,
> 
> Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v8 11/11] dm: add power-of-2 zoned target for non-power-of-2 zoned devices
  2022-07-28  4:33         ` [dm-devel] " Damien Le Moal
@ 2022-08-01  8:35           ` Pankaj Raghav
  -1 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-08-01  8:35 UTC (permalink / raw)
  To: Damien Le Moal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: matias.bjorling, gost.dev, linux-kernel, hare, linux-block,
	pankydev8, bvanassche, jaegeuk, dm-devel, linux-nvme,
	Damien Le Moal

On 2022-07-28 06:33, Damien Le Moal wrote:
>>
>> The area between target's zone capacity and zone size will be emulated
>> in the target.
>> The read IOs that fall in the emulated gap area will return 0 filled
>> bio and all the other IOs in that area will result in an error.
>> If a read IO span across the emulated area boundary, then the IOs are
>> split across them. All other IO operations that span across the emulated
>> area boundary will result in an error.
>>
>> The target can be easily created as follows:
>> dmsetup create <label> --table '0 <size_sects> po2z /dev/nvme<id>'
> 
> 0 -> <start offset> ? Or are you allowing only entire devices ?
> 
Yeah, partially mapping is not allowed.
>>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Suggested-by: Hannes Reinecke <hare@suse.de>
>> ---
>>  drivers/md/Kconfig            |   9 ++
>>  drivers/md/Makefile           |   2 +
>>  drivers/md/dm-po2z-target.c   | 261 ++++++++++++++++++++++++++++++++++
>>  drivers/md/dm-table.c         |  13 +-
>>  drivers/md/dm-zone.c          |   1 +
>>  include/linux/device-mapper.h |   9 ++
>>  6 files changed, 292 insertions(+), 3 deletions(-)
>>  create mode 100644 drivers/md/dm-po2z-target.c
>>
>> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
>> index 998a5cfdbc4e..d58ccfee765b 100644
>> --- a/drivers/md/Kconfig
>> +++ b/drivers/md/Kconfig
>> @@ -518,6 +518,15 @@ config DM_FLAKEY
>>  	help
>>  	 A target that intermittently fails I/O for debugging purposes.
>>  
>> +config DM_PO2Z
> 
> Maybe DM_PO2_ZONE ?
> 
I prefer this.

> or DM_ZONE_PO2SIZE ?
> 
> To be clearer...
> 
>> +
>> +struct dm_po2z_target {
>> +	struct dm_dev *dev;
>> +	sector_t zone_size; /* Actual zone size of the underlying dev*/
>> +	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
>> +	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
>> +	u32 nr_zones;
> 
> s/u32/unsigned int. This is not a hw driver.
> 
Ok.
>> +}
>> +
>> +/*
>> + * This target works on the complete zoned device. Partial mapping is not
>> + * supported.
>> + * Construct a zoned po2 logical device: <dev-path>
>> + */
>> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>> +{
>> +	struct dm_po2z_target *dmh = NULL;
>> +	int ret;
>> +	sector_t zone_size;
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (is_power_of_2(zone_size)) {
>> +		DMERR("%pg: this target is not useful for power-of-2 zoned devices",
>> +		      dmh->dev->bdev);
>> +		return -EINVAL;
> 
> Same here.
> 
> And as suggested before, we could simply warn here but let the target be
> created. If used with a power of 2 zone size device, it should still work.
> 
I will do this in the next revision. I will change DMERR to DMWARN with
no return err.
>> +	}
>> +
>> +	dmh->zone_size = zone_size;
>> +	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
>> +	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
>> +	ti->private = dmh;
>> +	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
>> +	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
>> +
>> +	return 0;
>> +}
>> +
>> +
>> +static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
>> +			  blk_status_t *error)
>> +{
>> +	struct dm_po2z_target *dmh = ti->private;
>> +
>> +	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
>> +		bio->bi_iter.bi_sector =
>> +			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
>> +
>> +	return DM_ENDIO_DONE;
>> +}
>> +
>> +static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
>> +{
>> +	struct dm_po2z_target *dmh = ti->private;
>> +
>> +	limits->chunk_sectors = dmh->zone_size_po2;
>> +}
>> +
>> +static bool bio_across_emulated_zone_area(struct dm_po2z_target *dmh,
>> +					  struct bio *bio)
>> +{
>> +	u32 zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
>> +	sector_t size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> 
> Rename that to nr_sectors to be clear about the unit.
> 
>> +
>> +	return (bio->bi_iter.bi_sector - (zone_idx * dmh->zone_size_po2) +
>> +		size) > dmh->zone_size;
> 
> This is hard to read and actually seems very wrong. Shouldn't this be simply:
> 
> 	return bio->bi_iter.bi_sector + nr_sectors >
> 		zone_idx * dmh->zone_size_po2 + dmh->zone_size;
> 
> Thatis, check that the BIO goes beyond the zone capacity of the target.
> ?
> 
Yeah, this reads better. I will change it.
>> +}
>> +
>> +static int dm_po2z_map_read(struct dm_po2z_target *dmh, struct bio *bio)
>> +{
>> +	sector_t start_sect, size, relative_sect_in_zone, split_io_pos;
>> +	u32 zone_idx;
>> +
>> +	/*
>> +	 * Read does not extend into the emulated zone area (area between
>> +	 * zone capacity and zone size)
>> +	 */
>> +	if (!bio_across_emulated_zone_area(dmh, bio)) {
> 
> hu... why not use dm_accept_partial_bio() in dm_po2z_map() to never have
> to deal with this here ? That is, dm_po2z_map_read() is called for any
> read BIO fully within the zone capacity. And create a
> dm_po2z_map_read_zero() helper for any read BIO that is after the zone
> capacity. Way simpler I think.
> 
>> +		bio->bi_iter.bi_sector =
>> +			target_to_device_sect(dmh, bio->bi_iter.bi_sector);
>> +		return DM_MAPIO_REMAPPED;
>> +	}
>> +
>> +	start_sect = bio->bi_iter.bi_sector;
>> +	size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
>> +	zone_idx = po2_zone_no(dmh, start_sect);
>> +	relative_sect_in_zone = start_sect - (zone_idx * dmh->zone_size_po2);
>> +
>> +	/*
>> +	 * If the starting sector is in the emulated area then fill
>> +	 * all the bio with zeros. If bio is across emulated zone boundary,
>> +	 * split the bio across boundaries and fill zeros only for the
>> +	 * bio that is between the zone capacity and the zone size.
>> +	 */
>> +	if (relative_sect_in_zone < dmh->zone_size &&
>> +	    ((relative_sect_in_zone + size) > dmh->zone_size)) {
>> +		split_io_pos = (zone_idx * dmh->zone_size_po2) + dmh->zone_size;
>> +		dm_accept_partial_bio(bio, split_io_pos - start_sect);
> 
> See above. Do that in dm_po2z_map(). This will simplify this function.
> 
>> +		bio->bi_iter.bi_sector = target_to_device_sect(dmh, start_sect);
>> +
>> +		return DM_MAPIO_REMAPPED;
>> +	} else if (relative_sect_in_zone >= dmh->zone_size &&
>> +		   ((relative_sect_in_zone + size) > dmh->zone_size_po2)) {
> 
> No need for the else after return. And this can NEVER happen since BIOs
> can never cross zone boundaries.
> 
This did not happen and that is why this else if condition was
introduced. But I had missed assigning the t->max_io_len in the ctr
which led to bio crossing the zone boundary.

With t->max_io_len set to zone_nr_sectors, we don't need the else if
condition and we can simplify the map_read.
Thanks for pointing it out.
>> +		split_io_pos = (zone_idx + 1) * dmh->zone_size_po2;
>> +		dm_accept_partial_bio(bio, split_io_pos - start_sect);
>> +	}
>> +
>> +	zero_fill_bio(bio);
>> +	bio_endio(bio);
>> +	return DM_MAPIO_SUBMITTED;
>> +}
>> +
>> +static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
>> +{
>> +	struct dm_po2z_target *dmh = ti->private;
>> +
>> +	bio_set_dev(bio, dmh->dev->bdev);
>> +	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
>> +		/*
>> +		 * Read operation on the emulated zone area (between zone capacity
>> +		 * and zone size) will fill the bio with zeroes. Any other operation
>> +		 * in the emulated area should return an error.
>> +		 */
>> +		if (bio_op(bio) == REQ_OP_READ)
>> +			return dm_po2z_map_read(dmh, bio);
>> +
>> +		if (!bio_across_emulated_zone_area(dmh, bio)) {
>> +			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
>> +								       bio->bi_iter.bi_sector);
>> +			return DM_MAPIO_REMAPPED;
>> +		}
>> +		return DM_MAPIO_KILL;
>> +	}
>> +	return DM_MAPIO_REMAPPED;
>> +}
>> +
>> +static int dm_po2z_iterate_devices(struct dm_target *ti,
>> +				   iterate_devices_callout_fn fn, void *data)
>> +{
>> +	struct dm_po2z_target *dmh = ti->private;
>> +	sector_t len = dmh->nr_zones * dmh->zone_size;
>> +
>> +	return fn(ti, dmh->dev, 0, len, data);
>> +}
>> +
>> +static struct target_type dm_po2z_target = {
>> +	.name = "po2z",
>> +	.version = { 1, 0, 0 },
>> +	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
>> +	.map = dm_po2z_map,
>> +	.end_io = dm_po2z_end_io,
>> +	.report_zones = dm_po2z_report_zones,
>> +	.iterate_devices = dm_po2z_iterate_devices,
>> +	.module = THIS_MODULE,
>> +	.io_hints = dm_po2z_io_hints,
>> +	.ctr = dm_po2z_ctr,
>> +};
>> +
>> +static int __init dm_po2z_init(void)
>> +{
>> +	int r = dm_register_target(&dm_po2z_target);
>> +
>> +	if (r < 0)
>> +		DMERR("register failed %d", r);
>> +
>> +	return r;
> 
> Simplify:
> 
> 	return dm_register_target(&dm_po2z_target);
> 
>> +}
>> +
>> +static void __exit dm_po2z_exit(void)
>> +{
>> +	dm_unregister_target(&dm_po2z_target);
>> +}
>> +
>> +/* Module hooks */
>> +module_init(dm_po2z_init);
>> +module_exit(dm_po2z_exit);
>> +
>> +MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
>> +MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
>> +MODULE_LICENSE("GPL");
>> +
> 
> the changes below should be a different prep patch.
> 
Ok. `dm:introduce DM_EMULATED_ZONE target type` before introducing this
new target.

>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 534fddfc2b42..d77feff124af 100644
>> --- a/drivers/md/dm-table.c
>> +++ b/drivers/md/dm-table.c
>> @@ -1614,13 +1614,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
>>  	return true;
>>  }
>>  
>> -static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
>> +/*
>> + * Callback function to check for device zone sector across devices. If the
>> + * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
>> + * should have the same zone sector as the underlying devices.
>> + */
>> +static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
>>  					   sector_t start, sector_t len, void *data)
>>  {
>>  	unsigned int *zone_sectors = data;
>>  
>> -	if (!bdev_is_zoned(dev->bdev))
>> +	if (!bdev_is_zoned(dev->bdev) ||
>> +	    dm_target_supports_emulated_zones(ti->type))
>>  		return 0;
>> +
>>  	return bdev_zone_sectors(dev->bdev) != *zone_sectors;
>>  }
>>  
>> @@ -1646,7 +1653,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
>>  	if (!zone_sectors)
>>  		return -EINVAL;
>>  
>> -	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
>> +	if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
>>  		DMERR("%s: zone sectors is not consistent across all zoned devices",
>>  		      dm_device_name(t->md));
>>  		return -EINVAL;
>> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
>> index 31c16aafdbfc..2b6b3883471f 100644
>> --- a/drivers/md/dm-zone.c
>> +++ b/drivers/md/dm-zone.c
>> @@ -302,6 +302,7 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
>>  	if (dm_table_supports_zone_append(t)) {
>>  		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
>>  		dm_cleanup_zoned_dev(md);
>> +
>>  		return 0;
>>  	}
>>  
>> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
>> index 920085dd7f3b..7dbd28b8de01 100644
>> --- a/include/linux/device-mapper.h
>> +++ b/include/linux/device-mapper.h
>> @@ -294,6 +294,15 @@ struct target_type {
>>  #define dm_target_supports_mixed_zoned_model(type) (false)
>>  #endif
>>  
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> +#define DM_TARGET_EMULATED_ZONES	0x00000400
>> +#define dm_target_supports_emulated_zones(type) \
>> +	((type)->features & DM_TARGET_EMULATED_ZONES)
>> +#else
>> +#define DM_TARGET_EMULATED_ZONES	0x00000000
>> +#define dm_target_supports_emulated_zones(type) (false)
>> +#endif
>> +
>>  struct dm_target {
>>  	struct dm_table *table;
>>  	struct target_type *type;
> 
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [dm-devel] [PATCH v8 11/11] dm: add power-of-2 zoned target for non-power-of-2 zoned devices
@ 2022-08-01  8:35           ` Pankaj Raghav
  0 siblings, 0 replies; 80+ messages in thread
From: Pankaj Raghav @ 2022-08-01  8:35 UTC (permalink / raw)
  To: Damien Le Moal, hch, axboe, snitzer, Johannes.Thumshirn
  Cc: Damien Le Moal, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, jaegeuk, matias.bjorling

On 2022-07-28 06:33, Damien Le Moal wrote:
>>
>> The area between target's zone capacity and zone size will be emulated
>> in the target.
>> The read IOs that fall in the emulated gap area will return 0 filled
>> bio and all the other IOs in that area will result in an error.
>> If a read IO span across the emulated area boundary, then the IOs are
>> split across them. All other IO operations that span across the emulated
>> area boundary will result in an error.
>>
>> The target can be easily created as follows:
>> dmsetup create <label> --table '0 <size_sects> po2z /dev/nvme<id>'
> 
> 0 -> <start offset> ? Or are you allowing only entire devices ?
> 
Yeah, partially mapping is not allowed.
>>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Suggested-by: Hannes Reinecke <hare@suse.de>
>> ---
>>  drivers/md/Kconfig            |   9 ++
>>  drivers/md/Makefile           |   2 +
>>  drivers/md/dm-po2z-target.c   | 261 ++++++++++++++++++++++++++++++++++
>>  drivers/md/dm-table.c         |  13 +-
>>  drivers/md/dm-zone.c          |   1 +
>>  include/linux/device-mapper.h |   9 ++
>>  6 files changed, 292 insertions(+), 3 deletions(-)
>>  create mode 100644 drivers/md/dm-po2z-target.c
>>
>> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
>> index 998a5cfdbc4e..d58ccfee765b 100644
>> --- a/drivers/md/Kconfig
>> +++ b/drivers/md/Kconfig
>> @@ -518,6 +518,15 @@ config DM_FLAKEY
>>  	help
>>  	 A target that intermittently fails I/O for debugging purposes.
>>  
>> +config DM_PO2Z
> 
> Maybe DM_PO2_ZONE ?
> 
I prefer this.

> or DM_ZONE_PO2SIZE ?
> 
> To be clearer...
> 
>> +
>> +struct dm_po2z_target {
>> +	struct dm_dev *dev;
>> +	sector_t zone_size; /* Actual zone size of the underlying dev*/
>> +	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
>> +	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
>> +	u32 nr_zones;
> 
> s/u32/unsigned int. This is not a hw driver.
> 
Ok.
>> +}
>> +
>> +/*
>> + * This target works on the complete zoned device. Partial mapping is not
>> + * supported.
>> + * Construct a zoned po2 logical device: <dev-path>
>> + */
>> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>> +{
>> +	struct dm_po2z_target *dmh = NULL;
>> +	int ret;
>> +	sector_t zone_size;
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (is_power_of_2(zone_size)) {
>> +		DMERR("%pg: this target is not useful for power-of-2 zoned devices",
>> +		      dmh->dev->bdev);
>> +		return -EINVAL;
> 
> Same here.
> 
> And as suggested before, we could simply warn here but let the target be
> created. If used with a power of 2 zone size device, it should still work.
> 
I will do this in the next revision. I will change DMERR to DMWARN with
no return err.
>> +	}
>> +
>> +	dmh->zone_size = zone_size;
>> +	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
>> +	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
>> +	ti->private = dmh;
>> +	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
>> +	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
>> +
>> +	return 0;
>> +}
>> +
>> +
>> +static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
>> +			  blk_status_t *error)
>> +{
>> +	struct dm_po2z_target *dmh = ti->private;
>> +
>> +	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
>> +		bio->bi_iter.bi_sector =
>> +			device_to_target_sect(dmh, bio->bi_iter.bi_sector);
>> +
>> +	return DM_ENDIO_DONE;
>> +}
>> +
>> +static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
>> +{
>> +	struct dm_po2z_target *dmh = ti->private;
>> +
>> +	limits->chunk_sectors = dmh->zone_size_po2;
>> +}
>> +
>> +static bool bio_across_emulated_zone_area(struct dm_po2z_target *dmh,
>> +					  struct bio *bio)
>> +{
>> +	u32 zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
>> +	sector_t size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
> 
> Rename that to nr_sectors to be clear about the unit.
> 
>> +
>> +	return (bio->bi_iter.bi_sector - (zone_idx * dmh->zone_size_po2) +
>> +		size) > dmh->zone_size;
> 
> This is hard to read and actually seems very wrong. Shouldn't this be simply:
> 
> 	return bio->bi_iter.bi_sector + nr_sectors >
> 		zone_idx * dmh->zone_size_po2 + dmh->zone_size;
> 
> Thatis, check that the BIO goes beyond the zone capacity of the target.
> ?
> 
Yeah, this reads better. I will change it.
>> +}
>> +
>> +static int dm_po2z_map_read(struct dm_po2z_target *dmh, struct bio *bio)
>> +{
>> +	sector_t start_sect, size, relative_sect_in_zone, split_io_pos;
>> +	u32 zone_idx;
>> +
>> +	/*
>> +	 * Read does not extend into the emulated zone area (area between
>> +	 * zone capacity and zone size)
>> +	 */
>> +	if (!bio_across_emulated_zone_area(dmh, bio)) {
> 
> hu... why not use dm_accept_partial_bio() in dm_po2z_map() to never have
> to deal with this here ? That is, dm_po2z_map_read() is called for any
> read BIO fully within the zone capacity. And create a
> dm_po2z_map_read_zero() helper for any read BIO that is after the zone
> capacity. Way simpler I think.
> 
>> +		bio->bi_iter.bi_sector =
>> +			target_to_device_sect(dmh, bio->bi_iter.bi_sector);
>> +		return DM_MAPIO_REMAPPED;
>> +	}
>> +
>> +	start_sect = bio->bi_iter.bi_sector;
>> +	size = bio->bi_iter.bi_size >> SECTOR_SHIFT;
>> +	zone_idx = po2_zone_no(dmh, start_sect);
>> +	relative_sect_in_zone = start_sect - (zone_idx * dmh->zone_size_po2);
>> +
>> +	/*
>> +	 * If the starting sector is in the emulated area then fill
>> +	 * all the bio with zeros. If bio is across emulated zone boundary,
>> +	 * split the bio across boundaries and fill zeros only for the
>> +	 * bio that is between the zone capacity and the zone size.
>> +	 */
>> +	if (relative_sect_in_zone < dmh->zone_size &&
>> +	    ((relative_sect_in_zone + size) > dmh->zone_size)) {
>> +		split_io_pos = (zone_idx * dmh->zone_size_po2) + dmh->zone_size;
>> +		dm_accept_partial_bio(bio, split_io_pos - start_sect);
> 
> See above. Do that in dm_po2z_map(). This will simplify this function.
> 
>> +		bio->bi_iter.bi_sector = target_to_device_sect(dmh, start_sect);
>> +
>> +		return DM_MAPIO_REMAPPED;
>> +	} else if (relative_sect_in_zone >= dmh->zone_size &&
>> +		   ((relative_sect_in_zone + size) > dmh->zone_size_po2)) {
> 
> No need for the else after return. And this can NEVER happen since BIOs
> can never cross zone boundaries.
> 
This did not happen and that is why this else if condition was
introduced. But I had missed assigning the t->max_io_len in the ctr
which led to bio crossing the zone boundary.

With t->max_io_len set to zone_nr_sectors, we don't need the else if
condition and we can simplify the map_read.
Thanks for pointing it out.
>> +		split_io_pos = (zone_idx + 1) * dmh->zone_size_po2;
>> +		dm_accept_partial_bio(bio, split_io_pos - start_sect);
>> +	}
>> +
>> +	zero_fill_bio(bio);
>> +	bio_endio(bio);
>> +	return DM_MAPIO_SUBMITTED;
>> +}
>> +
>> +static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
>> +{
>> +	struct dm_po2z_target *dmh = ti->private;
>> +
>> +	bio_set_dev(bio, dmh->dev->bdev);
>> +	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
>> +		/*
>> +		 * Read operation on the emulated zone area (between zone capacity
>> +		 * and zone size) will fill the bio with zeroes. Any other operation
>> +		 * in the emulated area should return an error.
>> +		 */
>> +		if (bio_op(bio) == REQ_OP_READ)
>> +			return dm_po2z_map_read(dmh, bio);
>> +
>> +		if (!bio_across_emulated_zone_area(dmh, bio)) {
>> +			bio->bi_iter.bi_sector = target_to_device_sect(dmh,
>> +								       bio->bi_iter.bi_sector);
>> +			return DM_MAPIO_REMAPPED;
>> +		}
>> +		return DM_MAPIO_KILL;
>> +	}
>> +	return DM_MAPIO_REMAPPED;
>> +}
>> +
>> +static int dm_po2z_iterate_devices(struct dm_target *ti,
>> +				   iterate_devices_callout_fn fn, void *data)
>> +{
>> +	struct dm_po2z_target *dmh = ti->private;
>> +	sector_t len = dmh->nr_zones * dmh->zone_size;
>> +
>> +	return fn(ti, dmh->dev, 0, len, data);
>> +}
>> +
>> +static struct target_type dm_po2z_target = {
>> +	.name = "po2z",
>> +	.version = { 1, 0, 0 },
>> +	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES,
>> +	.map = dm_po2z_map,
>> +	.end_io = dm_po2z_end_io,
>> +	.report_zones = dm_po2z_report_zones,
>> +	.iterate_devices = dm_po2z_iterate_devices,
>> +	.module = THIS_MODULE,
>> +	.io_hints = dm_po2z_io_hints,
>> +	.ctr = dm_po2z_ctr,
>> +};
>> +
>> +static int __init dm_po2z_init(void)
>> +{
>> +	int r = dm_register_target(&dm_po2z_target);
>> +
>> +	if (r < 0)
>> +		DMERR("register failed %d", r);
>> +
>> +	return r;
> 
> Simplify:
> 
> 	return dm_register_target(&dm_po2z_target);
> 
>> +}
>> +
>> +static void __exit dm_po2z_exit(void)
>> +{
>> +	dm_unregister_target(&dm_po2z_target);
>> +}
>> +
>> +/* Module hooks */
>> +module_init(dm_po2z_init);
>> +module_exit(dm_po2z_exit);
>> +
>> +MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
>> +MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
>> +MODULE_LICENSE("GPL");
>> +
> 
> the changes below should be a different prep patch.
> 
Ok. `dm:introduce DM_EMULATED_ZONE target type` before introducing this
new target.

>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 534fddfc2b42..d77feff124af 100644
>> --- a/drivers/md/dm-table.c
>> +++ b/drivers/md/dm-table.c
>> @@ -1614,13 +1614,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
>>  	return true;
>>  }
>>  
>> -static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
>> +/*
>> + * Callback function to check for device zone sector across devices. If the
>> + * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
>> + * should have the same zone sector as the underlying devices.
>> + */
>> +static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
>>  					   sector_t start, sector_t len, void *data)
>>  {
>>  	unsigned int *zone_sectors = data;
>>  
>> -	if (!bdev_is_zoned(dev->bdev))
>> +	if (!bdev_is_zoned(dev->bdev) ||
>> +	    dm_target_supports_emulated_zones(ti->type))
>>  		return 0;
>> +
>>  	return bdev_zone_sectors(dev->bdev) != *zone_sectors;
>>  }
>>  
>> @@ -1646,7 +1653,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
>>  	if (!zone_sectors)
>>  		return -EINVAL;
>>  
>> -	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
>> +	if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
>>  		DMERR("%s: zone sectors is not consistent across all zoned devices",
>>  		      dm_device_name(t->md));
>>  		return -EINVAL;
>> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
>> index 31c16aafdbfc..2b6b3883471f 100644
>> --- a/drivers/md/dm-zone.c
>> +++ b/drivers/md/dm-zone.c
>> @@ -302,6 +302,7 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q)
>>  	if (dm_table_supports_zone_append(t)) {
>>  		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
>>  		dm_cleanup_zoned_dev(md);
>> +
>>  		return 0;
>>  	}
>>  
>> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
>> index 920085dd7f3b..7dbd28b8de01 100644
>> --- a/include/linux/device-mapper.h
>> +++ b/include/linux/device-mapper.h
>> @@ -294,6 +294,15 @@ struct target_type {
>>  #define dm_target_supports_mixed_zoned_model(type) (false)
>>  #endif
>>  
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> +#define DM_TARGET_EMULATED_ZONES	0x00000400
>> +#define dm_target_supports_emulated_zones(type) \
>> +	((type)->features & DM_TARGET_EMULATED_ZONES)
>> +#else
>> +#define DM_TARGET_EMULATED_ZONES	0x00000000
>> +#define dm_target_supports_emulated_zones(type) (false)
>> +#endif
>> +
>>  struct dm_target {
>>  	struct dm_table *table;
>>  	struct target_type *type;
> 
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2022-08-01  8:35 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20220727162246eucas1p1a758799f13d36ba99d30bf92cc5e2754@eucas1p1.samsung.com>
2022-07-27 16:22 ` [PATCH v8 00/11] support non power of 2 zoned device Pankaj Raghav
2022-07-27 16:22   ` [dm-devel] " Pankaj Raghav
     [not found]   ` <CGME20220727162247eucas1p203fc14aa17ecbcb3e6215d5304bb0c85@eucas1p2.samsung.com>
2022-07-27 16:22     ` [PATCH v8 01/11] block: make bdev_nr_zones and disk_zone_no generic for npo2 zsze Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-28  2:54       ` Damien Le Moal
2022-07-28  2:54         ` [dm-devel] " Damien Le Moal
     [not found]   ` <CGME20220727162248eucas1p2ff8c3c2b021bedcae3960024b4e269e9@eucas1p2.samsung.com>
2022-07-27 16:22     ` [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-27 23:16       ` Bart Van Assche
2022-07-27 23:16         ` [dm-devel] " Bart Van Assche
2022-07-28 12:11         ` Pankaj Raghav
2022-07-28 12:11           ` [dm-devel] " Pankaj Raghav
2022-07-28 13:29           ` Bart Van Assche
2022-07-28 13:29             ` [dm-devel] " Bart Van Assche
2022-07-29  9:09             ` Pankaj Raghav
2022-07-29  9:09               ` [dm-devel] " Pankaj Raghav
2022-07-28  3:07       ` Damien Le Moal
2022-07-28  3:07         ` [dm-devel] " Damien Le Moal
2022-07-28 12:24         ` Pankaj Raghav
2022-07-28 12:24           ` [dm-devel] " Pankaj Raghav
     [not found]   ` <CGME20220727162249eucas1p28fa44c840e590f6f1b53e0cc12ee3771@eucas1p2.samsung.com>
2022-07-27 16:22     ` [PATCH v8 03/11] nvme: zns: Allow ZNS drives that have non-power_of_2 " Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
     [not found]   ` <CGME20220727162250eucas1p133e8a814fee934f7161866122ef93273@eucas1p1.samsung.com>
2022-07-27 16:22     ` [PATCH v8 04/11] nvmet: Allow ZNS target to support non-power_of_2 zone sizes Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-28  3:09       ` Damien Le Moal
2022-07-28  3:09         ` [dm-devel] " Damien Le Moal
2022-07-29  7:45         ` Pankaj Raghav
2022-07-29  7:45           ` Pankaj Raghav
     [not found]   ` <CGME20220727162251eucas1p12939ac3864fd8705ae139eb2d1087d8f@eucas1p1.samsung.com>
2022-07-27 16:22     ` [PATCH v8 05/11] null_blk: allow non power of 2 zoned devices Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-27 21:59       ` Chaitanya Kulkarni
2022-07-27 21:59         ` [dm-devel] " Chaitanya Kulkarni
2022-07-29  7:51         ` Pankaj Raghav
2022-07-29  7:51           ` [dm-devel] " Pankaj Raghav
2022-07-28  3:14       ` Damien Le Moal
2022-07-28  3:14         ` [dm-devel] " Damien Le Moal
     [not found]   ` <CGME20220727162252eucas1p25be8b79231334fa0c759c2475859e93b@eucas1p2.samsung.com>
2022-07-27 16:22     ` [PATCH v8 06/11] zonefs: " Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-28  3:16       ` Damien Le Moal
2022-07-28  3:16         ` [dm-devel] " Damien Le Moal
     [not found]   ` <CGME20220727162253eucas1p1a5912e0494f6918504cc8ff15ad5d31f@eucas1p1.samsung.com>
2022-07-27 16:22     ` [PATCH v8 07/11] dm-zoned: ensure only power of 2 zone sizes are allowed Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-28  3:18       ` Damien Le Moal
2022-07-28  3:18         ` [dm-devel] " Damien Le Moal
2022-07-28 12:15       ` David Sterba
2022-07-28 12:15         ` [dm-devel] " David Sterba
2022-07-29  8:00         ` Pankaj Raghav
2022-07-29  8:00           ` [dm-devel] " Pankaj Raghav
     [not found]   ` <CGME20220727162254eucas1p1fd990f746d9f9870b8d58ee0bd01fedd@eucas1p1.samsung.com>
2022-07-27 16:22     ` [PATCH v8 08/11] dm-zone: use generic helpers to calculate offset from zone start Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-28  3:20       ` Damien Le Moal
2022-07-28  3:20         ` [dm-devel] " Damien Le Moal
     [not found]   ` <CGME20220727162255eucas1p2945c6dca42b799bb3b4abf3edb83dde8@eucas1p2.samsung.com>
2022-07-27 16:22     ` [PATCH v8 09/11] dm-table: allow non po2 zoned devices Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-28  3:22       ` Damien Le Moal
2022-07-28  3:22         ` [dm-devel] " Damien Le Moal
     [not found]   ` <CGME20220727162256eucas1p284a15532173cce3eca46eee0cee3acdd@eucas1p2.samsung.com>
2022-07-27 16:22     ` [PATCH v8 10/11] dm: call dm_zone_endio after the target endio callback for " Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-28  3:25       ` Damien Le Moal
2022-07-28  3:25         ` [dm-devel] " Damien Le Moal
2022-07-29  8:48         ` Pankaj Raghav
2022-07-29  8:48           ` [dm-devel] " Pankaj Raghav
     [not found]   ` <CGME20220727162257eucas1p2848a75c4aa7e559abb5d9ae0fbd374c1@eucas1p2.samsung.com>
2022-07-27 16:22     ` [PATCH v8 11/11] dm: add power-of-2 zoned target for non-power-of-2 " Pankaj Raghav
2022-07-27 16:22       ` [dm-devel] " Pankaj Raghav
2022-07-28  4:33       ` Damien Le Moal
2022-07-28  4:33         ` [dm-devel] " Damien Le Moal
2022-08-01  8:35         ` Pankaj Raghav
2022-08-01  8:35           ` [dm-devel] " Pankaj Raghav
2022-07-27 23:19   ` [PATCH v8 00/11] support non power of 2 zoned device Bart Van Assche
2022-07-27 23:19     ` [dm-devel] " Bart Van Assche
2022-07-28  1:52     ` Damien Le Moal
2022-07-28  1:52       ` [dm-devel] " Damien Le Moal
2022-07-28  2:58       ` Bart Van Assche
2022-07-28  2:58         ` [dm-devel] " Bart Van Assche
2022-07-28  3:28         ` Damien Le Moal
2022-07-28  3:28           ` [dm-devel] " Damien Le Moal
2022-07-28 11:57     ` Pankaj Raghav
2022-07-28 11:57       ` [dm-devel] " Pankaj Raghav
2022-07-28 13:30       ` Bart Van Assche
2022-07-28 13:30         ` [dm-devel] " Bart Van Assche

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.