linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes
       [not found] <CGME20220920091120eucas1p2c82c18f552d6298d24547cba2f70b7fc@eucas1p2.samsung.com>
@ 2022-09-20  9:11 ` Pankaj Raghav
       [not found]   ` <CGME20220920091121eucas1p26eed14714ff34e2489bc9adb40fd1250@eucas1p2.samsung.com>
                     ` (13 more replies)
  0 siblings, 14 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2").

NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, zone capacity(cap) is not the
same as the po2 zone size. When the zone cap != zone size, then unmapped
LBAs are introduced to cover the space between the zone cap and zone size.
po2 requirement does not make sense for these type of zone storage devices.
This patch series aims to remove these unmapped LBAs for zoned devices when
zone cap is npo2. This is done by relaxing the po2 zone size constraint
in the kernel and allowing zoned device with npo2 zone sizes if zone cap
== zone size.

Removing the po2 requirement from zone storage should be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [1].

There were two efforts previously to add support to npo2 devices: 1) via
device level emulation [2] but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[3] 2)
adding support to the complete stack by removing the constraint in the
block layer and NVMe layer with support to btrfs, zonefs, etc which was
rejected with a conclusion to add a dm target for FS support [0]
to reduce the regression impact.

This series adds support to npo2 zoned devices in the block and nvme
layer and a new **dm target** is added: dm-po2zoned-target. This new
target will be initially used for filesystems such as btrfs and
f2fs until native npo2 zone support is added.

- Patchset description:
Patches 1-3 deals with removing the po2 constraint from the
block layer.

Patches 4-5 deals with removing the constraint from nvme zns.

Patch 5 removes the po2 contraint in null blk

Patch 6 adds npo2 support to zonefs

Patches 7-13 adds support for npo2 zoned devices in the DM layer and
adds a new target dm-po2zoned-target which converts a zoned device with
npo2 zone size into a zoned target with po2 zone size.

The patch series is based on linux-next tag: next-20220919

Testing:
The new target was tested with blktest and zonefs test suite in qemu and
on a real ZNS device with npo2 zone size.

Performance Measurement on a null blk:
Device:
zone size = 128M, blocksize=4k

FIO cmd:
fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
x-----------------x---------------------------------x---------------------------------x

Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed

[0] https://lore.kernel.org/lkml/PH0PR04MB74166C87F694B150A5AE0F009BD09@PH0PR04MB7416.namprd04.prod.outlook.com/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[2] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[3] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
  (Damien)
- q is never null in from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
  drives (Damien)
- Rounddown operation should be made generic to work in 32 bits arch
  (bart)
- Add comments where generic calculation is directly used instead having
  special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs to be 1M
  instead of BTRFS_STRIPE_LEN(David)

Changes since v2:
- Minor formatting changes

Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
  (David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size

Changes since v4:
- Remove direct filesystems support for npo2 devices (Johannes, Hannes,
  Damien)

Changes since v5:
- Use DIV_ROUND_UP* helper instead of round_up as it breaks 32bit arch
  build in null blk(kernel-test-robot, Nathan)
- Use DIV_ROUND_UP_SECTOR_T also in blkdev_nr_zones function instead of
  open coding it with div64_u64
- Added extra condition in dm-zoned and in dm to reject non power of 2
  zone sizes.

Changes since v6:
- Added a new dm target for non power of 2 devices
- Added support for non power of 2 devices in the DM layer.

Changes since v7:
- Improved dm target for non power of 2 zoned devices with some bug
  fixes and rearrangement
- Removed some unnecessary comments.

Changes since v8:
- Rename dm-po2z to dm-po2zone
- set max_io_len for the target to po2 zone size sector
- Simplify dm-po2zone target by removing some superfluous conditions
- Added documentation for the new dm-po2zone target
- Change pr_warn to pr_err for critical errors
- Split patch 2 and 11 with their corresponding prep patches
- Minor spelling and grammatical improvements

Changes since v9:
- Add a check for a zoned device in dm-po2zone ctr.
- Rephrased some commit messages and documentation for clarity

Changes since v10:
- Simplified dm_poz_map function (Damien)

Changes since v11:
- Rename bio_in_emulated_zone_area and some formatting adjustments
  (Damien)

Changes since v12:
- Changed the name from dm-po2zone to dm-po2zoned to have a common
  naming convention for zoned devices(Mike)
- Return directly from the dm_po2z_map function instead of having
  returns from different functions (Mike)
- Change target type to target feature flag in commit header (Mike)
- Added dm_po2z_status function and NOWAIT flag to the target
- Added some extra information to the target's documentation.

Changes since v13:
- Use goto for cleanup in dm-po2zoned target (Mike)
- Added dtr to dm-po2zoned target
- Expose zone capacity instead of po2 zone size for
  DMSTATUS_TYPE_INFO(Mike)

Luis Chamberlain (1):
  dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
  block: make bdev_nr_zones and disk_zone_no generic for npo2 zone size
  block: rearrange bdev_{is_zoned,zone_sectors,get_queue} helper in
    blkdev.h
  block: allow blk-zoned devices to have non-power-of-2 zone size
  nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  null_blk: allow zoned devices with non power-of-2 zone sizes
  zonefs: allow non power of 2 zoned devices
  dm-zone: use generic helpers to calculate offset from zone start
  dm-table: allow zoned devices with non power-of-2 zone sizes
  dm: call dm_zone_endio after the target endio callback for zoned
    devices
  dm: introduce DM_EMULATED_ZONES target feature flag
  dm: add power-of-2 target for zoned devices with non power-of-2 zone
    sizes

 .../admin-guide/device-mapper/dm-po2zoned.rst |  79 +++++
 .../admin-guide/device-mapper/index.rst       |   1 +
 block/blk-core.c                              |   2 +-
 block/blk-zoned.c                             |  37 ++-
 drivers/block/null_blk/main.c                 |   5 +-
 drivers/block/null_blk/null_blk.h             |   1 +
 drivers/block/null_blk/zoned.c                |  18 +-
 drivers/md/Kconfig                            |  10 +
 drivers/md/Makefile                           |   2 +
 drivers/md/dm-po2zoned-target.c               | 291 ++++++++++++++++++
 drivers/md/dm-table.c                         |  20 +-
 drivers/md/dm-zone.c                          |   8 +-
 drivers/md/dm-zoned-target.c                  |   8 +
 drivers/md/dm.c                               |   8 +-
 drivers/nvme/host/zns.c                       |  14 +-
 drivers/nvme/target/zns.c                     |   3 +-
 fs/zonefs/super.c                             |   6 +-
 fs/zonefs/zonefs.h                            |   1 -
 include/linux/blkdev.h                        |  80 +++--
 include/linux/device-mapper.h                 |   9 +
 20 files changed, 528 insertions(+), 75 deletions(-)
 create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
 create mode 100644 drivers/md/dm-po2zoned-target.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v14 01/13] block: make bdev_nr_zones and disk_zone_no generic for npo2 zone size
       [not found]   ` <CGME20220920091121eucas1p26eed14714ff34e2489bc9adb40fd1250@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Chaitanya Kulkarni, Johannes Thumshirn

Adapt bdev_nr_zones and disk_zone_no functions so that they can
also work for non-power-of-2 zone sizes.

As the existing deployments assume that a device zone size is a power of
2 number of sectors, power-of-2 optimized calculation is used for those
devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-zoned.c      | 13 +++++++++----
 include/linux/blkdev.h |  8 +++++++-
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index a264621d4905..dce9c95b4bcd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -111,17 +111,22 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
  * bdev_nr_zones - Get number of zones
  * @bdev:	Target device
  *
- * Return the total number of zones of a zoned block device.  For a block
- * device without zone capabilities, the number of zones is always 0.
+ * Return the total number of zones of a zoned block device, including the
+ * eventual small last zone if present. For a block device without zone
+ * capabilities, the number of zones is always 0.
  */
 unsigned int bdev_nr_zones(struct block_device *bdev)
 {
 	sector_t zone_sectors = bdev_zone_sectors(bdev);
+	sector_t capacity = bdev_nr_sectors(bdev);
 
 	if (!bdev_is_zoned(bdev))
 		return 0;
-	return (bdev_nr_sectors(bdev) + zone_sectors - 1) >>
-		ilog2(zone_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	return DIV_ROUND_UP_SECTOR_T(capacity, zone_sectors);
 }
 EXPORT_SYMBOL_GPL(bdev_nr_zones);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8038c5fbde40..6c6bf4dd5709 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -674,9 +674,15 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
 
 static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
+	sector_t zone_sectors = disk->queue->limits.chunk_sectors;
+
 	if (!blk_queue_is_zoned(disk->queue))
 		return 0;
-	return sector >> ilog2(disk->queue->limits.chunk_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return sector >> ilog2(zone_sectors);
+
+	return div64_u64(sector, zone_sectors);
 }
 
 static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 02/13] block: rearrange bdev_{is_zoned,zone_sectors,get_queue} helper in blkdev.h
       [not found]   ` <CGME20220920091122eucas1p2934bc26b6c11bdbafa7ebd3004ce72ee@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Chaitanya Kulkarni,
	Johannes Thumshirn

Define bdev_is_zoned(), bdev_zone_sectors() and bdev_get_queue() earlier
in the blkdev.h include file. Simplify bdev_is_zoned() by removing the
superfluous NULL check for request queue while we are at it.

This commit has no functional change, and it is a prep patch for allowing
zoned devices with non-power-of-2 zone sizes in the block layer.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/blkdev.h | 43 +++++++++++++++++++-----------------------
 1 file changed, 19 insertions(+), 24 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6c6bf4dd5709..6cf43f9384cc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -635,6 +635,11 @@ static inline bool queue_is_mq(struct request_queue *q)
 	return q->mq_ops;
 }
 
+static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
+{
+	return bdev->bd_queue;	/* this is never NULL */
+}
+
 #ifdef CONFIG_PM
 static inline enum rpm_status queue_rpm_status(struct request_queue *q)
 {
@@ -666,6 +671,20 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
 	}
 }
 
+static inline bool bdev_is_zoned(struct block_device *bdev)
+{
+	return blk_queue_is_zoned(bdev_get_queue(bdev));
+}
+
+static inline sector_t bdev_zone_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!blk_queue_is_zoned(q))
+		return 0;
+	return q->limits.chunk_sectors;
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline unsigned int disk_nr_zones(struct gendisk *disk)
 {
@@ -892,11 +911,6 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags);
 int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
 			unsigned int flags);
 
-static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
-{
-	return bdev->bd_queue;	/* this is never NULL */
-}
-
 /* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */
 const char *blk_zone_cond_str(enum blk_zone_cond zone_cond);
 
@@ -1296,25 +1310,6 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 	return BLK_ZONED_NONE;
 }
 
-static inline bool bdev_is_zoned(struct block_device *bdev)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-
-	if (q)
-		return blk_queue_is_zoned(q);
-
-	return false;
-}
-
-static inline sector_t bdev_zone_sectors(struct block_device *bdev)
-{
-	struct request_queue *q = bdev_get_queue(bdev);
-
-	if (!blk_queue_is_zoned(q))
-		return 0;
-	return q->limits.chunk_sectors;
-}
-
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 03/13] block: allow blk-zoned devices to have non-power-of-2 zone size
       [not found]   ` <CGME20220920091123eucas1p26623d067c9a2d36d4f3b29b77c6d937a@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Luis Chamberlain, Johannes Thumshirn

Checking if a given sector is aligned to a zone is a common
operation that is performed for zoned devices. Add
bdev_is_zone_start helper to check for this instead of opencoding it
everywhere.

Convert the calculations on zone size to be generic instead of relying on
power-of-2(po2) based arithmetic in the block layer using the helpers
wherever possible.

The only hot path affected by this change for zoned devices with po2
zone size is in blk_check_zone_append() but bdev_is_zone_start() helper is
used to optimize the calculation for po2 zone sizes.

Finally, allow zoned devices with non po2 zone sizes provided that their
zone capacity and zone size are equal. The main motivation to allow zoned
devices with non po2 zone size is to remove the unmapped LBA between
zone capcity and zone size for devices that cannot have a po2 zone
capacity.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-core.c       |  2 +-
 block/blk-zoned.c      | 24 ++++++++++++++++++------
 include/linux/blkdev.h | 30 ++++++++++++++++++++++++++++++
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4d0dd0e9e46d..735f63b6159a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -559,7 +559,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
+	if (!bdev_is_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector) ||
 	    !bio_zone_is_seq(bio))
 		return BLK_STS_IOERR;
 
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index dce9c95b4bcd..6806c69c81dc 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -285,10 +285,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
 		return -EINVAL;
 
 	/* Check alignment (handle eventual smaller last zone) */
-	if (sector & (zone_sectors - 1))
+	if (!bdev_is_zone_start(bdev, sector))
 		return -EINVAL;
 
-	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+	if (!bdev_is_zone_start(bdev, nr_sectors) && end_sector != capacity)
 		return -EINVAL;
 
 	/*
@@ -486,14 +486,26 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	 * smaller last zone.
 	 */
 	if (zone->start == 0) {
-		if (zone->len == 0 || !is_power_of_2(zone->len)) {
-			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
-				disk->disk_name, zone->len);
+		if (zone->len == 0) {
+			pr_warn("%s: Invalid zero zone size", disk->disk_name);
+			return -ENODEV;
+		}
+
+		/*
+		 * Non power-of-2 zone size support was added to remove the
+		 * gap between zone capacity and zone size. Though it is technically
+		 * possible to have gaps in a non power-of-2 device, Linux requires
+		 * the zone size to be equal to zone capacity for non power-of-2
+		 * zoned devices.
+		 */
+		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+			pr_err("%s: Invalid zone capacity %lld with non power-of-2 zone size %lld",
+			       disk->disk_name, zone->capacity, zone->len);
 			return -ENODEV;
 		}
 
 		args->zone_sectors = zone->len;
-		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
 	} else if (zone->start + args->zone_sectors < capacity) {
 		if (zone->len != args->zone_sectors) {
 			pr_warn("%s: Invalid zoned device with non constant zone size\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6cf43f9384cc..e29799076298 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -704,6 +704,30 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 	return div64_u64(sector, zone_sectors);
 }
 
+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+						   sector_t sec)
+{
+	sector_t zone_sectors = bdev_zone_sectors(bdev);
+	u64 remainder = 0;
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	if (is_power_of_2(zone_sectors))
+		return sec & (zone_sectors - 1);
+
+	div64_u64_rem(sec, zone_sectors, &remainder);
+	return remainder;
+}
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+	if (!bdev_is_zoned(bdev))
+		return false;
+
+	return bdev_offset_from_zone_start(bdev, sec) == 0;
+}
+
 static inline bool disk_zone_is_seq(struct gendisk *disk, sector_t sector)
 {
 	if (!blk_queue_is_zoned(disk->queue))
@@ -748,6 +772,12 @@ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector)
 {
 	return 0;
 }
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+	return false;
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 04/13] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
       [not found]   ` <CGME20220920091124eucas1p1dc8fd9969ebd1839fca5a3c627a29b75@eucas1p1.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Johannes Thumshirn,
	Chaitanya Kulkarni, Luis Chamberlain

A generic bdev_zone_no() helper is added to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.

Use the helper bdev_zone_no() to calculate nr of zones. This let's us
make modifications to the math if needed in one place and adds now
support for zoned devices with non po2 zone size.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/target/zns.c | 3 +--
 include/linux/blkdev.h    | 5 +++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index 835bfda86fcf..1c5352295db1 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -254,8 +254,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
 {
 	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
 
-	return bdev_nr_zones(req->ns->bdev) -
-		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
+	return bdev_nr_zones(req->ns->bdev) - bdev_zone_no(req->ns->bdev, sect);
 }
 
 static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e29799076298..5cf34ccd3e12 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1340,6 +1340,11 @@ static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 	return BLK_ZONED_NONE;
 }
 
+static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
+{
+	return disk_zone_no(bdev->bd_disk, sec);
+}
+
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
 	return q ? q->dma_alignment : 511;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 05/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
       [not found]   ` <CGME20220920091125eucas1p2e994ea9a52fcbf50fd7242ca8949c179@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Luis Chamberlain, Chaitanya Kulkarni

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

As rounddown macro from math.h does not work for 32 bit architectures,
round down operation is open coded.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/zns.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 12316ab51bda..fe1d715d61cc 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	}
 
 	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
-	if (!is_power_of_2(ns->zsze)) {
-		dev_warn(ns->ctrl->device,
-			"invalid zone size:%llu for namespace:%u\n",
-			ns->zsze, ns->head->ns_id);
-		status = -ENODEV;
-		goto free_data;
-	}
 
 	disk_set_zoned(ns->disk, BLK_ZONED_HM);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 				   sizeof(struct nvme_zone_descriptor);
 
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+			 div64_u64(get_capacity(ns->disk), ns->zsze));
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -182,6 +175,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	int ret, zone_idx = 0;
 	unsigned int nz, i;
 	size_t buflen;
+	u64 remainder = 0;
 
 	if (ns->head->ids.csi != NVME_CSI_ZNS)
 		return -EINVAL;
@@ -197,7 +191,9 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	/* Round down the sector value to the nearest zone start */
+	div64_u64_rem(sector, ns->zsze, &remainder);
+	sector -= remainder;
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 06/13] null_blk: allow zoned devices with non power-of-2 zone sizes
       [not found]   ` <CGME20220920091126eucas1p2e2665221ea346937bff44c9fd6963928@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Luis Chamberlain, Chaitanya Kulkarni,
	Johannes Thumshirn

Convert the power-of-2(po2) based calculation with zone size to be generic
in null_zone_no with optimization for po2 zone sizes.

The nr_zones calculation in null_init_zoned_dev has been replaced with a
division without special handling for po2 zone sizes as this function is
called only during the initialization and will not be invoked in the hot
path.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c     |  5 ++---
 drivers/block/null_blk/null_blk.h |  1 +
 drivers/block/null_blk/zoned.c    | 18 +++++++++++-------
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 1f154f92f4c2..3b24125d8594 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1976,9 +1976,8 @@ static int null_validate_conf(struct nullb_device *dev)
 	if (dev->queue_mode == NULL_Q_BIO)
 		dev->mbps = 0;
 
-	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	if (dev->zoned && !dev->zone_size) {
+		pr_err("Invalid zero zone size\n");
 		return -EINVAL;
 	}
 
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 94ff68052b1e..f63b6bed1bb3 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -83,6 +83,7 @@ struct nullb_device {
 	unsigned int imp_close_zone_no;
 	struct nullb_zone *zones;
 	sector_t zone_size_sects;
+	unsigned int zone_size_sects_shift;
 	bool need_zone_res_mgmt;
 	spinlock_t zone_res_lock;
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 55a69e48ef8b..015f6823706c 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -16,7 +16,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
 
 static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 {
-	return sect >> ilog2(dev->zone_size_sects);
+	if (dev->zone_size_sects_shift)
+		return sect >> dev->zone_size_sects_shift;
+
+	return div64_u64(sect, dev->zone_size_sects);
 }
 
 static inline void null_lock_zone_res(struct nullb_device *dev)
@@ -65,10 +68,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	sector_t sector = 0;
 	unsigned int i;
 
-	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("zone_size must be power-of-two\n");
-		return -EINVAL;
-	}
 	if (dev->zone_size > dev->size) {
 		pr_err("Zone size larger than device capacity\n");
 		return -EINVAL;
@@ -86,9 +85,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
-	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
-		>> ilog2(dev->zone_size_sects);
 
+	if (is_power_of_2(dev->zone_size_sects))
+		dev->zone_size_sects_shift = ilog2(dev->zone_size_sects);
+	else
+		dev->zone_size_sects_shift = 0;
+
+	dev->nr_zones =	DIV_ROUND_UP_SECTOR_T(dev_capacity_sects,
+					      dev->zone_size_sects);
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
 	if (!dev->zones)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 07/13] zonefs: allow non power of 2 zoned devices
       [not found]   ` <CGME20220920091127eucas1p252a2edaae00ff379bc6f9fbcc2791487@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Luis Chamberlain, Johannes Thumshirn,
	Chaitanya Kulkarni

The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs.

Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/zonefs/super.c  | 6 ++----
 fs/zonefs/zonefs.h | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 860f0b1032c6..e549ef16738c 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -476,10 +476,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
-	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	unsigned int noio_flag;
 	unsigned int nr_zones =
-		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
 	struct zonefs_ioerr_data err = {
 		.inode = inode,
 		.write = write,
@@ -1401,7 +1400,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	int ret = 0;
 
-	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
 	inode->i_mode = S_IFREG | sbi->s_perm;
 
 	zi->i_ztype = type;
@@ -1776,7 +1775,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	 * interface constraints.
 	 */
 	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
-	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
 	sbi->s_uid = GLOBAL_ROOT_UID;
 	sbi->s_gid = GLOBAL_ROOT_GID;
 	sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3233..39895195cda6 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -177,7 +177,6 @@ struct zonefs_sb_info {
 	kgid_t			s_gid;
 	umode_t			s_perm;
 	uuid_t			s_uuid;
-	unsigned int		s_zone_sectors_shift;
 
 	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 08/13] dm-zoned: ensure only power of 2 zone sizes are allowed
       [not found]   ` <CGME20220920091128eucas1p2d9c9dcebdce4039360f92b0f577ab649@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Luis Chamberlain, Johannes Thumshirn, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

dm-zoned relies on the assumption that the zone size is a
power-of-2(po2) and the zone capacity is same as the zone size.

Ensure only po2 devices can be used as dm-zoned target until a native
support for zoned devices with non-po2 zone size is added.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-zoned-target.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 95b132b52f33..9325bf5dee81 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -792,6 +792,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
 				return -EINVAL;
 			}
 			zone_nr_sectors = bdev_zone_sectors(bdev);
+			if (!is_power_of_2(zone_nr_sectors)) {
+				ti->error = "Zone size is not a power-of-2 number of sectors";
+				return -EINVAL;
+			}
 			zoned_dev->zone_nr_sectors = zone_nr_sectors;
 			zoned_dev->nr_zones = bdev_nr_zones(bdev);
 		}
@@ -804,6 +808,10 @@ static int dmz_fixup_devices(struct dm_target *ti)
 			return -EINVAL;
 		}
 		zoned_dev->zone_nr_sectors = bdev_zone_sectors(bdev);
+		if (!is_power_of_2(zoned_dev->zone_nr_sectors)) {
+			ti->error = "Zone size is not a power-of-2 number of sectors";
+			return -EINVAL;
+		}
 		zoned_dev->nr_zones = bdev_nr_zones(bdev);
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 09/13] dm-zone: use generic helpers to calculate offset from zone start
       [not found]   ` <CGME20220920091129eucas1p26137d368bf7e2cfc2d585ce41f3cdc86@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Luis Chamberlain, Johannes Thumshirn

Use the bdev_offset_from_zone_start() helper function to calculate
the offset from zone start instead of using power of 2 based
calculation.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-zone.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 3dafc0e8b7a9..ac6fc1293d41 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -390,7 +390,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE:
 		/* Writes must be aligned to the zone write pointer */
-		if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
+		if (bdev_offset_from_zone_start(md->disk->part0,
+						clone->bi_iter.bi_sector) != zwp_offset)
 			return false;
 		break;
 	case REQ_OP_ZONE_APPEND:
@@ -602,11 +603,8 @@ void dm_zone_endio(struct dm_io *io, struct bio *clone)
 		 */
 		if (clone->bi_status == BLK_STS_OK &&
 		    bio_op(clone) == REQ_OP_ZONE_APPEND) {
-			sector_t mask =
-				(sector_t)bdev_zone_sectors(disk->part0) - 1;
-
 			orig_bio->bi_iter.bi_sector +=
-				clone->bi_iter.bi_sector & mask;
+				bdev_offset_from_zone_start(disk->part0, clone->bi_iter.bi_sector);
 		}
 
 		return;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 10/13] dm-table: allow zoned devices with non power-of-2 zone sizes
       [not found]   ` <CGME20220920091131eucas1p1ab4ef3ac98b2fd27c546a44b583a1745@eucas1p1.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Johannes Thumshirn

Allow dm to support zoned devices with non power-of-2(po2) zone sizes as
the block layer now supports it.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-table.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index e42016359a77..38b83c383e8f 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -250,7 +250,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 	if (bdev_is_zoned(bdev)) {
 		unsigned int zone_sectors = bdev_zone_sectors(bdev);
 
-		if (start & (zone_sectors - 1)) {
+		if (!bdev_is_zone_start(bdev, start)) {
 			DMERR("%s: start=%llu not aligned to h/w zone size %u of %pg",
 			      dm_device_name(ti->table->md),
 			      (unsigned long long)start,
@@ -267,7 +267,7 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 		 * devices do not end up with a smaller zone in the middle of
 		 * the sector range.
 		 */
-		if (len & (zone_sectors - 1)) {
+		if (!bdev_is_zone_start(bdev, len)) {
 			DMERR("%s: len=%llu not aligned to h/w zone size %u of %pg",
 			      dm_device_name(ti->table->md),
 			      (unsigned long long)len,
@@ -1647,8 +1647,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
 		return -EINVAL;
 	}
 
-	/* Check zone size validity and compatibility */
-	if (!zone_sectors || !is_power_of_2(zone_sectors))
+	if (!zone_sectors)
 		return -EINVAL;
 
 	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 11/13] dm: call dm_zone_endio after the target endio callback for zoned devices
       [not found]   ` <CGME20220920091132eucas1p2e98ce6f411f1c3e8e10c4eae81aba296@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav

dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
uses either native append or append emulation, and it is called before the
endio of the target. But target endio can still update the clone bio
after dm_zone_endio is called, thereby, the orig bio does not contain
the updated information anymore.

Currently, this is not a problem as the targets that support zoned devices
such as dm-zoned, dm-linear, and dm-crypt do not have an endio function,
and even if they do (such as dm-flakey), they don't modify the
bio->bi_iter.bi_sector of the cloned bio that is used to update the
orig_bio's bi_sector in dm_zone_endio function.

This is a prep patch for the new dm-po2zoned target as it modifies
bi_sector in the endio callback.

Call dm_zone_endio for zoned devices after calling the target's endio
function.

Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7c35dea88ed1..874e1dc9fc26 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1122,10 +1122,6 @@ static void clone_endio(struct bio *bio)
 			disable_write_zeroes(md);
 	}
 
-	if (static_branch_unlikely(&zoned_enabled) &&
-	    unlikely(bdev_is_zoned(bio->bi_bdev)))
-		dm_zone_endio(io, bio);
-
 	if (endio) {
 		int r = endio(ti, bio, &error);
 		switch (r) {
@@ -1154,6 +1150,10 @@ static void clone_endio(struct bio *bio)
 		}
 	}
 
+	if (static_branch_unlikely(&zoned_enabled) &&
+	    unlikely(bdev_is_zoned(bio->bi_bdev)))
+		dm_zone_endio(io, bio);
+
 	if (static_branch_unlikely(&swap_bios_enabled) &&
 	    unlikely(swap_bios_limit(ti, bio)))
 		up(&md->swap_bios_semaphore);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 12/13] dm: introduce DM_EMULATED_ZONES target feature flag
       [not found]   ` <CGME20220920091133eucas1p1ce01c88d47ce934b2995bea6a6bc55df@eucas1p1.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav

Introduce a new target feature flag: DM_EMULATED_ZONES for targets with
a different number of sectors per zone (aka zone size) than the underlying
device zone size.

This target feature flag is introduced as the existing zoned targets assume
that the target and the underlying device have the same zone size.
The new target: dm-po2zoned will use this new target feature flag
as it emulates the zone boundary that is different from the underlying
zoned device.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
---
 drivers/md/dm-table.c         | 13 ++++++++++---
 include/linux/device-mapper.h |  9 +++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 38b83c383e8f..8324bd660f09 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1619,13 +1619,20 @@ static bool dm_table_supports_zoned_model(struct dm_table *t,
 	return true;
 }
 
-static int device_not_matches_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
+/*
+ * Callback function to check for device zone sector across devices. If the
+ * DM_TARGET_EMULATED_ZONES target feature flag is not set, then the target
+ * should have the same zone sector as the underlying devices.
+ */
+static int check_valid_device_zone_sectors(struct dm_target *ti, struct dm_dev *dev,
 					   sector_t start, sector_t len, void *data)
 {
 	unsigned int *zone_sectors = data;
 
-	if (!bdev_is_zoned(dev->bdev))
+	if (!bdev_is_zoned(dev->bdev) ||
+	    dm_target_supports_emulated_zones(ti->type))
 		return 0;
+
 	return bdev_zone_sectors(dev->bdev) != *zone_sectors;
 }
 
@@ -1650,7 +1657,7 @@ static int validate_hardware_zoned_model(struct dm_table *t,
 	if (!zone_sectors)
 		return -EINVAL;
 
-	if (dm_table_any_dev_attr(t, device_not_matches_zone_sectors, &zone_sectors)) {
+	if (dm_table_any_dev_attr(t, check_valid_device_zone_sectors, &zone_sectors)) {
 		DMERR("%s: zone sectors is not consistent across all zoned devices",
 		      dm_device_name(t->md));
 		return -EINVAL;
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index f8e3a96b97b0..8504da98e829 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -294,6 +294,15 @@ struct target_type {
 #define dm_target_supports_mixed_zoned_model(type) (false)
 #endif
 
+#ifdef CONFIG_BLK_DEV_ZONED
+#define DM_TARGET_EMULATED_ZONES	0x00000400
+#define dm_target_supports_emulated_zones(type) \
+	((type)->features & DM_TARGET_EMULATED_ZONES)
+#else
+#define DM_TARGET_EMULATED_ZONES	0x00000000
+#define dm_target_supports_emulated_zones(type) (false)
+#endif
+
 struct dm_target {
 	struct dm_table *table;
 	struct target_type *type;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v14 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes
       [not found]   ` <CGME20220920091134eucas1p275585b70fab48ba1f19eb5d2cc515b6d@eucas1p2.samsung.com>
@ 2022-09-20  9:11     ` Pankaj Raghav
  2022-09-21 16:37       ` Mike Snitzer
  0 siblings, 1 reply; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-20  9:11 UTC (permalink / raw)
  To: agk, snitzer, axboe, damien.lemoal, hch
  Cc: gost.dev, jaegeuk, Johannes.Thumshirn, hare, bvanassche,
	dm-devel, matias.bjorling, linux-kernel, linux-nvme, pankydev8,
	linux-block, Pankaj Raghav, Johannes Thumshirn, Damien Le Moal

Only zoned devices with power-of-2(po2) number of sectors per zone(zone
size) were supported in linux but now non power-of-2(npo2) zone sizes
support has been added to the block layer.

Filesystems such as F2FS and btrfs have support for zoned devices with
po2 zone size assumption. Before adding native support for npo2 zone
sizes, it was suggested to create a dm target for npo2 zone size device to
appear as a po2 zone size target so that file systems can initially
work without any explicit changes.

The design of this target is very simple: remap the device zone size to
the zone capacity and change the zone size to be the nearest power of 2
value.

For e.g., a device with a zone size/capacity of 3M will have an equivalent
target layout as follows:

Device layout :-
zone capacity = 3M
zone size = 3M

|--------------|-------------|
0             3M            6M

Target layout :-
zone capacity=3M
zone size = 4M

|--------------|---|--------------|---|
0             3M  4M             7M  8M

The area between target's zone capacity and zone size will be emulated
in the target.
The read IOs that fall in the emulated gap area will return 0 filled
bio and all the other IOs in that area will result in an error.
If a read IO span across the emulated area boundary, then the IOs are
split across them. All other IO operations that span across the emulated
area boundary will result in an error.

The target can be easily created as follows:
dmsetup create <label> --table '0 <size_sects> po2zoned /dev/nvme<id>'

The target does not support partial mapping of the underlying
device as there is no use-case for it.

Note:
This target is not related to dm-zoned target, which exposes a zoned block
device as a regular block device without any write constraint.

This target only exposes a different zone size than the underlying device.
The underlying device's other constraints will be directly exposed to the
target.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
Suggested-by: Hannes Reinecke <hare@suse.de>
---
 .../admin-guide/device-mapper/dm-po2zoned.rst |  79 +++++
 .../admin-guide/device-mapper/index.rst       |   1 +
 drivers/md/Kconfig                            |  10 +
 drivers/md/Makefile                           |   2 +
 drivers/md/dm-po2zoned-target.c               | 291 ++++++++++++++++++
 5 files changed, 383 insertions(+)
 create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
 create mode 100644 drivers/md/dm-po2zoned-target.c

diff --git a/Documentation/admin-guide/device-mapper/dm-po2zoned.rst b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
new file mode 100644
index 000000000000..8a35eab0b714
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
@@ -0,0 +1,79 @@
+===========
+dm-po2zoned
+===========
+The dm-po2zoned device mapper target exposes a zoned block device with a
+non-power-of-2(npo2) number of sectors per zone as a power-of-2(po2)
+number of sectors per zone(zone size).
+The filesystems that support zoned block devices such as F2FS and BTRFS
+assume po2 zone size as the kernel has traditionally only supported
+those devices. However, as the kernel now supports zoned block devices with
+npo2 zone sizes, the filesystems can run on top of the dm-po2zoned target before
+adding native support.
+
+Partial mapping of the underlying device is not supported by this target as
+there is no use-case for it.
+
+.. note::
+   This target is **not related** to **dm-zoned target**, which exposes a
+   zoned block device as a regular block device without any write constraint.
+
+   This target only exposes a different **zone size** than the underlying device.
+   The underlying device's other **constraints** will be exposed to the target.
+
+Algorithm
+=========
+The device mapper target maps the underlying device's zone size to the
+zone capacity and changes the zone size to the nearest po2 zone size.
+The gap between the zone capacity and the zone size is emulated in the target.
+E.g., a zoned block device with a zone size (and capacity) of 3M will have an
+equivalent target layout with mapping as follows:
+
+::
+
+  0M           3M  4M        6M 8M
+  |             |  |          |  |
+  +x------------+--+x---------+--+x-------  Target
+  |x            |  |x         |  |x
+   x               x             x
+   x               x             x
+   x              x             x
+   x             x             x
+  |x            |x            |x
+  +x------------+x------------+x----------  Device
+  |             |             |
+  0M           3M            6M
+
+A simple remap is performed for all the BIOs that do not cross the
+emulation gap area, i.e., the area between the zone capacity and size.
+
+If a BIO lies in the emulation gap area, the following operations are performed:
+
+	Read:
+		- If the BIO lies entirely in the emulation gap area, then zero out the BIO and complete it.
+		- If the BIO spans the emulation gap area, split the BIO across the zone capacity boundary
+                  and remap only the BIO within the zone capacity boundary. The other part of the split BIO
+                  will be zeroed out.
+
+	Other operations:
+                - Return an error
+
+Table parameters
+================
+
+::
+
+  <dev path>
+
+Mandatory parameters:
+
+    <dev path>:
+        Full pathname to the underlying block-device, or a
+        "major:minor" device-number.
+
+Examples
+========
+
+::
+
+  #!/bin/sh
+  echo "0 `blockdev --getsz $1` po2zoned $1" | dmsetup create po2z
diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
index cde52cc09645..5df93711cef5 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -23,6 +23,7 @@ Device Mapper
     dm-service-time
     dm-uevent
     dm-zoned
+    dm-po2zoned
     era
     kcopyd
     linear
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 998a5cfdbc4e..74fdfd02ab5f 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -518,6 +518,16 @@ config DM_FLAKEY
 	help
 	 A target that intermittently fails I/O for debugging purposes.
 
+config DM_PO2ZONED
+	tristate "Zoned block devices target emulating a power-of-2 number of sectors per zone"
+	depends on BLK_DEV_DM
+	depends on BLK_DEV_ZONED
+	help
+	  A target that converts a zoned block device with non-power-of-2(npo2)
+	  number of sectors per zone to be power-of-2(po2). Use this target for
+	  zoned block devices with npo2 number of sectors per zone until native
+	  support is added to the filesystems and applications.
+
 config DM_VERITY
 	tristate "Verity target support"
 	depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 84291e38dca8..ee05722bc637 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
 dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
 dm-verity-y	+= dm-verity-target.o
 dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
+dm-po2zoned-y	+= dm-po2zoned-target.o
 
 md-mod-y	+= md.o md-bitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
@@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
 obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
 obj-$(CONFIG_DM_DUST)		+= dm-dust.o
 obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
+obj-$(CONFIG_DM_PO2ZONED)	+= dm-po2zoned.o
 obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
diff --git a/drivers/md/dm-po2zoned-target.c b/drivers/md/dm-po2zoned-target.c
new file mode 100644
index 000000000000..1d2f46a594f8
--- /dev/null
+++ b/drivers/md/dm-po2zoned-target.c
@@ -0,0 +1,291 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022 Samsung Electronics Co., Ltd.
+ */
+
+#include <linux/device-mapper.h>
+
+#define DM_MSG_PREFIX "po2zoned"
+
+struct dm_po2z_target {
+	struct dm_dev *dev;
+	sector_t zone_size; /* Actual zone size of the underlying dev*/
+	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
+	unsigned int zone_size_po2_shift;
+	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
+	unsigned int nr_zones;
+};
+
+static inline unsigned int npo2_zone_no(struct dm_po2z_target *dmh,
+					sector_t sect)
+{
+	return div64_u64(sect, dmh->zone_size);
+}
+
+static inline unsigned int po2_zone_no(struct dm_po2z_target *dmh,
+				       sector_t sect)
+{
+	return sect >> dmh->zone_size_po2_shift;
+}
+
+static inline sector_t device_to_target_sect(struct dm_target *ti,
+					     sector_t sect)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	return sect + (npo2_zone_no(dmh, sect) * dmh->zone_size_diff) +
+	       ti->begin;
+}
+
+/*
+ * This target works on the complete zoned device. Partial mapping is not
+ * supported.
+ * Construct a zoned po2 logical device: <dev-path>
+ */
+static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+	struct dm_po2z_target *dmh = NULL;
+	int ret;
+	sector_t zone_size;
+	sector_t dev_capacity;
+
+	if (argc != 1)
+		return -EINVAL;
+
+	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
+	if (!dmh)
+		return -ENOMEM;
+
+	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+			    &dmh->dev);
+	if (ret) {
+		ti->error = "Device lookup failed";
+		goto bad;
+	}
+
+	if (!bdev_is_zoned(dmh->dev->bdev)) {
+		DMERR("%pg is not a zoned device", dmh->dev->bdev);
+		ret = -EINVAL;
+		goto bad;
+	}
+
+	zone_size = bdev_zone_sectors(dmh->dev->bdev);
+	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
+	if (ti->len != dev_capacity) {
+		DMERR("%pg Partial mapping of the target is not supported",
+		      dmh->dev->bdev);
+		ret = -EINVAL;
+		goto bad;
+	}
+
+	if (is_power_of_2(zone_size))
+		DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
+		       dmh->dev->bdev);
+
+	dmh->zone_size = zone_size;
+	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
+	dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
+	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
+	ti->private = dmh;
+	ti->max_io_len = dmh->zone_size_po2;
+	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
+	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
+	return 0;
+
+bad:
+	kfree(dmh);
+	return ret;
+}
+
+static void dm_po2z_dtr(struct dm_target *ti)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	dm_put_device(ti, dmh->dev);
+	kfree(dmh);
+}
+
+static int dm_po2z_report_zones_cb(struct blk_zone *zone, unsigned int idx,
+				   void *data)
+{
+	struct dm_report_zones_args *args = data;
+	struct dm_target *ti = args->tgt;
+	struct dm_po2z_target *dmh = ti->private;
+
+	zone->start = device_to_target_sect(ti, zone->start);
+	zone->wp = device_to_target_sect(ti, zone->wp);
+	zone->len = dmh->zone_size_po2;
+	args->next_sector = zone->start + zone->len;
+
+	return args->orig_cb(zone, args->zone_idx++, args->orig_data);
+}
+
+static int dm_po2z_report_zones(struct dm_target *ti,
+				struct dm_report_zones_args *args,
+				unsigned int nr_zones)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	sector_t sect =
+		po2_zone_no(dmh, dm_target_offset(ti, args->next_sector)) *
+		dmh->zone_size;
+
+	return blkdev_report_zones(dmh->dev->bdev, sect, nr_zones,
+				   dm_po2z_report_zones_cb, args);
+}
+
+static int dm_po2z_end_io(struct dm_target *ti, struct bio *bio,
+			  blk_status_t *error)
+{
+	if (bio->bi_status == BLK_STS_OK && bio_op(bio) == REQ_OP_ZONE_APPEND)
+		bio->bi_iter.bi_sector =
+			device_to_target_sect(ti, bio->bi_iter.bi_sector);
+
+	return DM_ENDIO_DONE;
+}
+
+static void dm_po2z_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	struct dm_po2z_target *dmh = ti->private;
+
+	limits->chunk_sectors = dmh->zone_size_po2;
+}
+
+static void dm_po2z_status(struct dm_target *ti, status_type_t type,
+			   unsigned int status_flags, char *result,
+			   unsigned int maxlen)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	size_t sz = 0;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		DMEMIT("%s %lld", dmh->dev->name,
+		       (unsigned long long)dmh->zone_size);
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%s", dmh->dev->name);
+		break;
+
+	case STATUSTYPE_IMA:
+		result[0] = '\0';
+		break;
+	}
+}
+
+/**
+ * dm_po2z_bio_in_emulated_zone_area - check if bio is in the emulated zone area
+ * @dmh:	target data
+ * @bio:	bio
+ * @offset:	bio offset to emulated zone boundary
+ *
+ * Check if a @bio is partly or completely in the emulated zone area. If the
+ * @bio is partly in the emulated zone area, @offset can be used to split
+ * the @bio across the emulated zone boundary. @offset
+ * will be negative if the @bio completely lies in the emulated area.
+ *
+ */
+static bool dm_po2z_bio_in_emulated_zone_area(struct dm_po2z_target *dmh,
+					      struct bio *bio, int *offset)
+{
+	unsigned int zone_idx = po2_zone_no(dmh, bio->bi_iter.bi_sector);
+	sector_t nr_sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+	sector_t sector_offset =
+		bio->bi_iter.bi_sector - (zone_idx << dmh->zone_size_po2_shift);
+
+	*offset = dmh->zone_size - sector_offset;
+
+	return sector_offset + nr_sectors > dmh->zone_size;
+}
+
+static inline void dm_po2z_read_zeroes(struct bio *bio)
+{
+	zero_fill_bio(bio);
+	bio_endio(bio);
+}
+
+static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	int split_io_pos;
+
+	bio_set_dev(bio, dmh->dev->bdev);
+	bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
+
+	if (op_is_zone_mgmt(bio_op(bio)) || !bio_sectors(bio))
+		goto remap_sector;
+
+	if (!dm_po2z_bio_in_emulated_zone_area(dmh, bio, &split_io_pos))
+		goto remap_sector;
+
+	/*
+	 * Read operation on the emulated zone area (between zone capacity
+	 * and zone size) will fill the bio with zeroes. Any other operation
+	 * in the emulated area should return an error.
+	 */
+	if (bio_op(bio) == REQ_OP_READ) {
+		/*
+		 * If the bio is across emulated zone boundary, split the bio at
+		 * the boundary.
+		 */
+		if (split_io_pos > 0) {
+			dm_accept_partial_bio(bio, split_io_pos);
+			goto remap_sector;
+		}
+
+		dm_po2z_read_zeroes(bio);
+		return DM_MAPIO_SUBMITTED;
+	}
+	/* Other IOs in emulated zone area should result in an error */
+	return DM_MAPIO_KILL;
+
+remap_sector:
+	/* convert from target sector to device sector */
+	bio->bi_iter.bi_sector -= (po2_zone_no(dmh, bio->bi_iter.bi_sector) *
+				   dmh->zone_size_diff);
+	return DM_MAPIO_REMAPPED;
+}
+
+static int dm_po2z_iterate_devices(struct dm_target *ti,
+				   iterate_devices_callout_fn fn, void *data)
+{
+	struct dm_po2z_target *dmh = ti->private;
+	sector_t len = dmh->nr_zones * dmh->zone_size;
+
+	return fn(ti, dmh->dev, 0, len, data);
+}
+
+static struct target_type dm_po2z_target = {
+	.name = "po2zoned",
+	.version = { 1, 0, 0 },
+	.features = DM_TARGET_ZONED_HM | DM_TARGET_EMULATED_ZONES |
+		    DM_TARGET_NOWAIT,
+	.map = dm_po2z_map,
+	.end_io = dm_po2z_end_io,
+	.report_zones = dm_po2z_report_zones,
+	.iterate_devices = dm_po2z_iterate_devices,
+	.module = THIS_MODULE,
+	.io_hints = dm_po2z_io_hints,
+	.status = dm_po2z_status,
+	.ctr = dm_po2z_ctr,
+	.dtr = dm_po2z_dtr,
+};
+
+static int __init dm_po2z_init(void)
+{
+	return dm_register_target(&dm_po2z_target);
+}
+
+static void __exit dm_po2z_exit(void)
+{
+	dm_unregister_target(&dm_po2z_target);
+}
+
+/* Module hooks */
+module_init(dm_po2z_init);
+module_exit(dm_po2z_exit);
+
+MODULE_DESCRIPTION(DM_NAME "power-of-2 zoned target");
+MODULE_AUTHOR("Pankaj Raghav <p.raghav@samsung.com>");
+MODULE_LICENSE("GPL");
+
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v14 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes
  2022-09-20  9:11     ` [PATCH v14 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes Pankaj Raghav
@ 2022-09-21 16:37       ` Mike Snitzer
  2022-09-21 17:32         ` Pankaj Raghav
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Snitzer @ 2022-09-21 16:37 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: agk, snitzer, axboe, damien.lemoal, hch, Damien Le Moal,
	bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, Johannes Thumshirn, jaegeuk,
	matias.bjorling

On Tue, Sep 20 2022 at  5:11P -0400,
Pankaj Raghav <p.raghav@samsung.com> wrote:

> Only zoned devices with power-of-2(po2) number of sectors per zone(zone
> size) were supported in linux but now non power-of-2(npo2) zone sizes
> support has been added to the block layer.
> 
> Filesystems such as F2FS and btrfs have support for zoned devices with
> po2 zone size assumption. Before adding native support for npo2 zone
> sizes, it was suggested to create a dm target for npo2 zone size device to
> appear as a po2 zone size target so that file systems can initially
> work without any explicit changes.
> 
> The design of this target is very simple: remap the device zone size to
> the zone capacity and change the zone size to be the nearest power of 2
> value.
> 
> For e.g., a device with a zone size/capacity of 3M will have an equivalent
> target layout as follows:
> 
> Device layout :-
> zone capacity = 3M
> zone size = 3M
> 
> |--------------|-------------|
> 0             3M            6M
> 
> Target layout :-
> zone capacity=3M
> zone size = 4M
> 
> |--------------|---|--------------|---|
> 0             3M  4M             7M  8M
> 
> The area between target's zone capacity and zone size will be emulated
> in the target.
> The read IOs that fall in the emulated gap area will return 0 filled
> bio and all the other IOs in that area will result in an error.
> If a read IO span across the emulated area boundary, then the IOs are
> split across them. All other IO operations that span across the emulated
> area boundary will result in an error.
> 
> The target can be easily created as follows:
> dmsetup create <label> --table '0 <size_sects> po2zoned /dev/nvme<id>'
> 
> The target does not support partial mapping of the underlying
> device as there is no use-case for it.
> 
> Note:
> This target is not related to dm-zoned target, which exposes a zoned block
> device as a regular block device without any write constraint.
> 
> This target only exposes a different zone size than the underlying device.
> The underlying device's other constraints will be directly exposed to the
> target.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
> Suggested-by: Hannes Reinecke <hare@suse.de>
> ---
>  .../admin-guide/device-mapper/dm-po2zoned.rst |  79 +++++
>  .../admin-guide/device-mapper/index.rst       |   1 +
>  drivers/md/Kconfig                            |  10 +
>  drivers/md/Makefile                           |   2 +
>  drivers/md/dm-po2zoned-target.c               | 291 ++++++++++++++++++
>  5 files changed, 383 insertions(+)
>  create mode 100644 Documentation/admin-guide/device-mapper/dm-po2zoned.rst
>  create mode 100644 drivers/md/dm-po2zoned-target.c
> 
> diff --git a/Documentation/admin-guide/device-mapper/dm-po2zoned.rst b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
> new file mode 100644
> index 000000000000..8a35eab0b714
> --- /dev/null
> +++ b/Documentation/admin-guide/device-mapper/dm-po2zoned.rst
> @@ -0,0 +1,79 @@
> +===========
> +dm-po2zoned
> +===========
> +The dm-po2zoned device mapper target exposes a zoned block device with a
> +non-power-of-2(npo2) number of sectors per zone as a power-of-2(po2)
> +number of sectors per zone(zone size).
> +The filesystems that support zoned block devices such as F2FS and BTRFS
> +assume po2 zone size as the kernel has traditionally only supported
> +those devices. However, as the kernel now supports zoned block devices with
> +npo2 zone sizes, the filesystems can run on top of the dm-po2zoned target before
> +adding native support.
> +
> +Partial mapping of the underlying device is not supported by this target as
> +there is no use-case for it.
> +
> +.. note::
> +   This target is **not related** to **dm-zoned target**, which exposes a
> +   zoned block device as a regular block device without any write constraint.
> +
> +   This target only exposes a different **zone size** than the underlying device.
> +   The underlying device's other **constraints** will be exposed to the target.
> +
> +Algorithm
> +=========
> +The device mapper target maps the underlying device's zone size to the
> +zone capacity and changes the zone size to the nearest po2 zone size.
> +The gap between the zone capacity and the zone size is emulated in the target.
> +E.g., a zoned block device with a zone size (and capacity) of 3M will have an
> +equivalent target layout with mapping as follows:
> +
> +::
> +
> +  0M           3M  4M        6M 8M
> +  |             |  |          |  |
> +  +x------------+--+x---------+--+x-------  Target
> +  |x            |  |x         |  |x
> +   x               x             x
> +   x               x             x
> +   x              x             x
> +   x             x             x
> +  |x            |x            |x
> +  +x------------+x------------+x----------  Device
> +  |             |             |
> +  0M           3M            6M
> +
> +A simple remap is performed for all the BIOs that do not cross the
> +emulation gap area, i.e., the area between the zone capacity and size.
> +
> +If a BIO lies in the emulation gap area, the following operations are performed:
> +
> +	Read:
> +		- If the BIO lies entirely in the emulation gap area, then zero out the BIO and complete it.
> +		- If the BIO spans the emulation gap area, split the BIO across the zone capacity boundary
> +                  and remap only the BIO within the zone capacity boundary. The other part of the split BIO
> +                  will be zeroed out.
> +
> +	Other operations:
> +                - Return an error
> +
> +Table parameters
> +================
> +
> +::
> +
> +  <dev path>
> +
> +Mandatory parameters:
> +
> +    <dev path>:
> +        Full pathname to the underlying block-device, or a
> +        "major:minor" device-number.
> +
> +Examples
> +========
> +
> +::
> +
> +  #!/bin/sh
> +  echo "0 `blockdev --getsz $1` po2zoned $1" | dmsetup create po2z
> diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
> index cde52cc09645..5df93711cef5 100644
> --- a/Documentation/admin-guide/device-mapper/index.rst
> +++ b/Documentation/admin-guide/device-mapper/index.rst
> @@ -23,6 +23,7 @@ Device Mapper
>      dm-service-time
>      dm-uevent
>      dm-zoned
> +    dm-po2zoned
>      era
>      kcopyd
>      linear
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 998a5cfdbc4e..74fdfd02ab5f 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -518,6 +518,16 @@ config DM_FLAKEY
>  	help
>  	 A target that intermittently fails I/O for debugging purposes.
>  
> +config DM_PO2ZONED
> +	tristate "Zoned block devices target emulating a power-of-2 number of sectors per zone"
> +	depends on BLK_DEV_DM
> +	depends on BLK_DEV_ZONED
> +	help
> +	  A target that converts a zoned block device with non-power-of-2(npo2)
> +	  number of sectors per zone to be power-of-2(po2). Use this target for
> +	  zoned block devices with npo2 number of sectors per zone until native
> +	  support is added to the filesystems and applications.
> +
>  config DM_VERITY
>  	tristate "Verity target support"
>  	depends on BLK_DEV_DM
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 84291e38dca8..ee05722bc637 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -26,6 +26,7 @@ dm-era-y	+= dm-era-target.o
>  dm-clone-y	+= dm-clone-target.o dm-clone-metadata.o
>  dm-verity-y	+= dm-verity-target.o
>  dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
> +dm-po2zoned-y	+= dm-po2zoned-target.o
>  
>  md-mod-y	+= md.o md-bitmap.o
>  raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
> @@ -60,6 +61,7 @@ obj-$(CONFIG_DM_CRYPT)		+= dm-crypt.o
>  obj-$(CONFIG_DM_DELAY)		+= dm-delay.o
>  obj-$(CONFIG_DM_DUST)		+= dm-dust.o
>  obj-$(CONFIG_DM_FLAKEY)		+= dm-flakey.o
> +obj-$(CONFIG_DM_PO2ZONED)	+= dm-po2zoned.o
>  obj-$(CONFIG_DM_MULTIPATH)	+= dm-multipath.o dm-round-robin.o
>  obj-$(CONFIG_DM_MULTIPATH_QL)	+= dm-queue-length.o
>  obj-$(CONFIG_DM_MULTIPATH_ST)	+= dm-service-time.o
> diff --git a/drivers/md/dm-po2zoned-target.c b/drivers/md/dm-po2zoned-target.c
> new file mode 100644
> index 000000000000..1d2f46a594f8
> --- /dev/null
> +++ b/drivers/md/dm-po2zoned-target.c
> @@ -0,0 +1,291 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Samsung Electronics Co., Ltd.
> + */
> +
> +#include <linux/device-mapper.h>
> +
> +#define DM_MSG_PREFIX "po2zoned"
> +
> +struct dm_po2z_target {
> +	struct dm_dev *dev;
> +	sector_t zone_size; /* Actual zone size of the underlying dev*/
> +	sector_t zone_size_po2; /* zone_size rounded to the nearest po2 value */
> +	unsigned int zone_size_po2_shift;
> +	sector_t zone_size_diff; /* diff between zone_size_po2 and zone_size */
> +	unsigned int nr_zones;
> +};
> +
> +static inline unsigned int npo2_zone_no(struct dm_po2z_target *dmh,
> +					sector_t sect)
> +{
> +	return div64_u64(sect, dmh->zone_size);
> +}
> +
> +static inline unsigned int po2_zone_no(struct dm_po2z_target *dmh,
> +				       sector_t sect)
> +{
> +	return sect >> dmh->zone_size_po2_shift;
> +}
> +
> +static inline sector_t device_to_target_sect(struct dm_target *ti,
> +					     sector_t sect)
> +{
> +	struct dm_po2z_target *dmh = ti->private;
> +
> +	return sect + (npo2_zone_no(dmh, sect) * dmh->zone_size_diff) +
> +	       ti->begin;
> +}
> +
> +/*
> + * This target works on the complete zoned device. Partial mapping is not
> + * supported.
> + * Construct a zoned po2 logical device: <dev-path>
> + */
> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> +	struct dm_po2z_target *dmh = NULL;
> +	int ret;
> +	sector_t zone_size;
> +	sector_t dev_capacity;
> +
> +	if (argc != 1)
> +		return -EINVAL;
> +
> +	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
> +	if (!dmh)
> +		return -ENOMEM;
> +
> +	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
> +			    &dmh->dev);
> +	if (ret) {
> +		ti->error = "Device lookup failed";
> +		goto bad;
> +	}
> +
> +	if (!bdev_is_zoned(dmh->dev->bdev)) {
> +		DMERR("%pg is not a zoned device", dmh->dev->bdev);
> +		ret = -EINVAL;
> +		goto bad;
> +	}
> +
> +	zone_size = bdev_zone_sectors(dmh->dev->bdev);
> +	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
> +	if (ti->len != dev_capacity) {
> +		DMERR("%pg Partial mapping of the target is not supported",
> +		      dmh->dev->bdev);
> +		ret = -EINVAL;
> +		goto bad;
> +	}
> +
> +	if (is_power_of_2(zone_size))
> +		DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
> +		       dmh->dev->bdev);
> +
> +	dmh->zone_size = zone_size;
> +	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
> +	dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
> +	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
> +	ti->private = dmh;
> +	ti->max_io_len = dmh->zone_size_po2;
> +	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
> +	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
> +	return 0;
> +
> +bad:
> +	kfree(dmh);
> +	return ret;
> +}

This error handling still isn't correct.  You're using
dm_get_device().  If you return early due to error, _after_
dm_get_device(), you need to dm_put_device().

Basically you need a new label above "bad:" that calls dm_put_device()
then falls through to "bad:".  Or you need to explcitly call
dm_put_device() before "goto bad;" in the if (ti->len != dev_capacity)
error branch.

Mike


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with  non-power-of-2 zone sizes]
  2022-09-20  9:11 ` [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes Pankaj Raghav
                     ` (12 preceding siblings ...)
       [not found]   ` <CGME20220920091134eucas1p275585b70fab48ba1f19eb5d2cc515b6d@eucas1p2.samsung.com>
@ 2022-09-21 17:27   ` Mike Snitzer
  2022-09-21 23:55     ` Damien Le Moal
  13 siblings, 1 reply; 24+ messages in thread
From: Mike Snitzer @ 2022-09-21 17:27 UTC (permalink / raw)
  To: damien.lemoal, Pankaj Raghav
  Cc: agk, snitzer, axboe, hch, bvanassche, pankydev8, gost.dev,
	linux-kernel, linux-nvme, linux-block, dm-devel,
	Johannes.Thumshirn, jaegeuk, matias.bjorling

On Tue, Sep 20 2022 at  5:11P -0400,
Pankaj Raghav <p.raghav@samsung.com> wrote:

> - Background and Motivation:
> 
> The zone storage implementation in Linux, introduced since v4.10, first
> targetted SMR drives which have a power of 2 (po2) zone size alignment
> requirement. The po2 zone size was further imposed implicitly by the
> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
> across chunks beyond the specified size, since v3.16 through commit
> 762380ad9322 ("block: add notion of a chunk size for request merging").
> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
> to be non-power-of-2").
> 
> NAND, which is the media used in newer zoned storage devices, does not
> naturally align to po2. In these devices, zone capacity(cap) is not the
> same as the po2 zone size. When the zone cap != zone size, then unmapped
> LBAs are introduced to cover the space between the zone cap and zone size.
> po2 requirement does not make sense for these type of zone storage devices.
> This patch series aims to remove these unmapped LBAs for zoned devices when
> zone cap is npo2. This is done by relaxing the po2 zone size constraint
> in the kernel and allowing zoned device with npo2 zone sizes if zone cap
> == zone size.
> 
> Removing the po2 requirement from zone storage should be possible
> now provided that no userspace regression and no performance regressions are
> introduced. Stop-gap patches have been already merged into f2fs-tools to
> proactively not allow npo2 zone sizes until proper support is added [1].
> 
> There were two efforts previously to add support to npo2 devices: 1) via
> device level emulation [2] but that was rejected with a final conclusion
> to add support for non po2 zoned device in the complete stack[3] 2)
> adding support to the complete stack by removing the constraint in the
> block layer and NVMe layer with support to btrfs, zonefs, etc which was
> rejected with a conclusion to add a dm target for FS support [0]
> to reduce the regression impact.
> 
> This series adds support to npo2 zoned devices in the block and nvme
> layer and a new **dm target** is added: dm-po2zoned-target. This new
> target will be initially used for filesystems such as btrfs and
> f2fs until native npo2 zone support is added.

As this patchset nears the point of being "ready for merge" and DM's
"zoned" oriented targets are multiplying, I need to understand: where
are we collectively going?  How long are we expecting to support the
"stop-gap zoned storage" layers we've constructed?

I know https://zonedstorage.io/docs/introduction exists... but it
_seems_ stale given the emergence of ZNS and new permutations of zoned
hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
still left wanting (e.g. "bring it all home for me!")...

Damien, as the most "zoned storage" oriented engineer I know, can you
please kick things off by shedding light on where Linux is now, and
where it's going, for "zoned storage"?

To give some additional context to help me when you answer: I'm left
wondering what, if any, role dm-zoned has to play moving forward given
ZNS is "the future" (and yeah "the future" is now but...)?  E.g.: Does
it make sense to stack dm-zoned ontop of dm-po2zoned!?

Yet more context: When I'm asked to add full-blown support for
dm-zoned to RHEL my gut is "please no, why!?".  And if we really
should add dm-zoned is dm-po2zoned now also a requirement (to support
non-power-of-2 ZNS devices in our never-ending engineering of "zoned
storage" compatibility stop-gaps)?

In addition, it was my understanding that WDC had yet another zoned DM
target called "dm-zap" that is for ZNS based devices... It's all a bit
messy in my head (that's on me for not keeping up, but I think we need
a recap!)

So please help me, and others, become more informed as quickly as
possible! ;)

Thanks,
Mike

ps. I'm asking all this in the open on various Linux mailing lists
because it doesn't seem right to request a concall to inform only
me... I think others may need similar "zoned storage" help.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v14 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes
  2022-09-21 16:37       ` Mike Snitzer
@ 2022-09-21 17:32         ` Pankaj Raghav
  0 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-21 17:32 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: agk, snitzer, axboe, damien.lemoal, hch, Damien Le Moal,
	bvanassche, pankydev8, gost.dev, linux-kernel, linux-nvme,
	linux-block, dm-devel, Johannes Thumshirn, jaegeuk,
	matias.bjorling

>> +/*
>> + * This target works on the complete zoned device. Partial mapping is not
>> + * supported.
>> + * Construct a zoned po2 logical device: <dev-path>
>> + */
>> +static int dm_po2z_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>> +{
>> +	struct dm_po2z_target *dmh = NULL;
>> +	int ret;
>> +	sector_t zone_size;
>> +	sector_t dev_capacity;
>> +
>> +	if (argc != 1)
>> +		return -EINVAL;
>> +
>> +	dmh = kmalloc(sizeof(*dmh), GFP_KERNEL);
>> +	if (!dmh)
>> +		return -ENOMEM;
>> +
>> +	ret = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
>> +			    &dmh->dev);
>> +	if (ret) {
>> +		ti->error = "Device lookup failed";
>> +		goto bad;
>> +	}
>> +
>> +	if (!bdev_is_zoned(dmh->dev->bdev)) {
>> +		DMERR("%pg is not a zoned device", dmh->dev->bdev);
>> +		ret = -EINVAL;
>> +		goto bad;
>> +	}
>> +
>> +	zone_size = bdev_zone_sectors(dmh->dev->bdev);
>> +	dev_capacity = get_capacity(dmh->dev->bdev->bd_disk);
>> +	if (ti->len != dev_capacity) {
>> +		DMERR("%pg Partial mapping of the target is not supported",
>> +		      dmh->dev->bdev);
>> +		ret = -EINVAL;
>> +		goto bad;
>> +	}
>> +
>> +	if (is_power_of_2(zone_size))
>> +		DMWARN("%pg: underlying device has a power-of-2 number of sectors per zone",
>> +		       dmh->dev->bdev);
>> +
>> +	dmh->zone_size = zone_size;
>> +	dmh->zone_size_po2 = 1 << get_count_order_long(zone_size);
>> +	dmh->zone_size_po2_shift = ilog2(dmh->zone_size_po2);
>> +	dmh->zone_size_diff = dmh->zone_size_po2 - dmh->zone_size;
>> +	ti->private = dmh;
>> +	ti->max_io_len = dmh->zone_size_po2;
>> +	dmh->nr_zones = npo2_zone_no(dmh, ti->len);
>> +	ti->len = dmh->zone_size_po2 * dmh->nr_zones;
>> +	return 0;
>> +
>> +bad:
>> +	kfree(dmh);
>> +	return ret;
>> +}
> 
> This error handling still isn't correct.  You're using
> dm_get_device().  If you return early due to error, _after_
> dm_get_device(), you need to dm_put_device().
> 
> Basically you need a new label above "bad:" that calls dm_put_device()
> then falls through to "bad:".  Or you need to explcitly call
> dm_put_device() before "goto bad;" in the if (ti->len != dev_capacity)
> error branch.
> 

Ah. I naively assumed dtr will be called to cleanup but not in this case as
the ctr itself fails.

I will add an extra label on top of `bad` and use it for errors that
happens after `dm_get_device`. Thanks for pointing it out Mike.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
  2022-09-21 17:27   ` Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes] Mike Snitzer
@ 2022-09-21 23:55     ` Damien Le Moal
  2022-09-22 11:53       ` Pankaj Raghav
                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Damien Le Moal @ 2022-09-21 23:55 UTC (permalink / raw)
  To: Mike Snitzer, Pankaj Raghav
  Cc: agk, snitzer, axboe, hch, bvanassche, pankydev8, gost.dev,
	linux-kernel, linux-nvme, linux-block, dm-devel,
	Johannes.Thumshirn, jaegeuk, matias.bjorling

On 9/22/22 02:27, Mike Snitzer wrote:
> On Tue, Sep 20 2022 at  5:11P -0400,
> Pankaj Raghav <p.raghav@samsung.com> wrote:
> 
>> - Background and Motivation:
>>
>> The zone storage implementation in Linux, introduced since v4.10, first
>> targetted SMR drives which have a power of 2 (po2) zone size alignment
>> requirement. The po2 zone size was further imposed implicitly by the
>> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
>> across chunks beyond the specified size, since v3.16 through commit
>> 762380ad9322 ("block: add notion of a chunk size for request merging").
>> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
>> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
>> to be non-power-of-2").
>>
>> NAND, which is the media used in newer zoned storage devices, does not
>> naturally align to po2. In these devices, zone capacity(cap) is not the
>> same as the po2 zone size. When the zone cap != zone size, then unmapped
>> LBAs are introduced to cover the space between the zone cap and zone size.
>> po2 requirement does not make sense for these type of zone storage devices.
>> This patch series aims to remove these unmapped LBAs for zoned devices when
>> zone cap is npo2. This is done by relaxing the po2 zone size constraint
>> in the kernel and allowing zoned device with npo2 zone sizes if zone cap
>> == zone size.
>>
>> Removing the po2 requirement from zone storage should be possible
>> now provided that no userspace regression and no performance regressions are
>> introduced. Stop-gap patches have been already merged into f2fs-tools to
>> proactively not allow npo2 zone sizes until proper support is added [1].
>>
>> There were two efforts previously to add support to npo2 devices: 1) via
>> device level emulation [2] but that was rejected with a final conclusion
>> to add support for non po2 zoned device in the complete stack[3] 2)
>> adding support to the complete stack by removing the constraint in the
>> block layer and NVMe layer with support to btrfs, zonefs, etc which was
>> rejected with a conclusion to add a dm target for FS support [0]
>> to reduce the regression impact.
>>
>> This series adds support to npo2 zoned devices in the block and nvme
>> layer and a new **dm target** is added: dm-po2zoned-target. This new
>> target will be initially used for filesystems such as btrfs and
>> f2fs until native npo2 zone support is added.
> 
> As this patchset nears the point of being "ready for merge" and DM's
> "zoned" oriented targets are multiplying, I need to understand: where
> are we collectively going?  How long are we expecting to support the
> "stop-gap zoned storage" layers we've constructed?
> 
> I know https://zonedstorage.io/docs/introduction exists... but it
> _seems_ stale given the emergence of ZNS and new permutations of zoned
> hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
> still left wanting (e.g. "bring it all home for me!")...
> 
> Damien, as the most "zoned storage" oriented engineer I know, can you
> please kick things off by shedding light on where Linux is now, and
> where it's going, for "zoned storage"?

Let me first start with what we have seen so far with deployments in the
field.

The largest user base for zoned storage is for now hyperscalers (cloud
services) deploying SMR disks. E.g. Dropbox has many times publicized its
use of SMR HDDs. ZNS is fairly new, and while it is being actively
evaluated by many, there are not yet any large scale deployments that I am
aware of.

Most of the large scale SMR users today mainly use the zoned storage
drives directly, without a file system, similarly to their use of regular
block devices. Some erasure coded object store sits on top of the zoned
drives and manage them. The interface used for that has now switched to
using the kernel API, from libzbc pass-through in the early days of SMR
support. With the inclusion of zonefs in kernel 5.6, many are now
switching to using that instead of directly accessing the block device
file. zonefs makes the application development somewhat easier (there is
no need for issuing zone management ioctls) and can also result in
applications that can actually run almost as-is on top of regular block
devices with a file system. That is a very interesting property,
especially in development phase for the user.

Beside these large scale SMR deployments, there are also many smaller
users. For these cases, dm-zoned seemed to be used a lot. In particular,
the Chia cryptocurrency boom (now fading ?) did generate a fair amount of
new SMR users relying on dm-zoned. With btrfs zoned storage support
maturing, dm-zoned is not as needed as it used to though. SMR drives can
be used directly under btrfs and I certainly am always recommending this
approach over dm-zoned+ext4 or dm-zoned+xfs as performance is much better
for write intensive workloads.

For Linux kernel overall, zoned storage is in a very good shape for raw
block device use and zonefs use. Production deployments we are seeing are
a proof of that. Currently, my team effort is mostly focused on btrfs and
zonefs and increasing zoned storage use cases.

1) For btrfs, Johannes and Naohiro are working on stabilizing support for
ZNS (we still have some issues with the management of active zones) and
implementing de-clustered parity RAID support so that zoned drives can be
used in RAID 0, 1, 10, 5, 6 and erasure coded volumes. This will address
use cases such as home NAS boxes, backup servers, small file servers,
video applications (e.g. video surveillance) etc. Essentially, any
application with large storage capacity needs that is not a distributed
setup. There are many.

2) For zonefs, I have some to-do items lined up to improve performance
(better read IO tail latency) and further improve ease of use (e.g. remove
the O_DIRECT write constraint).

3) At the block device level, we are also working on adding zoned block
device specifications to virtio and implementing that support in qemu and
the kernel. Patches are floating around now but not yet merged. This
addresses the use of zoned storage in VM environments through virtio
interface instead of directly attaching devices to guests.

> To give some additional context to help me when you answer: I'm left
> wondering what, if any, role dm-zoned has to play moving forward given
> ZNS is "the future" (and yeah "the future" is now but...)?  E.g.: Does
> it make sense to stack dm-zoned ontop of dm-po2zoned!?

That is a lot to unfold in a small paragraph :)

First of all, I would rephrase "ZNS is the future" into "ZNS is a very
interesting alternative to generic NVMe SSDs". The reason being that HDD
are not dead, far from it. They still are way cheaper than SSDs in $/TB :)
So ZNS is not really in competition with SMR HDDs jere. The 2 are
complementary, exactly like regular SSDs are complementary to regular HDDs.

dm-zoned serves some use cases for SMR HDDs (see above) but does not
address ZNS (more on this below). And given that all SMR HDD on the market
today have a zone size that is a power of 2 number of LBAs (256MB zone
size is by far the most common), dm-po2zoned is not required at all for SMR.

Pankaj patch series is all about supporting ZNS devices that have a zone
size that is not a power of 2 number of LBAs as some vendors want to
produce such drives. There is no such move happening in the SMR world as
all users are happy with the current zone sizes which match the kernel
support (which currently requires power-of-2 number of LBAs for the zone
size).

I do not think we have yet reached a consensus on if we really want to
accept any zone size for zoned storage. I personally am not a big fan of
removing the existing constraint as that makes the code somewhat heavier
(multiplication & divisions instead of bit shifts) without introducing any
benefit to the user that I can see (or agree with). And there is also a
risk of forcing onto the users to redesign/change their code to support
different devices in the same system. That is never nice to fragment
support like this for the same device class. This is why several people,
including me, requested something like dm-po2zoned, to avoid breaking user
applications if non-power-of-2 zone size drives support is merged. Better
than nothing for sure, but not ideal either. That is only my opinion.
There are different opinions out there.

> Yet more context: When I'm asked to add full-blown support for
> dm-zoned to RHEL my gut is "please no, why!?".  And if we really
> should add dm-zoned is dm-po2zoned now also a requirement (to support
> non-power-of-2 ZNS devices in our never-ending engineering of "zoned
> storage" compatibility stop-gaps)?

Support for dm-zoned in RHEL really depends on if your customers need it.
Having SMR and ZNS block device (CONFIG_BLK_DEV_ZONED) and zonefs support
enabled would already cover a lot of use cases on their own, at least the
ones we see in the field today.

Going forward, we expect more use cases to rely on btrfs rather than
dm-zoned or any equivalent DM target for ZNS. And that can also include
non power of 2 zone size drives as btrfs should normally be able to handle
such devices, if the support for them is merged. But we are not there yet
with btrfs support, hence dm-po2zoned.

But again, that all depends on if Pankaj patch series is accepted, that
is, on everybody accepting that we lift the power-of-2 zone size constraint.
> In addition, it was my understanding that WDC had yet another zoned DM
> target called "dm-zap" that is for ZNS based devices... It's all a bit
> messy in my head (that's on me for not keeping up, but I think we need
> a recap!)

Since the ZNS specification does not define conventional zones, dm-zoned
cannot be used as a standalone DM target (read: single block device) with
NVMe zoned block devices. Furthermore, due to its block mapping scheme,
dm-zoned does not support devices with zones that have a capacity lower
than the zone size. So ZNS is really a big *no* for dm-zoned. dm-zap is a
prototype and in a nutshell is the equivalent of dm-zoned for ZNS. dm-zap
can deal with the smaller zone capacity and does not require conventional
zones. We are not trying to push for dm-zap to be merged for now as we are
still evaluating its potential use cases. We also have a different but
functionally equivalent approach implemented as a block device driver that
we are evaluating internally.

Given the above mentioned usage pattern we have seen so far for zoned
storage, it is not yet clear if something like dm-zap for ZNS is needed
beside some niche use cases.

> So please help me, and others, become more informed as quickly as
> possible! ;)

I hope the above helps. If you want me to develop further any of the
points above, feel free to let me know.

> ps. I'm asking all this in the open on various Linux mailing lists
> because it doesn't seem right to request a concall to inform only
> me... I think others may need similar "zoned storage" help.

All good with me :)

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
  2022-09-21 23:55     ` Damien Le Moal
@ 2022-09-22 11:53       ` Pankaj Raghav
  2022-09-22 19:37       ` Mike Snitzer
  2022-09-22 23:56       ` Bart Van Assche
  2 siblings, 0 replies; 24+ messages in thread
From: Pankaj Raghav @ 2022-09-22 11:53 UTC (permalink / raw)
  To: Damien Le Moal, Mike Snitzer, axboe, hch, Keith Busch
  Cc: agk, snitzer, bvanassche, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, Johannes.Thumshirn, jaegeuk,
	matias.bjorling

Thanks a lot Damien for the summary. Your feedback has made this series
much better.

> Pankaj patch series is all about supporting ZNS devices that have a zone
> size that is not a power of 2 number of LBAs as some vendors want to
> produce such drives. There is no such move happening in the SMR world as
> all users are happy with the current zone sizes which match the kernel
> support (which currently requires power-of-2 number of LBAs for the zone
> size).
> 
> I do not think we have yet reached a consensus on if we really want to
> accept any zone size for zoned storage. I personally am not a big fan of
> removing the existing constraint as that makes the code somewhat heavier
> (multiplication & divisions instead of bit shifts) without introducing any
> benefit to the user that I can see (or agree with). And there is also a
> risk of forcing onto the users to redesign/change their code to support
> different devices in the same system. That is never nice to fragment
> support like this for the same device class. This is why several people,
> including me, requested something like dm-po2zoned, to avoid breaking user
> applications if non-power-of-2 zone size drives support is merged. Better
> than nothing for sure, but not ideal either. That is only my opinion.
> There are different opinions out there.

I appreciate that you have explained the different perspectives. We have
covered this written and orally, and it seems to me that we have a good
coverage of the arguments in the list.

At this point, I would like to ask the opinion of Jens, Christoph and
Keith. Do you think we are missing anything in the series? Can this be
queued up for 6.1 (after I send the next version with a minor fix suggested
by Mike)?

--
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
  2022-09-21 23:55     ` Damien Le Moal
  2022-09-22 11:53       ` Pankaj Raghav
@ 2022-09-22 19:37       ` Mike Snitzer
  2022-09-22 21:49         ` Damien Le Moal
  2022-09-22 23:56       ` Bart Van Assche
  2 siblings, 1 reply; 24+ messages in thread
From: Mike Snitzer @ 2022-09-22 19:37 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Pankaj Raghav, agk, snitzer, axboe, hch, bvanassche, pankydev8,
	gost.dev, linux-kernel, linux-nvme, linux-block, dm-devel,
	Johannes.Thumshirn, jaegeuk, matias.bjorling

On Wed, Sep 21 2022 at  7:55P -0400,
Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:

> On 9/22/22 02:27, Mike Snitzer wrote:
> > On Tue, Sep 20 2022 at  5:11P -0400,
> > Pankaj Raghav <p.raghav@samsung.com> wrote:
> > 
> >> - Background and Motivation:
> >>
> >> The zone storage implementation in Linux, introduced since v4.10, first
> >> targetted SMR drives which have a power of 2 (po2) zone size alignment
> >> requirement. The po2 zone size was further imposed implicitly by the
> >> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
> >> across chunks beyond the specified size, since v3.16 through commit
> >> 762380ad9322 ("block: add notion of a chunk size for request merging").
> >> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
> >> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
> >> to be non-power-of-2").
> >>
> >> NAND, which is the media used in newer zoned storage devices, does not
> >> naturally align to po2. In these devices, zone capacity(cap) is not the
> >> same as the po2 zone size. When the zone cap != zone size, then unmapped
> >> LBAs are introduced to cover the space between the zone cap and zone size.
> >> po2 requirement does not make sense for these type of zone storage devices.
> >> This patch series aims to remove these unmapped LBAs for zoned devices when
> >> zone cap is npo2. This is done by relaxing the po2 zone size constraint
> >> in the kernel and allowing zoned device with npo2 zone sizes if zone cap
> >> == zone size.
> >>
> >> Removing the po2 requirement from zone storage should be possible
> >> now provided that no userspace regression and no performance regressions are
> >> introduced. Stop-gap patches have been already merged into f2fs-tools to
> >> proactively not allow npo2 zone sizes until proper support is added [1].
> >>
> >> There were two efforts previously to add support to npo2 devices: 1) via
> >> device level emulation [2] but that was rejected with a final conclusion
> >> to add support for non po2 zoned device in the complete stack[3] 2)
> >> adding support to the complete stack by removing the constraint in the
> >> block layer and NVMe layer with support to btrfs, zonefs, etc which was
> >> rejected with a conclusion to add a dm target for FS support [0]
> >> to reduce the regression impact.
> >>
> >> This series adds support to npo2 zoned devices in the block and nvme
> >> layer and a new **dm target** is added: dm-po2zoned-target. This new
> >> target will be initially used for filesystems such as btrfs and
> >> f2fs until native npo2 zone support is added.
> > 
> > As this patchset nears the point of being "ready for merge" and DM's
> > "zoned" oriented targets are multiplying, I need to understand: where
> > are we collectively going?  How long are we expecting to support the
> > "stop-gap zoned storage" layers we've constructed?
> > 
> > I know https://zonedstorage.io/docs/introduction exists... but it
> > _seems_ stale given the emergence of ZNS and new permutations of zoned
> > hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
> > still left wanting (e.g. "bring it all home for me!")...
> > 
> > Damien, as the most "zoned storage" oriented engineer I know, can you
> > please kick things off by shedding light on where Linux is now, and
> > where it's going, for "zoned storage"?
> 
> Let me first start with what we have seen so far with deployments in the
> field.

<snip>

Thanks for all your insights on zoned storage, very appreciated!

> > In addition, it was my understanding that WDC had yet another zoned DM
> > target called "dm-zap" that is for ZNS based devices... It's all a bit
> > messy in my head (that's on me for not keeping up, but I think we need
> > a recap!)
> 
> Since the ZNS specification does not define conventional zones, dm-zoned
> cannot be used as a standalone DM target (read: single block device) with
> NVMe zoned block devices. Furthermore, due to its block mapping scheme,
> dm-zoned does not support devices with zones that have a capacity lower
> than the zone size. So ZNS is really a big *no* for dm-zoned. dm-zap is a
> prototype and in a nutshell is the equivalent of dm-zoned for ZNS. dm-zap
> can deal with the smaller zone capacity and does not require conventional
> zones. We are not trying to push for dm-zap to be merged for now as we are
> still evaluating its potential use cases. We also have a different but
> functionally equivalent approach implemented as a block device driver that
> we are evaluating internally.
> 
> Given the above mentioned usage pattern we have seen so far for zoned
> storage, it is not yet clear if something like dm-zap for ZNS is needed
> beside some niche use cases.

OK, good to know.  I do think dm-zoned should be trained to _not_
allow use with ZNS NVMe devices (maybe that is in place and I just
missed it?).  Because there is some confusion with at least one
customer that is asserting dm-zoned is somehow enabling them to use
ZNS NVMe devices!

Maybe they somehow don't _need_ conventional zones (writes are handled
by some other layer? and dm-zoned access is confined to read only)!?
And might they also be using ZNS NVMe devices to do _not_ have a
zone capacity lower than the zone size?

Or maybe they are mistaken and we should ask more specific questions
of them?

> > So please help me, and others, become more informed as quickly as
> > possible! ;)
> 
> I hope the above helps. If you want me to develop further any of the
> points above, feel free to let me know.

You've been extremely helpful, thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
  2022-09-22 19:37       ` Mike Snitzer
@ 2022-09-22 21:49         ` Damien Le Moal
  0 siblings, 0 replies; 24+ messages in thread
From: Damien Le Moal @ 2022-09-22 21:49 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Pankaj Raghav, agk, snitzer, axboe, hch, bvanassche, pankydev8,
	gost.dev, linux-kernel, linux-nvme, linux-block, dm-devel,
	Johannes.Thumshirn, jaegeuk, matias.bjorling

On 9/23/22 04:37, Mike Snitzer wrote:
> On Wed, Sep 21 2022 at  7:55P -0400,
> Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:
> 
>> On 9/22/22 02:27, Mike Snitzer wrote:
>>> On Tue, Sep 20 2022 at  5:11P -0400,
>>> Pankaj Raghav <p.raghav@samsung.com> wrote:
>>>
>>>> - Background and Motivation:
>>>>
>>>> The zone storage implementation in Linux, introduced since v4.10, first
>>>> targetted SMR drives which have a power of 2 (po2) zone size alignment
>>>> requirement. The po2 zone size was further imposed implicitly by the
>>>> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
>>>> across chunks beyond the specified size, since v3.16 through commit
>>>> 762380ad9322 ("block: add notion of a chunk size for request merging").
>>>> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
>>>> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
>>>> to be non-power-of-2").
>>>>
>>>> NAND, which is the media used in newer zoned storage devices, does not
>>>> naturally align to po2. In these devices, zone capacity(cap) is not the
>>>> same as the po2 zone size. When the zone cap != zone size, then unmapped
>>>> LBAs are introduced to cover the space between the zone cap and zone size.
>>>> po2 requirement does not make sense for these type of zone storage devices.
>>>> This patch series aims to remove these unmapped LBAs for zoned devices when
>>>> zone cap is npo2. This is done by relaxing the po2 zone size constraint
>>>> in the kernel and allowing zoned device with npo2 zone sizes if zone cap
>>>> == zone size.
>>>>
>>>> Removing the po2 requirement from zone storage should be possible
>>>> now provided that no userspace regression and no performance regressions are
>>>> introduced. Stop-gap patches have been already merged into f2fs-tools to
>>>> proactively not allow npo2 zone sizes until proper support is added [1].
>>>>
>>>> There were two efforts previously to add support to npo2 devices: 1) via
>>>> device level emulation [2] but that was rejected with a final conclusion
>>>> to add support for non po2 zoned device in the complete stack[3] 2)
>>>> adding support to the complete stack by removing the constraint in the
>>>> block layer and NVMe layer with support to btrfs, zonefs, etc which was
>>>> rejected with a conclusion to add a dm target for FS support [0]
>>>> to reduce the regression impact.
>>>>
>>>> This series adds support to npo2 zoned devices in the block and nvme
>>>> layer and a new **dm target** is added: dm-po2zoned-target. This new
>>>> target will be initially used for filesystems such as btrfs and
>>>> f2fs until native npo2 zone support is added.
>>>
>>> As this patchset nears the point of being "ready for merge" and DM's
>>> "zoned" oriented targets are multiplying, I need to understand: where
>>> are we collectively going?  How long are we expecting to support the
>>> "stop-gap zoned storage" layers we've constructed?
>>>
>>> I know https://zonedstorage.io/docs/introduction exists... but it
>>> _seems_ stale given the emergence of ZNS and new permutations of zoned
>>> hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
>>> still left wanting (e.g. "bring it all home for me!")...
>>>
>>> Damien, as the most "zoned storage" oriented engineer I know, can you
>>> please kick things off by shedding light on where Linux is now, and
>>> where it's going, for "zoned storage"?
>>
>> Let me first start with what we have seen so far with deployments in the
>> field.
> 
> <snip>
> 
> Thanks for all your insights on zoned storage, very appreciated!
> 
>>> In addition, it was my understanding that WDC had yet another zoned DM
>>> target called "dm-zap" that is for ZNS based devices... It's all a bit
>>> messy in my head (that's on me for not keeping up, but I think we need
>>> a recap!)
>>
>> Since the ZNS specification does not define conventional zones, dm-zoned
>> cannot be used as a standalone DM target (read: single block device) with
>> NVMe zoned block devices. Furthermore, due to its block mapping scheme,
>> dm-zoned does not support devices with zones that have a capacity lower
>> than the zone size. So ZNS is really a big *no* for dm-zoned. dm-zap is a
>> prototype and in a nutshell is the equivalent of dm-zoned for ZNS. dm-zap
>> can deal with the smaller zone capacity and does not require conventional
>> zones. We are not trying to push for dm-zap to be merged for now as we are
>> still evaluating its potential use cases. We also have a different but
>> functionally equivalent approach implemented as a block device driver that
>> we are evaluating internally.
>>
>> Given the above mentioned usage pattern we have seen so far for zoned
>> storage, it is not yet clear if something like dm-zap for ZNS is needed
>> beside some niche use cases.
> 
> OK, good to know.  I do think dm-zoned should be trained to _not_
> allow use with ZNS NVMe devices (maybe that is in place and I just
> missed it?).  Because there is some confusion with at least one
> customer that is asserting dm-zoned is somehow enabling them to use
> ZNS NVMe devices!

dm-zoned checks for conventional zones and also that all zones have a zone
capacity that is equal to the zone size. The first point puts ZNS out but
a second regular drive can be used to emulate conventional zones. However,
the second point (zone cap < zone size) is pretty much a given with ZNS
and so rules it out.

If anything, we should also add a check on the max number of active zones,
which is also a limitation that ZNS drives have, unlike SMR drives. Since
dm-zoned does not handle active zones at all, any drive with a limit
should be excluded. I will send patches for that.
> 
> Maybe they somehow don't _need_ conventional zones (writes are handled
> by some other layer? and dm-zoned access is confined to read only)!?
> And might they also be using ZNS NVMe devices to do _not_ have a
> zone capacity lower than the zone size?

It is a possibility. Indeed, if the ZNS drive has:
1) zone capacity equal to zone size
2) a second regular drive is used to emulate conventional zones
3) no limit on the max number of active zones

Then dm-zoned will work just fine. But again, I seriously doubt that point
(3) holds. And we should check that upfront in dm-zoned ctr.

> Or maybe they are mistaken and we should ask more specific questions
> of them?

Getting the exact drive characteristics (zone size, capacity and zone
resource limits) will tell you if dm-zoned can work or not.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
  2022-09-21 23:55     ` Damien Le Moal
  2022-09-22 11:53       ` Pankaj Raghav
  2022-09-22 19:37       ` Mike Snitzer
@ 2022-09-22 23:56       ` Bart Van Assche
  2022-09-23  6:29         ` Matias Bjørling
  2 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2022-09-22 23:56 UTC (permalink / raw)
  To: Damien Le Moal, Mike Snitzer, Pankaj Raghav
  Cc: agk, snitzer, axboe, hch, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, Johannes.Thumshirn, jaegeuk,
	matias.bjorling

On 9/21/22 16:55, Damien Le Moal wrote:
> But again, that all depends on if Pankaj patch series is accepted, that
> is, on everybody accepting that we lift the power-of-2 zone size constraint.

The companies that are busy with implementing zoned storage for UFS 
devices are asking for kernel support for non-power-of-2 zone sizes.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
  2022-09-22 23:56       ` Bart Van Assche
@ 2022-09-23  6:29         ` Matias Bjørling
  2022-09-23 16:19           ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Matias Bjørling @ 2022-09-23  6:29 UTC (permalink / raw)
  To: Bart Van Assche, Damien Le Moal, Mike Snitzer, Pankaj Raghav
  Cc: agk, snitzer, axboe, hch, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, Johannes Thumshirn, jaegeuk

> -----Original Message-----
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Friday, 23 September 2022 01.56
> To: Damien Le Moal <damien.lemoal@opensource.wdc.com>; Mike Snitzer
> <snitzer@redhat.com>; Pankaj Raghav <p.raghav@samsung.com>
> Cc: agk@redhat.com; snitzer@kernel.org; axboe@kernel.dk; hch@lst.de;
> pankydev8@gmail.com; gost.dev@samsung.com; linux-kernel@vger.kernel.org;
> linux-nvme@lists.infradead.org; linux-block@vger.kernel.org; dm-
> devel@redhat.com; Johannes Thumshirn <Johannes.Thumshirn@wdc.com>;
> jaegeuk@kernel.org; Matias Bjørling <Matias.Bjorling@wdc.com>
> Subject: Re: Please further explain Linux's "zoned storage" roadmap [was: Re:
> [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone
> sizes]
> 
> On 9/21/22 16:55, Damien Le Moal wrote:
> > But again, that all depends on if Pankaj patch series is accepted,
> > that is, on everybody accepting that we lift the power-of-2 zone size
> constraint.
> 
> The companies that are busy with implementing zoned storage for UFS devices
> are asking for kernel support for non-power-of-2 zone sizes.
> 
> Thanks,
> 
> Bart.

Hi Bart,

With UFS, in the proposed copy I have (may been changed) - there's the concept of gap zones, which is zones that cannot be accessed by the host. The gap zones are essentially "LBA fillers", enabling the next writeable zone to start at a X * pow2 size offset. My understanding is that this specific approach was chosen to simplify standardization in UFS and avoid updating T10's ZBC with zone capacity support. 

While UFS would technically expose non-power of 2 zone sizes, they're also, due to the gap zones, could also be considered power of 2 zones if one considers the seq. write zone + the gap zone as a single unit. 

When I think about having UFS support in the kernel, the SWR and the gap zone could be represented as a single unit. For example:

UFS - Zone Report
  Zone 0: SWR, LBA 0-11
  Zone 1: Gap, LBA 12-15
  Zone 2: SWR, LBA 16-27
  Zone 3: Gap, LBA 28-31
  ...

Kernel representation - Zone Report (as supported today)
  Zone 0: SWR, LBA 0-15, Zone Capacity 12
  Zone 1: SWR, LBA 16-31, Zone Capacity 12
  ...

If doing it this way, it removes the need for filesystems, device-mappers, user-space applications having to understand gap zones, and allows UFS to work out of the box with no changes to the rest of the zoned storage eco-system. 

Has the above representation been considered?

Best, Matias

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
  2022-09-23  6:29         ` Matias Bjørling
@ 2022-09-23 16:19           ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2022-09-23 16:19 UTC (permalink / raw)
  To: Matias Bjørling, Damien Le Moal, Mike Snitzer, Pankaj Raghav
  Cc: agk, snitzer, axboe, hch, pankydev8, gost.dev, linux-kernel,
	linux-nvme, linux-block, dm-devel, Johannes Thumshirn, jaegeuk

On 9/22/22 23:29, Matias Bjørling wrote:
> With UFS, in the proposed copy I have (may been changed) - there's
> the concept of gap zones, which is zones that cannot be accessed by
> the host. The gap zones are essentially "LBA fillers", enabling the
> next writeable zone to start at a X * pow2 size offset. My
> understanding is that this specific approach was chosen to simplify
> standardization in UFS and avoid updating T10's ZBC with zone
> capacity support.
> 
> While UFS would technically expose non-power of 2 zone sizes, they're
> also, due to the gap zones, could also be considered power of 2 zones
> if one considers the seq. write zone + the gap zone as a single
> unit.
> 
> When I think about having UFS support in the kernel, the SWR and the
> gap zone could be represented as a single unit. For example:
> 
> UFS - Zone Report
>    Zone 0: SWR, LBA 0-11
>    Zone 1: Gap, LBA 12-15
>    Zone 2: SWR, LBA 16-27
>    Zone 3: Gap, LBA 28-31
>    ...
> 
> Kernel representation - Zone Report (as supported today)
>    Zone 0: SWR, LBA 0-15, Zone Capacity 12
>    Zone 1: SWR, LBA 16-31, Zone Capacity 12
>    ...
> 
> If doing it this way, it removes the need for filesystems,
> device-mappers, user-space applications having to understand gap
> zones, and allows UFS to work out of the box with no changes to the
> rest of the zoned storage eco-system.
> 
> Has the above representation been considered?

Hi Matias,

What has been described above is the approach from the first version of 
the zoned storage for UFS (ZUFS) draft standard. Support for this 
approach is available in the upstream kernel. See also "[PATCH v2 0/9] 
Support zoned devices with gap zones", 2022-04-21 
(https://lore.kernel.org/linux-scsi/20220421183023.3462291-1-bvanassche@acm.org/).

Since F2FS extents must be split at gap zones, gap zones negatively 
affect sequential read and write performance. So we abandoned the gap 
zone approach. The current approach is as follows:
* The power-of-two restriction for the offset between zone starts has 
been removed. Gap zones are no longer required. Hence, we will need the 
patches that add support for zone sizes that are not a power of two.
* The Sequential Write Required (SWR) and Sequential Write Preferred 
(SWP) zone types are supported. The feedback we received from UFS 
vendors is that which zone type works best depends on their firmware and 
ASIC design.
* We need a queue depth larger than one (QD > 1) for writes to achieve 
the full sequential write bandwidth. We plan to support QD > 1 as follows:
   - If writes have to be serialized, submit these to the same hardware
     queue. According to the UFS host controller interface (UFSHCI)
     standard, UFS host controllers are not allowed to reorder SCSI
     commands that are submitted to the same hardware queue. A source of
     command reordering that remains is the SCSI retry mechanism. Retries
     happen e.g. after a command timeout.
   - For SWP zones, require the UFS device firmware to use its garbage
     collection mechanism to reorder data in the unlikely case that
     out-of-order writes happened.
   - For SWR zones, retry writes that failed because these were received
     out-of-order by a UFS device. ZBC-1 requires compliant devices to
     respond with ILLEGAL REQUEST / UNALIGNED WRITE COMMAND to out-of-
     order writes.

We have considered the zone append approach but decided not to use it 
because if zone append commands get reordered the data ends up 
permanently out-of-order on the storage medium. This affects sequential 
read performance negatively.

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-09-23 16:21 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20220920091120eucas1p2c82c18f552d6298d24547cba2f70b7fc@eucas1p2.samsung.com>
2022-09-20  9:11 ` [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes Pankaj Raghav
     [not found]   ` <CGME20220920091121eucas1p26eed14714ff34e2489bc9adb40fd1250@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 01/13] block: make bdev_nr_zones and disk_zone_no generic for npo2 zone size Pankaj Raghav
     [not found]   ` <CGME20220920091122eucas1p2934bc26b6c11bdbafa7ebd3004ce72ee@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 02/13] block: rearrange bdev_{is_zoned,zone_sectors,get_queue} helper in blkdev.h Pankaj Raghav
     [not found]   ` <CGME20220920091123eucas1p26623d067c9a2d36d4f3b29b77c6d937a@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 03/13] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
     [not found]   ` <CGME20220920091124eucas1p1dc8fd9969ebd1839fca5a3c627a29b75@eucas1p1.samsung.com>
2022-09-20  9:11     ` [PATCH v14 04/13] nvmet: Allow ZNS target to support non-power_of_2 zone sizes Pankaj Raghav
     [not found]   ` <CGME20220920091125eucas1p2e994ea9a52fcbf50fd7242ca8949c179@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 05/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
     [not found]   ` <CGME20220920091126eucas1p2e2665221ea346937bff44c9fd6963928@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 06/13] null_blk: allow zoned devices with non power-of-2 zone sizes Pankaj Raghav
     [not found]   ` <CGME20220920091127eucas1p252a2edaae00ff379bc6f9fbcc2791487@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 07/13] zonefs: allow non power of 2 zoned devices Pankaj Raghav
     [not found]   ` <CGME20220920091128eucas1p2d9c9dcebdce4039360f92b0f577ab649@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 08/13] dm-zoned: ensure only power of 2 zone sizes are allowed Pankaj Raghav
     [not found]   ` <CGME20220920091129eucas1p26137d368bf7e2cfc2d585ce41f3cdc86@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 09/13] dm-zone: use generic helpers to calculate offset from zone start Pankaj Raghav
     [not found]   ` <CGME20220920091131eucas1p1ab4ef3ac98b2fd27c546a44b583a1745@eucas1p1.samsung.com>
2022-09-20  9:11     ` [PATCH v14 10/13] dm-table: allow zoned devices with non power-of-2 zone sizes Pankaj Raghav
     [not found]   ` <CGME20220920091132eucas1p2e98ce6f411f1c3e8e10c4eae81aba296@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 11/13] dm: call dm_zone_endio after the target endio callback for zoned devices Pankaj Raghav
     [not found]   ` <CGME20220920091133eucas1p1ce01c88d47ce934b2995bea6a6bc55df@eucas1p1.samsung.com>
2022-09-20  9:11     ` [PATCH v14 12/13] dm: introduce DM_EMULATED_ZONES target feature flag Pankaj Raghav
     [not found]   ` <CGME20220920091134eucas1p275585b70fab48ba1f19eb5d2cc515b6d@eucas1p2.samsung.com>
2022-09-20  9:11     ` [PATCH v14 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes Pankaj Raghav
2022-09-21 16:37       ` Mike Snitzer
2022-09-21 17:32         ` Pankaj Raghav
2022-09-21 17:27   ` Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes] Mike Snitzer
2022-09-21 23:55     ` Damien Le Moal
2022-09-22 11:53       ` Pankaj Raghav
2022-09-22 19:37       ` Mike Snitzer
2022-09-22 21:49         ` Damien Le Moal
2022-09-22 23:56       ` Bart Van Assche
2022-09-23  6:29         ` Matias Bjørling
2022-09-23 16:19           ` Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).