linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/16] support non power of 2 zoned devices
       [not found] <CGME20220427160256eucas1p2db2b58792ffc93026d870c260767da14@eucas1p2.samsung.com>
@ 2022-04-27 16:02 ` Pankaj Raghav
       [not found]   ` <CGME20220427160257eucas1p21fb58d0129376a135fdf0b9c2fe88895@eucas1p2.samsung.com>
                     ` (16 more replies)
  0 siblings, 17 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2"). NAND, which is the media used in newer zoned storage
devices, does not naturally align to po2, and so the po2 requirement
does not make sense for those type of zone storage devices.

Removing the po2 requirement from zone storage should therefore be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [0].
Additional kernel stop-gap patches are provided in this series for dm-zoned.
Support for npo2 zonefs and btrfs support is addressed in this series.

There was an effort previously [1] to add support to non po2 devices via
device level emulation but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[2].

- Patchset description:
This patchset aims at adding support to non power of 2 zoned devices in
the block layer, nvme layer, null blk and adds support to btrfs and
zonefs.

This round of patches **will not** support f2fs and DM layer for non
power of 2 zoned devices. More about this in the future work section.

Patches 1-4 deals with removing the po2 constraint from the
block layer. Note that the patches have been split for clarity and it can
be squashed together if the community feels that is better.

Patches 5-6 deals with removing the constraint from nvme zns.

Patches 7-11 adds support to btrfs for non po2 zoned devices.

Patch 12 removes the po2 constraint in ZoneFS

Patch 13 removes the po2 contraint in null blk

Patches 14-16 adds conditions to not allow non power of 2 devices in
f2fs and DM.

- Performance:
PO2 zone sizes utilizes log and shifts instead of division when
determing alignment, zone number, etc. The same math cannot be used when
using a zoned device with non po2 zone size. Hence, to avoid any performance
regression on zoned devices with po2 zone sizes, the optimized math in the
hot paths has been retained with branching.

The performance was measured using null blk for regression
and the results have been posted in the appropriate commit log. No
performance regression was noticed.

- Testing
With respect to testing we need to tackle two things: one for regression
on po2 zoned device and progression on non po2 zoned devices.

kdevops (https://github.com/mcgrof/kdevops) was extensively used to
automate the testing for blktests and (x)fstests for btrfs changes. The
known failures were excluded during the test based on the baseline
v5.17.0-rc7

-- regression
Emulated zoned device with zone size =128M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Changes were tested with the following profile in QEMU:
[btrfs_simple_zns]
TEST_DIR=<dir>
SCRATCH_MNT=<mnt>
FSTYP=btrfs
MKFS_OPTIONS="-f -d single -m single"
TEST_DEV=<dev>
SCRATCH_DEV_POOL=<dev-pool>

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests-nullblk.sh and zonefs-tests.sh from zonefs-tools were run
with no failures.

nullblk:
t/zbd/run-tests-against-nullb from fio was run with no failures.

F2FS and DM:
It was verified if f2fs and dm-zoned successfully mounts without any
error.

-- progression
Emulated zoned device with zone size = 96M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Same profile as po2 zone size was used.

Many tests in xfstests for btrfs included dm-flakey and some tests
required dm-linear. As they are not supported at the moment for non
po2 devices, those **tests were excluded for non po2 devices**.

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests.sh from zonefs-tools were run with no failures.

nullblk:
A new section was added to cover non po2 devices:

section14()
{
       conv_pcnt=10
       zone_size=3
       zone_capacity=3
       max_open=${set_max_open}
       zbd_test_opts+=("-o ${max_open}")
}
t/zbd/run-tests-against-nullb from fio was run with no failures.

F2FS and DM:
It was verified that f2fs and dm-zoned does not mount.

- Open issue:
* btrfs superblock location for zoned devices is expected to be in 0,
  512GB(mirror) and 4TB(mirror) in the device. Zoned devices with po2
  zone size will naturally align with these superblock location but non
  po2 devices will not align with 512GB and 4TB offset.

  The current approach for npo2 devices is to place the superblock mirror
  zones near   512GB and 4TB that is **aligned to the zone size**. This
  is of no issue for normal operation as we keep track where the superblock
  mirror are placed but this can cause an issue with recovery tools for
  zoned devices as they expect mirror superblock to be in 512GB and 4TB.

  Note that ATM, recovery tools such as `btrfs check` does not work for
  image dumps for zoned devices even for po2 zone sizes.

  I hope this issue could be discussed as a part of the BoF on ZNS
  during the upcoming LSFMM.

- Tools:
Some tools had to be updated to support non po2 devices. Once these
patches are accepted in the kernel, these tool updates will also be
upstreamed.
* btrfs-prog: https://github.com/Panky-codes/btrfs-progs/tree/remove-po2-btrfs
* blkzone: https://github.com/Panky-codes/util-linux/tree/remove-po2

- Future work
To reduce the amount of changes and testing, support for F2FS and DM was
excluded in this round of patches. The plan is to add support to them in
the forthcoming future.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[1] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[2] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Luis Chamberlain (4):
  nvmet: use blk_queue_zone_no()
  f2fs: call bdev_zone_sectors() only once on init_blkz_info()
  f2fs: ensure only power of 2 zone sizes are allowed
  dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
  block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2
    zsze
  block: add blk_queue_zone_aligned and bdev_zone_aligned helper
  block: add bdev_zone_no helper
  block: allow blk-zoned devices to have non-power-of-2 zone size
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
  btrfs: zoned: add generic btrfs helpers for zoned devices
  btrfs: zoned: Make sb_zone_number function non power of 2 compatible
  btrfs: zoned: use btrfs zone helpers to support non po2 zoned devices
  btrfs: zoned: relax the alignment constraint for zoned devices
  zonefs: allow non power of 2 zoned devices
  null_blk: allow non power of 2 zoned devices

 block/blk-core.c               |   3 +-
 block/blk-zoned.c              |  20 ++++--
 drivers/block/null_blk/main.c  |   4 +-
 drivers/block/null_blk/zoned.c |  14 ++--
 drivers/md/dm-zone.c           |  12 ++++
 drivers/nvme/host/zns.c        |  11 +---
 drivers/nvme/target/zns.c      |   2 +-
 fs/btrfs/volumes.c             |  24 +++----
 fs/btrfs/zoned.c               | 115 +++++++++++++++++----------------
 fs/btrfs/zoned.h               |  41 ++++++++++--
 fs/f2fs/super.c                |  15 +++--
 fs/zonefs/super.c              |   6 +-
 fs/zonefs/zonefs.h             |   1 -
 include/linux/blkdev.h         |  48 +++++++++++++-
 14 files changed, 208 insertions(+), 108 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
       [not found]   ` <CGME20220427160257eucas1p21fb58d0129376a135fdf0b9c2fe88895@eucas1p2.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-29 17:16       ` Adam Manzanares
                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-zoned.c      | 8 +++++++-
 include/linux/blkdev.h | 8 +++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 38cd840d8838..1dff4a8bd51d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
 unsigned int blkdev_nr_zones(struct gendisk *disk)
 {
 	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
+	sector_t capacity = get_capacity(disk);
 
 	if (!blk_queue_is_zoned(disk->queue))
 		return 0;
-	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return (capacity + zone_sectors - 1) >>
+		       ilog2(zone_sectors);
+
+	return div64_u64(capacity + zone_sectors - 1, zone_sectors);
 }
 EXPORT_SYMBOL_GPL(blkdev_nr_zones);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 60d016138997..c4e4c7071b7b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 					     sector_t sector)
 {
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+
 	if (!blk_queue_is_zoned(q))
 		return 0;
-	return sector >> ilog2(q->limits.chunk_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return sector >> ilog2(zone_sectors);
+
+	return div64_u64(sector, zone_sectors);
 }
 
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper
       [not found]   ` <CGME20220427160258eucas1p19548a7094f67b4c9f340add776f60082@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-27 23:52       ` Bart Van Assche
  2022-05-04 16:55       ` Hannes Reinecke
  0 siblings, 2 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Checking if a given sector is aligned to a zone is a very common
operation that is performed for zoned devices. Add
blk_queue_zone_aligned helper to check for this instead of opencoding it
everywhere.

The helper is made to be generic so that it can also check for alignment
for non non-power-of-2 zone size devices.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is done for devices with
power-of-2 zone size

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/blkdev.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c4e4c7071b7b..f8f2d2998afb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -676,6 +676,22 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 	return div64_u64(sector, zone_sectors);
 }
 
+static inline bool blk_queue_zone_aligned(struct request_queue *q, sector_t sec)
+{
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+	u64 remainder = 0;
+
+	if (!blk_queue_is_zoned(q))
+		return false;
+
+	if (is_power_of_2(zone_sectors))
+		return IS_ALIGNED(sec, zone_sectors);
+
+	div64_u64_rem(sec, zone_sectors, &remainder);
+	/* if there is a remainder, then the sector is not aligned */
+	return remainder == 0;
+}
+
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
 					 sector_t sector)
 {
@@ -722,6 +738,12 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 {
 	return 0;
 }
+
+static inline bool blk_queue_zone_aligned(struct request_queue *q, sector_t sec)
+{
+	return false;
+}
+
 static inline unsigned int queue_max_open_zones(const struct request_queue *q)
 {
 	return 0;
@@ -1361,6 +1383,15 @@ static inline sector_t bdev_zone_sectors(struct block_device *bdev)
 	return 0;
 }
 
+static inline bool bdev_zone_aligned(struct block_device *bdev, sector_t sec)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return blk_queue_zone_aligned(q, sec);
+	return false;
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 03/16] block: add bdev_zone_no helper
       [not found]   ` <CGME20220427160259eucas1p25aab0637fec229cd1140e6aa08678f38@eucas1p2.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-27 23:31       ` Damien Le Moal
                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Many places in the filesystem for zoned devices open code this function
to find the zone number for a given sector with power of 2 assumption.
This generic helper can be used to calculate zone number for a given
sector in a block device

This helper internally uses blk_queue_zone_no to find the zone number.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/blkdev.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f8f2d2998afb..55293e0a8702 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1392,6 +1392,15 @@ static inline bool bdev_zone_aligned(struct block_device *bdev, sector_t sec)
 	return false;
 }
 
+static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return blk_queue_zone_no(q, sec);
+	return 0;
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size
       [not found]   ` <CGME20220427160300eucas1p1470fe30535849de6204bb78d7083cb3a@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-27 23:37       ` Damien Le Moal
  2022-05-04 16:59       ` Hannes Reinecke
  0 siblings, 2 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Convert the calculations on zone size to be generic instead of relying on
power_of_2 based logic in the block layer using the helpers wherever
possible.

The only hot path affected by this change for power_of_2 zoned devices
is in blk_check_zone_append() but the effects should be negligible as the
helper blk_queue_zone_aligned() optimizes the calculation for those
devices. Note that the append path cannot be accessed by direct raw access
to the block device but only through a filesystem abstraction.

Finally, remove the check for power_of_2 zone size requirement in
blk-zoned.c

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-core.c  |  3 +--
 block/blk-zoned.c | 12 ++++++------
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 937bb6b86331..850caf311064 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (pos & (blk_queue_zone_sectors(q) - 1) ||
-	    !blk_queue_zone_is_seq(q, pos))
+	if (!blk_queue_zone_aligned(q, pos) || !blk_queue_zone_is_seq(q, pos))
 		return BLK_STS_IOERR;
 
 	/*
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1dff4a8bd51d..f7c7c3bd148d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -288,10 +288,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 		return -EINVAL;
 
 	/* Check alignment (handle eventual smaller last zone) */
-	if (sector & (zone_sectors - 1))
+	if (!blk_queue_zone_aligned(q, sector))
 		return -EINVAL;
 
-	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+	if (!blk_queue_zone_aligned(q, nr_sectors) && end_sector != capacity)
 		return -EINVAL;
 
 	/*
@@ -489,14 +489,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	 * smaller last zone.
 	 */
 	if (zone->start == 0) {
-		if (zone->len == 0 || !is_power_of_2(zone->len)) {
-			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
-				disk->disk_name, zone->len);
+		if (zone->len == 0) {
+			pr_warn("%s: Invalid zoned device size",
+				disk->disk_name);
 			return -ENODEV;
 		}
 
 		args->zone_sectors = zone->len;
-		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
 	} else if (zone->start + args->zone_sectors < capacity) {
 		if (zone->len != args->zone_sectors) {
 			pr_warn("%s: Invalid zoned device with non constant zone size\n",
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
       [not found]   ` <CGME20220427160301eucas1p147d0dced70946e20dd2dd046b94b8224@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-29 17:23       ` Adam Manzanares
                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/zns.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 9f81beb4df4e..2087de0768ee 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	}
 
 	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
-	if (!is_power_of_2(ns->zsze)) {
-		dev_warn(ns->ctrl->device,
-			"invalid zone size:%llu for namespace:%u\n",
-			ns->zsze, ns->head->ns_id);
-		status = -ENODEV;
-		goto free_data;
-	}
 
 	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 				   sizeof(struct nvme_zone_descriptor);
 
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+			 div64_u64(get_capacity(ns->disk), ns->zsze));
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -197,7 +190,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	sector = rounddown(sector, ns->zsze);
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 06/16] nvmet: use blk_queue_zone_no()
       [not found]   ` <CGME20220427160302eucas1p1aaba7a309778d3440c3315ad899e4035@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-29 17:27       ` Adam Manzanares
                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

Instead of open coding the number of zones given a sector, use the helper
blk_queue_zone_no(). This let's us make modifications to the math if
needed in one place and adds now support for npo2 zone devices.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/target/zns.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index e34718b09550..e41b6a6ef048 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -243,7 +243,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
 	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
 
 	return blkdev_nr_zones(req->ns->bdev->bd_disk) -
-		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
+	       bdev_zone_no(req->ns->bdev, sect);
 }
 
 static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 07/16] btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
       [not found]   ` <CGME20220427160303eucas1p1c7d1b743e9ecf77b4f203bdeccbe382e@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Instead of calculating the superblock location every time, cache the
superblock zone location in btrfs_zoned_device_info struct and use it to
locate the zone index.

The functions such as btrfs_sb_log_location_bdev() and
btrfs_reset_sb_log_zones() which work directly on block_device shall
continue to use the sb_zone_number because btrfs_zoned_device_info
struct might not have been initialized at that point.

This patch will enable non power-of-2 zoned devices to not perform
division to lookup superblock and its mirror location.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.c | 13 +++++++++----
 fs/btrfs/zoned.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1b1b310c3c51..6f76942d0ea5 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -512,6 +512,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 			   max_active_zones - nactive);
 	}
 
+	/* Cache the sb zone number */
+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
+		zone_info->sb_zone_location[i] =
+			sb_zone_number(zone_info->zone_size_shift, i);
+	}
 	/* Validate superblock log */
 	nr_zones = BTRFS_NR_SB_LOG_ZONES;
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
@@ -519,7 +524,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 		u64 sb_wp;
 		int sb_pos = BTRFS_NR_SB_LOG_ZONES * i;
 
-		sb_zone = sb_zone_number(zone_info->zone_size_shift, i);
+		sb_zone = zone_info->sb_zone_location[i];
 		if (sb_zone + 1 >= zone_info->nr_zones)
 			continue;
 
@@ -867,7 +872,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
 		return 0;
 	}
 
-	zone_num = sb_zone_number(zinfo->zone_size_shift, mirror);
+	zone_num = zinfo->sb_zone_location[mirror];
 	if (zone_num + 1 >= zinfo->nr_zones)
 		return -ENOENT;
 
@@ -884,7 +889,7 @@ static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
 	if (!zinfo)
 		return false;
 
-	zone_num = sb_zone_number(zinfo->zone_size_shift, mirror);
+	zone_num = zinfo->sb_zone_location[mirror];
 	if (zone_num + 1 >= zinfo->nr_zones)
 		return false;
 
@@ -1012,7 +1017,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 			u32 sb_zone;
 			u64 sb_pos;
 
-			sb_zone = sb_zone_number(shift, i);
+			sb_zone = zinfo->sb_zone_location[i];
 			if (!(end <= sb_zone ||
 			      sb_zone + BTRFS_NR_SB_LOG_ZONES <= begin)) {
 				have_sb = true;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index cbf016a7bb5d..49317524e9a6 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -31,6 +31,7 @@ struct btrfs_zoned_device_info {
 	unsigned long *active_zones;
 	struct blk_zone *zone_cache;
 	struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX];
+	u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
 };
 
 #ifdef CONFIG_BLK_DEV_ZONED
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 08/16] btrfs: zoned: add generic btrfs helpers for zoned devices
       [not found]   ` <CGME20220427160304eucas1p1a0080df82f76c39882c4298c3c3d99fd@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Add helpers to calculate alignment, round up and round down
for zoned devices. These helpers encapsulates the necessary handling for
power_of_2 and non-power_of_2 zone sizes. Optimized calculations are
performed for zone sizes that are power_of_2 with log and shifts.

btrfs_zoned_is_aligned() is added instead of reusing bdev_zone_aligned()
helper is due to some use cases in btrfs where zone alignment is checked
before having access to the underlying block device such as in this
function: btrfs_load_block_group_zone_info().

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 49317524e9a6..b9c435961361 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -9,6 +9,7 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "btrfs_inode.h"
+#include "misc.h"
 
 /*
  * Block groups with more than this value (percents) of unusable space will be
@@ -34,6 +35,33 @@ struct btrfs_zoned_device_info {
 	u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
 };
 
+static inline bool btrfs_zoned_is_aligned(u64 pos, u64 zone_size)
+{
+	u64 remainder = 0;
+
+	if (is_power_of_two_u64(zone_size))
+		return IS_ALIGNED(pos, zone_size);
+
+	div64_u64_rem(pos, zone_size, &remainder);
+	return remainder == 0;
+}
+
+static inline u64 btrfs_zoned_roundup(u64 pos, u64 zone_size)
+{
+	if (is_power_of_two_u64(zone_size))
+		return ALIGN(pos, zone_size);
+
+	return roundup(pos, zone_size);
+}
+
+static inline u64 btrfs_zoned_rounddown(u64 pos, u64 zone_size)
+{
+	if (is_power_of_two_u64(zone_size))
+		return round_down(pos, zone_size);
+
+	return rounddown(pos, zone_size);
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 09/16] btrfs: zoned: Make sb_zone_number function non power of 2 compatible
       [not found]   ` <CGME20220427160305eucas1p26831c19df0b2097e42209edcf73526b7@eucas1p2.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Make the calculation in sb_zone_number function to be generic and work
for both power-of-2 and non power-of-2 zone sizes.

The function signature has been modified to take block device and mirror
as input as this function is only invoked from callers that have access
to the block device. This enables to use the generic bdev_zone_no
function provided by the block layer to calculate the zone number.

Even though division is used to calculate the zone index for non
power-of-2 zone sizes, this function will not be used in the fast path as
the sb_zone_location cache is used for the superblock zone location.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 6f76942d0ea5..8f574a474420 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -34,9 +34,6 @@
 #define BTRFS_SB_LOG_FIRST_OFFSET	(512ULL * SZ_1G)
 #define BTRFS_SB_LOG_SECOND_OFFSET	(4096ULL * SZ_1G)
 
-#define BTRFS_SB_LOG_FIRST_SHIFT	const_ilog2(BTRFS_SB_LOG_FIRST_OFFSET)
-#define BTRFS_SB_LOG_SECOND_SHIFT	const_ilog2(BTRFS_SB_LOG_SECOND_OFFSET)
-
 /* Number of superblock log zones */
 #define BTRFS_NR_SB_LOG_ZONES 2
 
@@ -153,15 +150,23 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
 /*
  * Get the first zone number of the superblock mirror
  */
-static inline u32 sb_zone_number(int shift, int mirror)
+static inline u32 sb_zone_number(struct block_device *bdev, int mirror)
 {
 	u64 zone;
 
 	ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX);
 	switch (mirror) {
-	case 0: zone = 0; break;
-	case 1: zone = 1ULL << (BTRFS_SB_LOG_FIRST_SHIFT - shift); break;
-	case 2: zone = 1ULL << (BTRFS_SB_LOG_SECOND_SHIFT - shift); break;
+	case 0:
+		zone = 0;
+		break;
+	case 1:
+		zone = bdev_zone_no(bdev,
+				    BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT);
+		break;
+	case 2:
+		zone = bdev_zone_no(bdev,
+				    BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT);
+		break;
 	}
 
 	ASSERT(zone <= U32_MAX);
@@ -515,7 +520,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	/* Cache the sb zone number */
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
 		zone_info->sb_zone_location[i] =
-			sb_zone_number(zone_info->zone_size_shift, i);
+			sb_zone_number(bdev, i);
 	}
 	/* Validate superblock log */
 	nr_zones = BTRFS_NR_SB_LOG_ZONES;
@@ -840,7 +845,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 	nr_sectors = bdev_nr_sectors(bdev);
 	nr_zones = nr_sectors >> zone_sectors_shift;
 
-	sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
+	sb_zone = sb_zone_number(bdev, mirror);
 	if (sb_zone + 1 >= nr_zones)
 		return -ENOENT;
 
@@ -964,7 +969,7 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
 	nr_sectors = bdev_nr_sectors(bdev);
 	nr_zones = nr_sectors >> zone_sectors_shift;
 
-	sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
+	sb_zone = sb_zone_number(bdev, mirror);
 	if (sb_zone + 1 >= nr_zones)
 		return -ENOENT;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 10/16] btrfs: zoned: use btrfs zone helpers to support non po2 zoned devices
       [not found]   ` <CGME20220427160306eucas1p10514a8597007ed9d5e269d659df58d35@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Use the generic btrfs zone helpers to calculate zone index, check zone
alignment, round up and round down operations.

The zone_size_shift field is not needed anymore as generic helpers are
used for calculation.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/volumes.c | 24 +++++++++-------
 fs/btrfs/zoned.c   | 72 ++++++++++++++++++++++------------------------
 fs/btrfs/zoned.h   | 12 ++++----
 3 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a8cc736731fd..0c6d1600d8b3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1421,7 +1421,7 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
 		 * allocator, because we anyway use/reserve the first two zones
 		 * for superblock logging.
 		 */
-		return ALIGN(start, device->zone_info->zone_size);
+		return btrfs_zoned_roundup(start, device->zone_info->zone_size);
 	default:
 		BUG();
 	}
@@ -1436,7 +1436,7 @@ static bool dev_extent_hole_check_zoned(struct btrfs_device *device,
 	int ret;
 	bool changed = false;
 
-	ASSERT(IS_ALIGNED(*hole_start, zone_size));
+	ASSERT(btrfs_zoned_is_aligned(*hole_start, zone_size));
 
 	while (*hole_size > 0) {
 		pos = btrfs_find_allocatable_zones(device, *hole_start,
@@ -1573,7 +1573,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	search_start = dev_extent_search_start(device, search_start);
 
 	WARN_ON(device->zone_info &&
-		!IS_ALIGNED(num_bytes, device->zone_info->zone_size));
+		!btrfs_zoned_is_aligned(num_bytes, device->zone_info->zone_size));
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -5131,8 +5131,8 @@ static void init_alloc_chunk_ctl_policy_zoned(
 
 	ctl->max_stripe_size = zone_size;
 	if (type & BTRFS_BLOCK_GROUP_DATA) {
-		ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
-						 zone_size);
+		ctl->max_chunk_size = btrfs_zoned_rounddown(
+			BTRFS_MAX_DATA_CHUNK_SIZE, zone_size);
 	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
 		ctl->max_chunk_size = ctl->max_stripe_size;
 	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
@@ -5144,9 +5144,10 @@ static void init_alloc_chunk_ctl_policy_zoned(
 	}
 
 	/* We don't want a chunk larger than 10% of writable space */
-	limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1),
-			       zone_size),
-		    min_chunk_size);
+	limit = max(
+		btrfs_zoned_rounddown(div_factor(fs_devices->total_rw_bytes, 1),
+				      zone_size),
+		min_chunk_size);
 	ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
 	ctl->dev_extent_min = zone_size * ctl->dev_stripes;
 }
@@ -6725,7 +6726,8 @@ static void submit_stripe_bio(struct btrfs_io_context *bioc, struct bio *bio,
 	 */
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 		if (btrfs_dev_is_sequential(dev, physical)) {
-			u64 zone_start = round_down(physical, fs_info->zone_size);
+			u64 zone_start = btrfs_zoned_rounddown(physical,
+							fs_info->zone_size);
 
 			bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
 		} else {
@@ -8119,8 +8121,8 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	if (dev->zone_info) {
 		u64 zone_size = dev->zone_info->zone_size;
 
-		if (!IS_ALIGNED(physical_offset, zone_size) ||
-		    !IS_ALIGNED(physical_len, zone_size)) {
+		if (!btrfs_zoned_is_aligned(physical_offset, zone_size) ||
+		    !btrfs_zoned_is_aligned(physical_len, zone_size)) {
 			btrfs_err(fs_info,
 "zoned: dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
 				  devid, physical_offset, physical_len);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 8f574a474420..8f3f542e174c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -177,13 +177,13 @@ static inline u32 sb_zone_number(struct block_device *bdev, int mirror)
 static inline sector_t zone_start_sector(u32 zone_number,
 					 struct block_device *bdev)
 {
-	return (sector_t)zone_number << ilog2(bdev_zone_sectors(bdev));
+	return zone_number * bdev_zone_sectors(bdev);
 }
 
 static inline u64 zone_start_physical(u32 zone_number,
 				      struct btrfs_zoned_device_info *zone_info)
 {
-	return (u64)zone_number << zone_info->zone_size_shift;
+	return zone_number * zone_info->zone_size;
 }
 
 /*
@@ -236,8 +236,8 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 	if (zinfo->zone_cache) {
 		unsigned int i;
 
-		ASSERT(IS_ALIGNED(pos, zinfo->zone_size));
-		zno = pos >> zinfo->zone_size_shift;
+		ASSERT(btrfs_zoned_is_aligned(pos, zinfo->zone_size));
+		zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 		/*
 		 * We cannot report zones beyond the zone end. So, it is OK to
 		 * cap *nr_zones to at the end.
@@ -410,9 +410,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	}
 
 	nr_sectors = bdev_nr_sectors(bdev);
-	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
-	zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
-	if (!IS_ALIGNED(nr_sectors, zone_sectors))
+	zone_info->nr_zones = bdev_zone_no(bdev, nr_sectors);
+	if (!btrfs_zoned_is_aligned(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
 
 	max_active_zones = queue_max_active_zones(queue);
@@ -824,10 +823,8 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 			       u64 *bytenr_ret)
 {
 	struct blk_zone zones[BTRFS_NR_SB_LOG_ZONES];
-	sector_t zone_sectors;
 	u32 sb_zone;
 	int ret;
-	u8 zone_sectors_shift;
 	sector_t nr_sectors;
 	u32 nr_zones;
 
@@ -838,12 +835,10 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 
 	ASSERT(rw == READ || rw == WRITE);
 
-	zone_sectors = bdev_zone_sectors(bdev);
-	if (!is_power_of_2(zone_sectors))
+	if (!is_power_of_2(bdev_zone_sectors(bdev)))
 		return -EINVAL;
-	zone_sectors_shift = ilog2(zone_sectors);
 	nr_sectors = bdev_nr_sectors(bdev);
-	nr_zones = nr_sectors >> zone_sectors_shift;
+	nr_zones = bdev_zone_no(bdev, nr_sectors);
 
 	sb_zone = sb_zone_number(bdev, mirror);
 	if (sb_zone + 1 >= nr_zones)
@@ -960,14 +955,12 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
 {
 	sector_t zone_sectors;
 	sector_t nr_sectors;
-	u8 zone_sectors_shift;
 	u32 sb_zone;
 	u32 nr_zones;
 
 	zone_sectors = bdev_zone_sectors(bdev);
-	zone_sectors_shift = ilog2(zone_sectors);
 	nr_sectors = bdev_nr_sectors(bdev);
-	nr_zones = nr_sectors >> zone_sectors_shift;
+	nr_zones = bdev_zone_no(bdev, nr_sectors);
 
 	sb_zone = sb_zone_number(bdev, mirror);
 	if (sb_zone + 1 >= nr_zones)
@@ -993,18 +986,17 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 				 u64 hole_end, u64 num_bytes)
 {
 	struct btrfs_zoned_device_info *zinfo = device->zone_info;
-	const u8 shift = zinfo->zone_size_shift;
-	u64 nzones = num_bytes >> shift;
+	u64 nzones = bdev_zone_no(device->bdev, num_bytes >> SECTOR_SHIFT);
 	u64 pos = hole_start;
 	u64 begin, end;
 	bool have_sb;
 	int i;
 
-	ASSERT(IS_ALIGNED(hole_start, zinfo->zone_size));
-	ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+	ASSERT(btrfs_zoned_is_aligned(hole_start, zinfo->zone_size));
+	ASSERT(btrfs_zoned_is_aligned(num_bytes, zinfo->zone_size));
 
 	while (pos < hole_end) {
-		begin = pos >> shift;
+		begin = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 		end = begin + nzones;
 
 		if (end > zinfo->nr_zones)
@@ -1036,8 +1028,9 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 			if (!(pos + num_bytes <= sb_pos ||
 			      sb_pos + BTRFS_SUPER_INFO_SIZE <= pos)) {
 				have_sb = true;
-				pos = ALIGN(sb_pos + BTRFS_SUPER_INFO_SIZE,
-					    zinfo->zone_size);
+				pos = btrfs_zoned_roundup(
+					sb_pos + BTRFS_SUPER_INFO_SIZE,
+					zinfo->zone_size);
 				break;
 			}
 		}
@@ -1051,7 +1044,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 static bool btrfs_dev_set_active_zone(struct btrfs_device *device, u64 pos)
 {
 	struct btrfs_zoned_device_info *zone_info = device->zone_info;
-	unsigned int zno = (pos >> zone_info->zone_size_shift);
+	unsigned int zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 
 	/* We can use any number of zones */
 	if (zone_info->max_active_zones == 0)
@@ -1073,7 +1066,7 @@ static bool btrfs_dev_set_active_zone(struct btrfs_device *device, u64 pos)
 static void btrfs_dev_clear_active_zone(struct btrfs_device *device, u64 pos)
 {
 	struct btrfs_zoned_device_info *zone_info = device->zone_info;
-	unsigned int zno = (pos >> zone_info->zone_size_shift);
+	unsigned int zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 
 	/* We can use any number of zones */
 	if (zone_info->max_active_zones == 0)
@@ -1109,14 +1102,14 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
 {
 	struct btrfs_zoned_device_info *zinfo = device->zone_info;
-	const u8 shift = zinfo->zone_size_shift;
-	unsigned long begin = start >> shift;
-	unsigned long end = (start + size) >> shift;
+	unsigned long begin = bdev_zone_no(device->bdev, start >> SECTOR_SHIFT);
+	unsigned long end =
+		bdev_zone_no(device->bdev, (start + size) >> SECTOR_SHIFT);
 	u64 pos;
 	int ret;
 
-	ASSERT(IS_ALIGNED(start, zinfo->zone_size));
-	ASSERT(IS_ALIGNED(size, zinfo->zone_size));
+	ASSERT(btrfs_zoned_is_aligned(start, zinfo->zone_size));
+	ASSERT(btrfs_zoned_is_aligned(size, zinfo->zone_size));
 
 	if (end > zinfo->nr_zones)
 		return -ERANGE;
@@ -1140,8 +1133,9 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
 		/* Free regions should be empty */
 		btrfs_warn_in_rcu(
 			device->fs_info,
-		"zoned: resetting device %s (devid %llu) zone %llu for allocation",
-			rcu_str_deref(device->name), device->devid, pos >> shift);
+			"zoned: resetting device %s (devid %llu) zone %u for allocation",
+			rcu_str_deref(device->name), device->devid,
+			bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT));
 		WARN_ON_ONCE(1);
 
 		ret = btrfs_reset_device_zone(device, pos, zinfo->zone_size,
@@ -1238,7 +1232,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		return 0;
 
 	/* Sanity check */
-	if (!IS_ALIGNED(length, fs_info->zone_size)) {
+	if (!btrfs_zoned_is_aligned(length, fs_info->zone_size)) {
 		btrfs_err(fs_info,
 		"zoned: block group %llu len %llu unaligned to zone size %llu",
 			  logical, length, fs_info->zone_size);
@@ -1326,7 +1320,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		 * The group is mapped to a sequential zone. Get the zone write
 		 * pointer to determine the allocation offset within the zone.
 		 */
-		WARN_ON(!IS_ALIGNED(physical[i], fs_info->zone_size));
+		WARN_ON(!btrfs_zoned_is_aligned(physical[i], fs_info->zone_size));
 		nofs_flag = memalloc_nofs_save();
 		ret = btrfs_get_dev_zone(device, physical[i], &zone);
 		memalloc_nofs_restore(nofs_flag);
@@ -1352,10 +1346,12 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		switch (zone.cond) {
 		case BLK_ZONE_COND_OFFLINE:
 		case BLK_ZONE_COND_READONLY:
-			btrfs_err(fs_info,
-		"zoned: offline/readonly zone %llu on device %s (devid %llu)",
-				  physical[i] >> device->zone_info->zone_size_shift,
-				  rcu_str_deref(device->name), device->devid);
+			btrfs_err(
+				fs_info,
+				"zoned: offline/readonly zone %u on device %s (devid %llu)",
+				bdev_zone_no(device->bdev,
+					     physical[i] >> SECTOR_SHIFT),
+				rcu_str_deref(device->name), device->devid);
 			alloc_offsets[i] = WP_MISSING_DEV;
 			break;
 		case BLK_ZONE_COND_EMPTY:
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index b9c435961361..c578bdb6bf2f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -23,7 +23,6 @@ struct btrfs_zoned_device_info {
 	 * zoned block device.
 	 */
 	u64 zone_size;
-	u8  zone_size_shift;
 	u32 nr_zones;
 	unsigned int max_active_zones;
 	atomic_t active_zones_left;
@@ -274,7 +273,8 @@ static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 	if (!zone_info)
 		return false;
 
-	return test_bit(pos >> zone_info->zone_size_shift, zone_info->seq_zones);
+	return test_bit(bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT),
+			zone_info->seq_zones);
 }
 
 static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
@@ -284,7 +284,8 @@ static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
 	if (!zone_info)
 		return true;
 
-	return test_bit(pos >> zone_info->zone_size_shift, zone_info->empty_zones);
+	return test_bit(bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT),
+			zone_info->empty_zones);
 }
 
 static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
@@ -296,7 +297,7 @@ static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
 	if (!zone_info)
 		return;
 
-	zno = pos >> zone_info->zone_size_shift;
+	zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 	if (set)
 		set_bit(zno, zone_info->empty_zones);
 	else
@@ -350,7 +351,8 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
 		return false;
 
 	zone_size = device->zone_info->zone_size;
-	if (!IS_ALIGNED(physical, zone_size) || !IS_ALIGNED(length, zone_size))
+	if (!btrfs_zoned_is_aligned(physical, zone_size) ||
+	    !btrfs_zoned_is_aligned(length, zone_size))
 		return false;
 
 	return true;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 11/16] btrfs: zoned: relax the alignment constraint for zoned devices
       [not found]   ` <CGME20220427160307eucas1p229f9ebae38fcca9974909799e5e63ccf@eucas1p2.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Checks were in place to return error when a non power-of-2 zoned devices
is detected. Remove those checks as non power-of-2 zoned devices are
now supported.

Relax the zone size constraint to align with the BTRFS_STRIPE_LEN(64k)
so that block groups are aligned to the BTRFS_STRIPE_LEN.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 8f3f542e174c..3ed085762f14 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -395,8 +395,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 		zone_sectors = bdev_zone_sectors(bdev);
 	}
 
-	/* Check if it's power of 2 (see is_power_of_2) */
-	ASSERT(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
+	ASSERT(zone_sectors != 0 && IS_ALIGNED(zone_sectors, BTRFS_STRIPE_LEN));
 	zone_info->zone_size = zone_sectors << SECTOR_SHIFT;
 
 	/* We reject devices with a zone size larger than 8GB */
@@ -835,9 +834,11 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 
 	ASSERT(rw == READ || rw == WRITE);
 
-	if (!is_power_of_2(bdev_zone_sectors(bdev)))
-		return -EINVAL;
 	nr_sectors = bdev_nr_sectors(bdev);
+
+	if (!IS_ALIGNED(nr_sectors, BTRFS_STRIPE_LEN))
+		return -EINVAL;
+
 	nr_zones = bdev_zone_no(bdev, nr_sectors);
 
 	sb_zone = sb_zone_number(bdev, mirror);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 12/16] zonefs: allow non power of 2 zoned devices
       [not found]   ` <CGME20220427160309eucas1p2f677c8db581616f994473f17c4a5bd44@eucas1p2.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-27 23:39       ` Damien Le Moal
  0 siblings, 1 reply; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/zonefs/super.c  | 6 ++----
 fs/zonefs/zonefs.h | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 3614c7834007..5422be2ca570 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -401,10 +401,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
-	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	unsigned int noio_flag;
 	unsigned int nr_zones =
-		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
 	struct zonefs_ioerr_data err = {
 		.inode = inode,
 		.write = write,
@@ -1300,7 +1299,7 @@ static void zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 
-	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
 	inode->i_mode = S_IFREG | sbi->s_perm;
 
 	zi->i_ztype = type;
@@ -1647,7 +1646,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	 * interface constraints.
 	 */
 	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
-	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
 	sbi->s_uid = GLOBAL_ROOT_UID;
 	sbi->s_gid = GLOBAL_ROOT_GID;
 	sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 7b147907c328..2d5ea3be3a8e 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -175,7 +175,6 @@ struct zonefs_sb_info {
 	kgid_t			s_gid;
 	umode_t			s_perm;
 	uuid_t			s_uuid;
-	unsigned int		s_zone_sectors_shift;
 
 	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 13/16] null_blk: allow non power of 2 zoned devices
       [not found]   ` <CGME20220427160310eucas1p28cd3c5ff4fb7a04bc77c4c0b9d96bb74@eucas1p2.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-29 17:30       ` Adam Manzanares
                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

Convert the power of 2 based calculation with zone size to be generic in
null_zone_no with optimization for power of 2 based zone sizes.

The nr_zones calculation in null_init_zoned_dev has been replaced with a
division without special handling for power of 2 based zone sizes as
this function is called only during the initialization and will not
invoked in the hot path.

Performance Measurement:

Device:
zone size = 128M, blocksize=4k

FIO cmd:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
x-----------------x---------------------------------x---------------------------------x

Random read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
x-----------------x---------------------------------x---------------------------------x

Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c  |  4 ++--
 drivers/block/null_blk/zoned.c | 14 +++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index c441a4972064..82a62b543782 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1931,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
 		dev->mbps = 0;
 
 	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	    (!dev->zone_size)) {
+		pr_err("zone_size must not be zero\n");
 		return -EINVAL;
 	}
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index dae54dd1aeac..00c34e65ef0a 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -13,7 +13,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
 
 static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 {
-	return sect >> ilog2(dev->zone_size_sects);
+	if (is_power_of_2(dev->zone_size_sects))
+		return sect >> ilog2(dev->zone_size_sects);
+
+	return div64_u64(sect, dev->zone_size_sects);
 }
 
 static inline void null_lock_zone_res(struct nullb_device *dev)
@@ -62,10 +65,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	sector_t sector = 0;
 	unsigned int i;
 
-	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("zone_size must be power-of-two\n");
-		return -EINVAL;
-	}
 	if (dev->zone_size > dev->size) {
 		pr_err("Zone size larger than device capacity\n");
 		return -EINVAL;
@@ -83,8 +82,9 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
-	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
-		>> ilog2(dev->zone_size_sects);
+	dev->nr_zones =
+		div64_u64(roundup(dev_capacity_sects, dev->zone_size_sects),
+			  dev->zone_size_sects);
 
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 14/16] f2fs: call bdev_zone_sectors() only once on init_blkz_info()
       [not found]   ` <CGME20220427160311eucas1p151141fc73adc590b40ad6f935b1ac214@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-05-03 20:04       ` Jaegeuk Kim
  0 siblings, 1 reply; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

Instead of calling bdev_zone_sectors() multiple times, call
it once and cache the value locally. This will make the
subsequent change easier to read.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/f2fs/super.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index ea939db18f88..f64761a15df7 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -3678,22 +3678,25 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
 	struct block_device *bdev = FDEV(devi).bdev;
 	sector_t nr_sectors = bdev_nr_sectors(bdev);
 	struct f2fs_report_zones_args rep_zone_arg;
+	u64 zone_sectors;
 	int ret;
 
 	if (!f2fs_sb_has_blkzoned(sbi))
 		return 0;
 
+	zone_sectors = bdev_zone_sectors(bdev);
+
 	if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
-				SECTOR_TO_BLOCK(bdev_zone_sectors(bdev)))
+				SECTOR_TO_BLOCK(zone_sectors))
 		return -EINVAL;
-	sbi->blocks_per_blkz = SECTOR_TO_BLOCK(bdev_zone_sectors(bdev));
+	sbi->blocks_per_blkz = SECTOR_TO_BLOCK(zone_sectors);
 	if (sbi->log_blocks_per_blkz && sbi->log_blocks_per_blkz !=
 				__ilog2_u32(sbi->blocks_per_blkz))
 		return -EINVAL;
 	sbi->log_blocks_per_blkz = __ilog2_u32(sbi->blocks_per_blkz);
 	FDEV(devi).nr_blkz = SECTOR_TO_BLOCK(nr_sectors) >>
 					sbi->log_blocks_per_blkz;
-	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
+	if (nr_sectors & (zone_sectors - 1))
 		FDEV(devi).nr_blkz++;
 
 	FDEV(devi).blkz_seq = f2fs_kvzalloc(sbi,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 15/16] f2fs: ensure only power of 2 zone sizes are allowed
       [not found]   ` <CGME20220427160312eucas1p279bcffd97ef83bd3617a38b80d979746@eucas1p2.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-05-03 20:05       ` Jaegeuk Kim
  0 siblings, 1 reply; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

F2FS zoned support has power of 2 zone size assumption in many places
such as in __f2fs_issue_discard_zone, init_blkz_info. As the power of 2
requirement has been removed from the block layer, explicitly add a
condition in f2fs to allow only power of 2 zone size devices.

This condition will be relaxed once those calculation based on power of
2 is made generic.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/f2fs/super.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index f64761a15df7..db79abf30002 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -3685,6 +3685,10 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
 		return 0;
 
 	zone_sectors = bdev_zone_sectors(bdev);
+	if (!is_power_of_2(zone_sectors)) {
+		f2fs_err(sbi, "F2FS does not support non power of 2 zone sizes\n");
+		return -EINVAL;
+	}
 
 	if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
 				SECTOR_TO_BLOCK(zone_sectors))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed
       [not found]   ` <CGME20220427160313eucas1p1feecf74ec15c8c3d9250444710fd1676@eucas1p1.samsung.com>
@ 2022-04-27 16:02     ` Pankaj Raghav
  2022-04-27 23:42       ` Damien Le Moal
  2022-05-04 17:11       ` Hannes Reinecke
  0 siblings, 2 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-27 16:02 UTC (permalink / raw)
  To: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

Today dm-zoned relies on the assumption that you have a zone size
with a power of 2. Even though the block layer today enforces this
requirement, these devices do exist and so provide a stop-gap measure
to ensure these devices cannot be used by mistake

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-zone.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 57daa86c19cf..221e0aa0f1a7 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
 	struct request_queue *q = md->queue;
 	unsigned int noio_flag;
 	int ret;
+	struct block_device *bdev = md->disk->part0;
+	sector_t zone_sectors;
+	char bname[BDEVNAME_SIZE];
+
+	zone_sectors = bdev_zone_sectors(bdev);
+
+	if (!is_power_of_2(zone_sectors)) {
+		DMWARN("%s: %s only power of two zone size supported\n",
+		       dm_device_name(md),
+		       bdevname(bdev, bname));
+		return 1;
+	}
 
 	/*
 	 * Check if something changed. If yes, cleanup the current resources
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 03/16] block: add bdev_zone_no helper
  2022-04-27 16:02     ` [PATCH 03/16] block: add bdev_zone_no helper Pankaj Raghav
@ 2022-04-27 23:31       ` Damien Le Moal
  2022-04-28 15:40         ` Pankaj Raghav
  2022-04-27 23:53       ` Bart Van Assche
  2022-05-04 16:55       ` Hannes Reinecke
  2 siblings, 1 reply; 62+ messages in thread
From: Damien Le Moal @ 2022-04-27 23:31 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block

On 4/28/22 01:02, Pankaj Raghav wrote:
> Many places in the filesystem for zoned devices open code this function
> to find the zone number for a given sector with power of 2 assumption.
> This generic helper can be used to calculate zone number for a given
> sector in a block device
> 
> This helper internally uses blk_queue_zone_no to find the zone number.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  include/linux/blkdev.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f8f2d2998afb..55293e0a8702 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1392,6 +1392,15 @@ static inline bool bdev_zone_aligned(struct block_device *bdev, sector_t sec)
>  	return false;
>  }
>  
> +static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (q)

q is never NULL. So this can be simplified to:

	return blk_queue_zone_no(bdev_get_queue(bdev), sector);

> +		return blk_queue_zone_no(q, sec);
> +	return 0;
> +}
> +
>  static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
>  {
>  	struct request_queue *q = bdev_get_queue(bdev);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-04-27 16:02     ` [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
@ 2022-04-27 23:37       ` Damien Le Moal
  2022-04-28 17:29         ` Luis Chamberlain
  2022-04-28 19:04         ` Pankaj Raghav
  2022-05-04 16:59       ` Hannes Reinecke
  1 sibling, 2 replies; 62+ messages in thread
From: Damien Le Moal @ 2022-04-27 23:37 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block

On 4/28/22 01:02, Pankaj Raghav wrote:
> Convert the calculations on zone size to be generic instead of relying on
> power_of_2 based logic in the block layer using the helpers wherever
> possible.
> 
> The only hot path affected by this change for power_of_2 zoned devices
> is in blk_check_zone_append() but the effects should be negligible as the
> helper blk_queue_zone_aligned() optimizes the calculation for those
> devices. Note that the append path cannot be accessed by direct raw access
> to the block device but only through a filesystem abstraction.
> 
> Finally, remove the check for power_of_2 zone size requirement in
> blk-zoned.c
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  block/blk-core.c  |  3 +--
>  block/blk-zoned.c | 12 ++++++------
>  2 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 937bb6b86331..850caf311064 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>  		return BLK_STS_NOTSUPP;
>  
>  	/* The bio sector must point to the start of a sequential zone */
> -	if (pos & (blk_queue_zone_sectors(q) - 1) ||
> -	    !blk_queue_zone_is_seq(q, pos))
> +	if (!blk_queue_zone_aligned(q, pos) || !blk_queue_zone_is_seq(q, pos))

blk_queue_zone_aligned() is a little confusing since "aligned" is also
used for write-pointer aligned. I would rename this helper

blk_queue_is_zone_start()

or something like that.


>  		return BLK_STS_IOERR;
>  
>  	/*
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 1dff4a8bd51d..f7c7c3bd148d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -288,10 +288,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
>  		return -EINVAL;
>  
>  	/* Check alignment (handle eventual smaller last zone) */
> -	if (sector & (zone_sectors - 1))
> +	if (!blk_queue_zone_aligned(q, sector))
>  		return -EINVAL;
>  
> -	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
> +	if (!blk_queue_zone_aligned(q, nr_sectors) && end_sector != capacity)
>  		return -EINVAL;
>  
>  	/*
> @@ -489,14 +489,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  	 * smaller last zone.
>  	 */
>  	if (zone->start == 0) {
> -		if (zone->len == 0 || !is_power_of_2(zone->len)) {
> -			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
> -				disk->disk_name, zone->len);
> +		if (zone->len == 0) {
> +			pr_warn("%s: Invalid zoned device size",
> +				disk->disk_name);

The message is weird now. Please change it to "Invalid zone size".

Also, the entire premise of this patch series is that it is hard for
people to support the unusable sectors between zone capacity and zone end
for drives with a zone capacity smaller than the zone size.

Yet, here you do not check that zone capacity == zone size for drives that
do not have a zone size equal to a power of 2 number of sectors. This
means that we can still have drives with ZC < ZS AND ZS not equal to a
power of 2. So from the point of view of your arguments, no gains at all.
Any thoughts on this ?

>  			return -ENODEV;
>  		}
>  
>  		args->zone_sectors = zone->len;
> -		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
> +		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
>  	} else if (zone->start + args->zone_sectors < capacity) {
>  		if (zone->len != args->zone_sectors) {
>  			pr_warn("%s: Invalid zoned device with non constant zone size\n",


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 12/16] zonefs: allow non power of 2 zoned devices
  2022-04-27 16:02     ` [PATCH 12/16] zonefs: allow non power of 2 " Pankaj Raghav
@ 2022-04-27 23:39       ` Damien Le Moal
  2022-04-28 15:54         ` Pankaj Raghav
  0 siblings, 1 reply; 62+ messages in thread
From: Damien Le Moal @ 2022-04-27 23:39 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block

On 4/28/22 01:02, Pankaj Raghav wrote:
> The zone size shift variable is useful only if the zone sizes are known
> to be power of 2. Remove that variable and use generic helpers from
> block layer to calculate zone index in zonefs

Period missing at the end of the sentence.

What about zonefs-tools and its test suite ? Is everything still OK on
that front ? I suspect not...

> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/zonefs/super.c  | 6 ++----
>  fs/zonefs/zonefs.h | 1 -
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 3614c7834007..5422be2ca570 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -401,10 +401,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
>  {
>  	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>  	struct super_block *sb = inode->i_sb;
> -	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>  	unsigned int noio_flag;
>  	unsigned int nr_zones =
> -		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
> +		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
>  	struct zonefs_ioerr_data err = {
>  		.inode = inode,
>  		.write = write,
> @@ -1300,7 +1299,7 @@ static void zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
>  	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>  	struct zonefs_inode_info *zi = ZONEFS_I(inode);
>  
> -	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
> +	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
>  	inode->i_mode = S_IFREG | sbi->s_perm;
>  
>  	zi->i_ztype = type;
> @@ -1647,7 +1646,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
>  	 * interface constraints.
>  	 */
>  	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
> -	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
>  	sbi->s_uid = GLOBAL_ROOT_UID;
>  	sbi->s_gid = GLOBAL_ROOT_GID;
>  	sbi->s_perm = 0640;
> diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
> index 7b147907c328..2d5ea3be3a8e 100644
> --- a/fs/zonefs/zonefs.h
> +++ b/fs/zonefs/zonefs.h
> @@ -175,7 +175,6 @@ struct zonefs_sb_info {
>  	kgid_t			s_gid;
>  	umode_t			s_perm;
>  	uuid_t			s_uuid;
> -	unsigned int		s_zone_sectors_shift;
>  
>  	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
>  


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed
  2022-04-27 16:02     ` [PATCH 16/16] dm-zoned: " Pankaj Raghav
@ 2022-04-27 23:42       ` Damien Le Moal
  2022-04-28 17:34         ` Luis Chamberlain
  2022-05-04 17:11       ` Hannes Reinecke
  1 sibling, 1 reply; 62+ messages in thread
From: Damien Le Moal @ 2022-04-27 23:42 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block

On 4/28/22 01:02, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Today dm-zoned relies on the assumption that you have a zone size
> with a power of 2. Even though the block layer today enforces this
> requirement, these devices do exist and so provide a stop-gap measure
> to ensure these devices cannot be used by mistake
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/md/dm-zone.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 57daa86c19cf..221e0aa0f1a7 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>  	struct request_queue *q = md->queue;
>  	unsigned int noio_flag;
>  	int ret;
> +	struct block_device *bdev = md->disk->part0;
> +	sector_t zone_sectors;
> +	char bname[BDEVNAME_SIZE];
> +
> +	zone_sectors = bdev_zone_sectors(bdev);
> +
> +	if (!is_power_of_2(zone_sectors)) {
> +		DMWARN("%s: %s only power of two zone size supported\n",
> +		       dm_device_name(md),
> +		       bdevname(bdev, bname));
> +		return 1;
> +	}

Why ?

See my previous email about still allowing ZC < ZS for non power of 2 zone
size drives. dm-zoned can easily support non power of 2 zone size as long
as ZC == ZS for all zones.

The problem with dm-zoned is ZC < ZS *AND* potentially variable ZC per
zone. That cannot be supported easily (still not impossible, but
definitely a lot more complex).

>  
>  	/*
>  	 * Check if something changed. If yes, cleanup the current resources


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper
  2022-04-27 16:02     ` [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper Pankaj Raghav
@ 2022-04-27 23:52       ` Bart Van Assche
  2022-04-28 15:33         ` Pankaj Raghav
  2022-05-04 16:55       ` Hannes Reinecke
  1 sibling, 1 reply; 62+ messages in thread
From: Bart Van Assche @ 2022-04-27 23:52 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block

On 4/27/22 09:02, Pankaj Raghav wrote:
> +static inline bool bdev_zone_aligned(struct block_device *bdev, sector_t sec)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (q)
> +		return blk_queue_zone_aligned(q, sec);
> +	return false;
> +}

Which patch uses this function? I can't find any patch in this series 
that introduces a call to this function.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 03/16] block: add bdev_zone_no helper
  2022-04-27 16:02     ` [PATCH 03/16] block: add bdev_zone_no helper Pankaj Raghav
  2022-04-27 23:31       ` Damien Le Moal
@ 2022-04-27 23:53       ` Bart Van Assche
  2022-04-28 15:34         ` Pankaj Raghav
  2022-05-04 16:55       ` Hannes Reinecke
  2 siblings, 1 reply; 62+ messages in thread
From: Bart Van Assche @ 2022-04-27 23:53 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block

On 4/27/22 09:02, Pankaj Raghav wrote:
> +static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (q)
> +		return blk_queue_zone_no(q, sec);
> +	return 0;
> +}

This patch series has been split incorrectly: the same patch that 
introduces a new function should also introduce a caller to that function.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper
  2022-04-27 23:52       ` Bart Van Assche
@ 2022-04-28 15:33         ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-28 15:33 UTC (permalink / raw)
  To: Bart Van Assche, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, jiangbo.365, matias.bjorling,
	linux-block

On 2022-04-28 01:52, Bart Van Assche wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> +static inline bool bdev_zone_aligned(struct block_device *bdev,
>> sector_t sec)
>> +{
>> +    struct request_queue *q = bdev_get_queue(bdev);
>> +
>> +    if (q)
>> +        return blk_queue_zone_aligned(q, sec);
>> +    return false;
>> +}
> 
> Which patch uses this function? I can't find any patch in this series
> that introduces a call to this function.
> 
Initially I used it but at the end I had to remove that patch but I
forgot to remove this function. Thanks for pointing it out. I will fix
it up in the next rev.
> Thanks,
> 
> Bart.
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 03/16] block: add bdev_zone_no helper
  2022-04-27 23:53       ` Bart Van Assche
@ 2022-04-28 15:34         ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-28 15:34 UTC (permalink / raw)
  To: Bart Van Assche, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, jiangbo.365, matias.bjorling,
	linux-block

On 2022-04-28 01:53, Bart Van Assche wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> +static inline unsigned int bdev_zone_no(struct block_device *bdev,
>> sector_t sec)
>> +{
>> +    struct request_queue *q = bdev_get_queue(bdev);
>> +
>> +    if (q)
>> +        return blk_queue_zone_no(q, sec);
>> +    return 0;
>> +}
> 
> This patch series has been split incorrectly: the same patch that
> introduces a new function should also introduce a caller to that function.
> 
Acked. I will make sure this happens in the next revision. Thanks.
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 03/16] block: add bdev_zone_no helper
  2022-04-27 23:31       ` Damien Le Moal
@ 2022-04-28 15:40         ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-28 15:40 UTC (permalink / raw)
  To: Damien Le Moal, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, bvanassche, jiangbo.365,
	matias.bjorling, linux-block


On 2022-04-28 01:31, Damien Le Moal wrote:

>> +static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
>> +{
>> +	struct request_queue *q = bdev_get_queue(bdev);
>> +
>> +	if (q)
> 
> q is never NULL. So this can be simplified to:
> 
That is a good point. I just noticed it in the bdev_get_queue() function
that q can never be NULL. I will fix it up.

All the functions `bdev*` have this pattern, so probably they could be
simplified as well in the future.
> 	return blk_queue_zone_no(bdev_get_queue(bdev), sector);
> 
>> +		return blk_queue_zone_no(q, sec);
>> +	return 0;
>> +}
>> +
>>  static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
>>  {
>>  	struct request_queue *q = bdev_get_queue(bdev);
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 12/16] zonefs: allow non power of 2 zoned devices
  2022-04-27 23:39       ` Damien Le Moal
@ 2022-04-28 15:54         ` Pankaj Raghav
  2022-04-28 21:49           ` Damien Le Moal
  0 siblings, 1 reply; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-28 15:54 UTC (permalink / raw)
  To: Damien Le Moal, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, bvanassche, jiangbo.365, linux-fsdevel,
	matias.bjorling, linux-block

On 2022-04-28 01:39, Damien Le Moal wrote:
> On 4/28/22 01:02, Pankaj Raghav wrote:
>> The zone size shift variable is useful only if the zone sizes are known
>> to be power of 2. Remove that variable and use generic helpers from
>> block layer to calculate zone index in zonefs
> 
> Period missing at the end of the sentence.
> 
Ack
> What about zonefs-tools and its test suite ? Is everything still OK on
> that front ? I suspect not...
> 
I don't know why you assume that :). Zonefs had only one place that had
the assumption of po2 zsze sectors:
if (nr_zones < dev.nr_zones) {
	size_t runt_sectors = dev.capacity & (dev.zone_nr_sectors - 1);

In my local tree I changed it and I was able to run zonefs tests for non
po2 zone device. I have also mentioned it in my cover letter:
```
ZoneFS:
zonefs-tests.sh from zonefs-tools were run with no failures.
```
I will make sure to add my private tree for zonefs in my cover letter in
the next rev. But even without that change, a typical emulated npo2
device should work fine because the changes are applicable only for
"runt" zones.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-04-27 23:37       ` Damien Le Moal
@ 2022-04-28 17:29         ` Luis Chamberlain
  2022-04-28 19:04         ` Pankaj Raghav
  1 sibling, 0 replies; 62+ messages in thread
From: Luis Chamberlain @ 2022-04-28 17:29 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, naohiro.aota, sagi,
	dsterba, johannes.thumshirn, linux-kernel, linux-btrfs, clm,
	gost.dev, chao, linux-f2fs-devel, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, dm-devel, bvanassche, jiangbo.365,
	linux-fsdevel, matias.bjorling, linux-block

On Thu, Apr 28, 2022 at 08:37:27AM +0900, Damien Le Moal wrote:
> Also, the entire premise of this patch series is that it is hard for
> people to support the unusable sectors between zone capacity and zone end
> for drives with a zone capacity smaller than the zone size.
> 
> Yet, here you do not check that zone capacity == zone size for drives that
> do not have a zone size equal to a power of 2 number of sectors. This
> means that we can still have drives with ZC < ZS AND ZS not equal to a
> power of 2. So from the point of view of your arguments, no gains at all.
> Any thoughts on this ?

You are right, a check should be added on bringup so that if npo2 is
used, zone cap == zone size. That should be added on the next iteration
of this patch.

  Luis

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed
  2022-04-27 23:42       ` Damien Le Moal
@ 2022-04-28 17:34         ` Luis Chamberlain
  2022-04-28 21:43           ` Damien Le Moal
  0 siblings, 1 reply; 62+ messages in thread
From: Luis Chamberlain @ 2022-04-28 17:34 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, naohiro.aota, sagi,
	dsterba, johannes.thumshirn, linux-kernel, linux-btrfs, clm,
	gost.dev, chao, linux-f2fs-devel, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, dm-devel, bvanassche, jiangbo.365,
	linux-fsdevel, matias.bjorling, linux-block

On Thu, Apr 28, 2022 at 08:42:41AM +0900, Damien Le Moal wrote:
> On 4/28/22 01:02, Pankaj Raghav wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> > 
> > Today dm-zoned relies on the assumption that you have a zone size
> > with a power of 2. Even though the block layer today enforces this
> > requirement, these devices do exist and so provide a stop-gap measure
> > to ensure these devices cannot be used by mistake
> > 
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  drivers/md/dm-zone.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> > index 57daa86c19cf..221e0aa0f1a7 100644
> > --- a/drivers/md/dm-zone.c
> > +++ b/drivers/md/dm-zone.c
> > @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> >  	struct request_queue *q = md->queue;
> >  	unsigned int noio_flag;
> >  	int ret;
> > +	struct block_device *bdev = md->disk->part0;
> > +	sector_t zone_sectors;
> > +	char bname[BDEVNAME_SIZE];
> > +
> > +	zone_sectors = bdev_zone_sectors(bdev);
> > +
> > +	if (!is_power_of_2(zone_sectors)) {
> > +		DMWARN("%s: %s only power of two zone size supported\n",
> > +		       dm_device_name(md),
> > +		       bdevname(bdev, bname));
> > +		return 1;
> > +	}
> 
> Why ?
> 
> See my previous email about still allowing ZC < ZS for non power of 2 zone
> size drives. dm-zoned can easily support non power of 2 zone size as long
> as ZC == ZS for all zones.

Great, thanks for the heads up.

> The problem with dm-zoned is ZC < ZS *AND* potentially variable ZC per
> zone. That cannot be supported easily (still not impossible, but
> definitely a lot more complex).

I see thanks.

Testing would still be required to ensure this all works well with npo2.
So I'd prefer to do that as a separate effort, even if it is easy. So
for now I think it makes sense to avoid this as this is not yet well
tested.

As with filesystem support, we've even have gotten hints that support
for npo2 should be easy, but without proper testing it would not be
prudent to enable support for users yet.

One step at a time.

  Luis

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-04-27 23:37       ` Damien Le Moal
  2022-04-28 17:29         ` Luis Chamberlain
@ 2022-04-28 19:04         ` Pankaj Raghav
  1 sibling, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-28 19:04 UTC (permalink / raw)
  To: Damien Le Moal, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, bvanassche, jiangbo.365,
	matias.bjorling, linux-block

On 2022-04-28 01:37, Damien Le Moal wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 937bb6b86331..850caf311064 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>>  		return BLK_STS_NOTSUPP;
>>  
>>  	/* The bio sector must point to the start of a sequential zone */
>> -	if (pos & (blk_queue_zone_sectors(q) - 1) ||
>> -	    !blk_queue_zone_is_seq(q, pos))
>> +	if (!blk_queue_zone_aligned(q, pos) || !blk_queue_zone_is_seq(q, pos))
> 
> blk_queue_zone_aligned() is a little confusing since "aligned" is also
> used for write-pointer aligned. I would rename this helper
> 
> blk_queue_is_zone_start()
> 
> or something like that.
> 
That is a good idea and definitely a better name that
blk_queue_zone_aligned. I will fix it.

>>  	/*
>> @@ -489,14 +489,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>>  	 * smaller last zone.
>>  	 */
>>  	if (zone->start == 0) {
>> -		if (zone->len == 0 || !is_power_of_2(zone->len)) {
>> -			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
>> -				disk->disk_name, zone->len);
>> +		if (zone->len == 0) {
>> +			pr_warn("%s: Invalid zoned device size",
>> +				disk->disk_name);
> 
> The message is weird now. Please change it to "Invalid zone size".
> 
Ok.
> Also, the entire premise of this patch series is that it is hard for
> people to support the unusable sectors between zone capacity and zone end
> for drives with a zone capacity smaller than the zone size.
> 
> Yet, here you do not check that zone capacity == zone size for drives that
> do not have a zone size equal to a power of 2 number of sectors. This
> means that we can still have drives with ZC < ZS AND ZS not equal to a
> power of 2. So from the point of view of your arguments, no gains at all.
> Any thoughts on this ?
> 
That is a good point. Instead of implicitly assuming npo2 drives to have
ZC == ZS, it is better to be explicit during bringup. Thanks. As Luis
mentioned, I will add this condition in the next revision.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed
  2022-04-28 17:34         ` Luis Chamberlain
@ 2022-04-28 21:43           ` Damien Le Moal
  2022-04-28 22:06             ` Luis Chamberlain
  0 siblings, 1 reply; 62+ messages in thread
From: Damien Le Moal @ 2022-04-28 21:43 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, naohiro.aota, sagi,
	dsterba, johannes.thumshirn, linux-kernel, linux-btrfs, clm,
	gost.dev, chao, linux-f2fs-devel, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, dm-devel, bvanassche, jiangbo.365,
	linux-fsdevel, matias.bjorling, linux-block

On 4/29/22 02:34, Luis Chamberlain wrote:
> On Thu, Apr 28, 2022 at 08:42:41AM +0900, Damien Le Moal wrote:
>> On 4/28/22 01:02, Pankaj Raghav wrote:
>>> From: Luis Chamberlain <mcgrof@kernel.org>
>>>
>>> Today dm-zoned relies on the assumption that you have a zone size
>>> with a power of 2. Even though the block layer today enforces this
>>> requirement, these devices do exist and so provide a stop-gap measure
>>> to ensure these devices cannot be used by mistake
>>>
>>> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
>>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>>> ---
>>>  drivers/md/dm-zone.c | 12 ++++++++++++
>>>  1 file changed, 12 insertions(+)
>>>
>>> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
>>> index 57daa86c19cf..221e0aa0f1a7 100644
>>> --- a/drivers/md/dm-zone.c
>>> +++ b/drivers/md/dm-zone.c
>>> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>>>  	struct request_queue *q = md->queue;
>>>  	unsigned int noio_flag;
>>>  	int ret;
>>> +	struct block_device *bdev = md->disk->part0;
>>> +	sector_t zone_sectors;
>>> +	char bname[BDEVNAME_SIZE];
>>> +
>>> +	zone_sectors = bdev_zone_sectors(bdev);
>>> +
>>> +	if (!is_power_of_2(zone_sectors)) {
>>> +		DMWARN("%s: %s only power of two zone size supported\n",
>>> +		       dm_device_name(md),
>>> +		       bdevname(bdev, bname));
>>> +		return 1;
>>> +	}
>>
>> Why ?
>>
>> See my previous email about still allowing ZC < ZS for non power of 2 zone
>> size drives. dm-zoned can easily support non power of 2 zone size as long
>> as ZC == ZS for all zones.
> 
> Great, thanks for the heads up.
> 
>> The problem with dm-zoned is ZC < ZS *AND* potentially variable ZC per
>> zone. That cannot be supported easily (still not impossible, but
>> definitely a lot more complex).
> 
> I see thanks.
> 
> Testing would still be required to ensure this all works well with npo2.
> So I'd prefer to do that as a separate effort, even if it is easy. So
> for now I think it makes sense to avoid this as this is not yet well
> tested.
> 
> As with filesystem support, we've even have gotten hints that support
> for npo2 should be easy, but without proper testing it would not be
> prudent to enable support for users yet.
> 
> One step at a time.

Yes, in general, I agree. But in this case, that will create kernel
versions that end up having partial support for zoned drives. Not ideal to
say the least. So if the patches are not that big, I would rather like to
see everything go into a single release.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 12/16] zonefs: allow non power of 2 zoned devices
  2022-04-28 15:54         ` Pankaj Raghav
@ 2022-04-28 21:49           ` Damien Le Moal
  2022-04-29  7:55             ` Pankaj Raghav
  0 siblings, 1 reply; 62+ messages in thread
From: Damien Le Moal @ 2022-04-28 21:49 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, bvanassche, jiangbo.365, linux-fsdevel,
	matias.bjorling, linux-block

On 4/29/22 00:54, Pankaj Raghav wrote:
> On 2022-04-28 01:39, Damien Le Moal wrote:
>> On 4/28/22 01:02, Pankaj Raghav wrote:
>>> The zone size shift variable is useful only if the zone sizes are known
>>> to be power of 2. Remove that variable and use generic helpers from
>>> block layer to calculate zone index in zonefs
>>
>> Period missing at the end of the sentence.
>>
> Ack
>> What about zonefs-tools and its test suite ? Is everything still OK on
>> that front ? I suspect not...
>>
> I don't know why you assume that :). Zonefs had only one place that had
> the assumption of po2 zsze sectors:
> if (nr_zones < dev.nr_zones) {
> 	size_t runt_sectors = dev.capacity & (dev.zone_nr_sectors - 1);
> 
> In my local tree I changed it and I was able to run zonefs tests for non
> po2 zone device. I have also mentioned it in my cover letter:
> ```
> ZoneFS:
> zonefs-tests.sh from zonefs-tools were run with no failures.
> ```

This is still not convincing given the code I saw. Additional test cases
need to be added with data verification & concurrent regular writes also
sent while doing copy to verify locking.

Which also reminds me that I have not seen any change to mq-deadline zone
write locking for this series. What is the assumption ? That users should
not be issuing writes when a copy is on-going ? What a bout the reverse
case ? at the very least, it seems that blk_issue_copy() should be taking
the zone write lock.

> I will make sure to add my private tree for zonefs in my cover letter in
> the next rev. But even without that change, a typical emulated npo2
> device should work fine because the changes are applicable only for
> "runt" zones.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed
  2022-04-28 21:43           ` Damien Le Moal
@ 2022-04-28 22:06             ` Luis Chamberlain
  0 siblings, 0 replies; 62+ messages in thread
From: Luis Chamberlain @ 2022-04-28 22:06 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, naohiro.aota, sagi,
	dsterba, johannes.thumshirn, linux-kernel, linux-btrfs, clm,
	gost.dev, chao, linux-f2fs-devel, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, dm-devel, bvanassche, jiangbo.365,
	linux-fsdevel, matias.bjorling, linux-block

On Fri, Apr 29, 2022 at 06:43:58AM +0900, Damien Le Moal wrote:
> On 4/29/22 02:34, Luis Chamberlain wrote:
> > One step at a time.
> 
> Yes, in general, I agree. But in this case, that will create kernel
> versions that end up having partial support for zoned drives. Not ideal to
> say the least. So if the patches are not that big, I would rather like to
> see everything go into a single release.

This would have delayed the patches more, and I see no rush for npo2 to
use dm-zoned really. Just as with f2fs, we can take priorities at a
time.

  Luis

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 12/16] zonefs: allow non power of 2 zoned devices
  2022-04-28 21:49           ` Damien Le Moal
@ 2022-04-29  7:55             ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-04-29  7:55 UTC (permalink / raw)
  To: Damien Le Moal, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, bvanassche, jiangbo.365, linux-fsdevel,
	matias.bjorling, linux-block

Hi Damien,
On 2022-04-28 23:49, Damien Le Moal wrote:
> This is still not convincing given the code I saw. Additional test cases
> need to be added with data verification & concurrent regular writes also
> sent while doing copy to verify locking.
> 
> Which also reminds me that I have not seen any change to mq-deadline zone
> write locking for this series. What is the assumption ? That users should
> not be issuing writes when a copy is on-going ? What a bout the reverse
> case ? at the very least, it seems that blk_issue_copy() should be taking
> the zone write lock.
> 
I think you posted this comment in this thread instead of posting it in
the copy offload thread.

>> I will make sure to add my private tree for zonefs in my cover letter in
>> the next rev. But even without that change, a typical emulated npo2
>> device should work fine because the changes are applicable only for
>> "runt" zones.
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
  2022-04-27 16:02     ` [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
@ 2022-04-29 17:16       ` Adam Manzanares
  2022-05-03 16:37       ` Bart Van Assche
  2022-05-04 16:52       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Adam Manzanares @ 2022-04-29 17:16 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn, linux-kernel,
	linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel, josef,
	jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block

On Wed, Apr 27, 2022 at 06:02:40PM +0200, Pankaj Raghav wrote:
> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
> also work for non-power-of-2 zone sizes.
> 
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is kept for those devices.
> 
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  block/blk-zoned.c      | 8 +++++++-
>  include/linux/blkdev.h | 8 +++++++-
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 38cd840d8838..1dff4a8bd51d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
>  unsigned int blkdev_nr_zones(struct gendisk *disk)
>  {
>  	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
> +	sector_t capacity = get_capacity(disk);
>  
>  	if (!blk_queue_is_zoned(disk->queue))
>  		return 0;
> -	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return (capacity + zone_sectors - 1) >>
> +		       ilog2(zone_sectors);
> +
> +	return div64_u64(capacity + zone_sectors - 1, zone_sectors);
>  }
>  EXPORT_SYMBOL_GPL(blkdev_nr_zones);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 60d016138997..c4e4c7071b7b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
>  static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>  					     sector_t sector)
>  {
> +	sector_t zone_sectors = blk_queue_zone_sectors(q);
> +
>  	if (!blk_queue_is_zoned(q))
>  		return 0;
> -	return sector >> ilog2(q->limits.chunk_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return sector >> ilog2(zone_sectors);
> +
> +	return div64_u64(sector, zone_sectors);
>  }
>  
>  static inline bool blk_queue_zone_is_seq(struct request_queue *q,
> -- 
> 2.25.1
>


Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-04-27 16:02     ` [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 " Pankaj Raghav
@ 2022-04-29 17:23       ` Adam Manzanares
  2022-05-03 16:50       ` Bart Van Assche
  2022-05-04 17:03       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Adam Manzanares @ 2022-04-29 17:23 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn, linux-kernel,
	linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel, josef,
	jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block

On Wed, Apr 27, 2022 at 06:02:44PM +0200, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
> 
> The power_of_2 calculation has been replaced directly with generic
> calculation without special handling. Both modified functions are not
> used in hot paths, they are only used during initialization &
> revalidation of the ZNS device.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/zns.c | 11 ++---------
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..2087de0768ee 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  	}
>  
>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> -	if (!is_power_of_2(ns->zsze)) {
> -		dev_warn(ns->ctrl->device,
> -			"invalid zone size:%llu for namespace:%u\n",
> -			ns->zsze, ns->head->ns_id);
> -		status = -ENODEV;
> -		goto free_data;
> -	}
>  
>  	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +			 div64_u64(get_capacity(ns->disk), ns->zsze));
>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);
> @@ -197,7 +190,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>  	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>  
> -	sector &= ~(ns->zsze - 1);
> +	sector = rounddown(sector, ns->zsze);
>  	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>  		memset(report, 0, buflen);
>  
> -- 
> 2.25.1
>


Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/16] nvmet: use blk_queue_zone_no()
  2022-04-27 16:02     ` [PATCH 06/16] nvmet: use blk_queue_zone_no() Pankaj Raghav
@ 2022-04-29 17:27       ` Adam Manzanares
  2022-05-03 16:54       ` Bart Van Assche
  2022-05-04 17:05       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Adam Manzanares @ 2022-04-29 17:27 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn, linux-kernel,
	linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel, josef,
	jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block

On Wed, Apr 27, 2022 at 06:02:45PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Instead of open coding the number of zones given a sector, use the helper
> blk_queue_zone_no(). This let's us make modifications to the math if
> needed in one place and adds now support for npo2 zone devices.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/target/zns.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
> index e34718b09550..e41b6a6ef048 100644
> --- a/drivers/nvme/target/zns.c
> +++ b/drivers/nvme/target/zns.c
> @@ -243,7 +243,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
>  	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
>  
>  	return blkdev_nr_zones(req->ns->bdev->bd_disk) -
> -		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
> +	       bdev_zone_no(req->ns->bdev, sect);
>  }
>  
>  static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
> -- 
> 2.25.1
>


Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 13/16] null_blk: allow non power of 2 zoned devices
  2022-04-27 16:02     ` [PATCH 13/16] null_blk: " Pankaj Raghav
@ 2022-04-29 17:30       ` Adam Manzanares
  2022-05-03 17:01       ` Bart Van Assche
  2022-05-04 17:10       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Adam Manzanares @ 2022-04-29 17:30 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: jaegeuk, axboe, snitzer, hch, mcgrof, naohiro.aota, sagi,
	damien.lemoal, dsterba, johannes.thumshirn, linux-kernel,
	linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel, josef,
	jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, matias.bjorling,
	linux-block

On Wed, Apr 27, 2022 at 06:02:52PM +0200, Pankaj Raghav wrote:
> Convert the power of 2 based calculation with zone size to be generic in
> null_zone_no with optimization for power of 2 based zone sizes.
> 
> The nr_zones calculation in null_init_zoned_dev has been replaced with a
> division without special handling for power of 2 based zone sizes as
> this function is called only during the initialization and will not
> invoked in the hot path.
> 
> Performance Measurement:
> 
> Device:
> zone size = 128M, blocksize=4k
> 
> FIO cmd:
> 
> fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
> --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
> --loops=4
> 
> The following results are an average of 4 runs on AMD Ryzen 5 5600X with
> 32GB of RAM:
> 
> Sequential Write:
> 
> x-----------------x---------------------------------x---------------------------------x
> |     IOdepth     |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Sequential read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Random read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Minor variations are noticed in Sequential write with io depth 8 and
> in random read with io depth 16. But overall no noticeable differences
> were noticed
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/block/null_blk/main.c  |  4 ++--
>  drivers/block/null_blk/zoned.c | 14 +++++++-------
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index c441a4972064..82a62b543782 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -1931,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
>  		dev->mbps = 0;
>  
>  	if (dev->zoned &&
> -	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
> -		pr_err("zone_size must be power-of-two\n");
> +	    (!dev->zone_size)) {
> +		pr_err("zone_size must not be zero\n");
>  		return -EINVAL;
>  	}
>  
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index dae54dd1aeac..00c34e65ef0a 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -13,7 +13,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
>  
>  static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
>  {
> -	return sect >> ilog2(dev->zone_size_sects);
> +	if (is_power_of_2(dev->zone_size_sects))
> +		return sect >> ilog2(dev->zone_size_sects);
> +
> +	return div64_u64(sect, dev->zone_size_sects);
>  }
>  
>  static inline void null_lock_zone_res(struct nullb_device *dev)
> @@ -62,10 +65,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	sector_t sector = 0;
>  	unsigned int i;
>  
> -	if (!is_power_of_2(dev->zone_size)) {
> -		pr_err("zone_size must be power-of-two\n");
> -		return -EINVAL;
> -	}
>  	if (dev->zone_size > dev->size) {
>  		pr_err("Zone size larger than device capacity\n");
>  		return -EINVAL;
> @@ -83,8 +82,9 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
>  	dev_capacity_sects = mb_to_sects(dev->size);
>  	dev->zone_size_sects = mb_to_sects(dev->zone_size);
> -	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
> -		>> ilog2(dev->zone_size_sects);
> +	dev->nr_zones =
> +		div64_u64(roundup(dev_capacity_sects, dev->zone_size_sects),
> +			  dev->zone_size_sects);
>  
>  	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
>  				    GFP_KERNEL | __GFP_ZERO);
> -- 
> 2.25.1
>


Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 00/16] support non power of 2 zoned devices
  2022-04-27 16:02 ` [PATCH 00/16] support non power of 2 zoned devices Pankaj Raghav
                     ` (15 preceding siblings ...)
       [not found]   ` <CGME20220427160313eucas1p1feecf74ec15c8c3d9250444710fd1676@eucas1p1.samsung.com>
@ 2022-05-02 22:07   ` Johannes Thumshirn
  2022-05-03  9:12     ` Pankaj Raghav
  16 siblings, 1 reply; 62+ messages in thread
From: Johannes Thumshirn @ 2022-05-02 22:07 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	Naohiro Aota, sagi, damien.lemoal, dsterba
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, Matias Bjørling,
	linux-block

On 27/04/2022 09:03, Pankaj Raghav wrote:
> - Background and Motivation:
> 
> The zone storage implementation in Linux, introduced since v4.10, first
> targetted SMR drives which have a power of 2 (po2) zone size alignment
> requirement. The po2 zone size was further imposed implicitly by the
> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
> across chunks beyond the specified size, since v3.16 through commit
> 762380ad9322 ("block: add notion of a chunk size for request merging").
> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
> to be non-power-of-2"). NAND, which is the media used in newer zoned storage
> devices, does not naturally align to po2, and so the po2 requirement
> does not make sense for those type of zone storage devices.
> 
> Removing the po2 requirement from zone storage should therefore be possible
> now provided that no userspace regression and no performance regressions are
> introduced. Stop-gap patches have been already merged into f2fs-tools to
> proactively not allow npo2 zone sizes until proper support is added [0].
> Additional kernel stop-gap patches are provided in this series for dm-zoned.
> Support for npo2 zonefs and btrfs support is addressed in this series.
> 
> There was an effort previously [1] to add support to non po2 devices via
> device level emulation but that was rejected with a final conclusion
> to add support for non po2 zoned device in the complete stack[2].

Hey Pankaj,

One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise 
not fs_info->sectorsize aligned zones) we have to check on every allocation if we still 
have at least have fs_info->sectorsize bytes left in a zone. If not we need to 
explicitly finish the zone, otherwise we'll run out of max active zones. 

This is a problem for zoned btrfs at the moment already but it'll be even worse
with npo2, because we're never implicitly finishing zones.

See also 
https://lore.kernel.org/linux-btrfs/42758829d8696a471a27f7aaeab5468f60b1565d.1651157034.git.naohiro.aota@wdc.com

Thanks,
	Johannes

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 00/16] support non power of 2 zoned devices
  2022-05-02 22:07   ` [PATCH 00/16] support non power of 2 zoned devices Johannes Thumshirn
@ 2022-05-03  9:12     ` Pankaj Raghav
  2022-05-04 21:14       ` David Sterba
  0 siblings, 1 reply; 62+ messages in thread
From: Pankaj Raghav @ 2022-05-03  9:12 UTC (permalink / raw)
  To: Johannes Thumshirn, jaegeuk, axboe, snitzer, hch, mcgrof,
	Naohiro Aota, sagi, damien.lemoal, dsterba
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, Matias Bjørling,
	linux-block

Hi Johannes,
On 2022-05-03 00:07, Johannes Thumshirn wrote:
>> There was an effort previously [1] to add support to non po2 devices via
>> device level emulation but that was rejected with a final conclusion
>> to add support for non po2 zoned device in the complete stack[2].
> 
> Hey Pankaj,
> 
> One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise 
> not fs_info->sectorsize aligned zones) we have to check on every allocation if we still 
> have at least have fs_info->sectorsize bytes left in a zone. If not we need to 
> explicitly finish the zone, otherwise we'll run out of max active zones. 
> 
This commit: `btrfs: zoned: relax the alignment constraint for zoned
devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
which is typically 4k.

This was one of the comment that came from David Sterba:
https://lore.kernel.org/all/20220315142740.GU12643@twin.jikos.cz/
where he suggested to have some sane alignment for the zone sizes.

> This is a problem for zoned btrfs at the moment already but it'll be even worse
> with npo2, because we're never implicitly finishing zones.
> 
> See also 
> https://lore.kernel.org/linux-btrfs/42758829d8696a471a27f7aaeab5468f60b1565d.1651157034.git.naohiro.aota@wdc.com
> 
I did take a look at this few days back and the patch should work fine
also for npo2 zoned device as we allow only zone sizes that are
BTRFS_STRIPE_LEN aligned. So even the max nodesize for METADATA BGs is
only 64k and it should be aligned correctly to implicitly finish the zone.

Let me know your thoughts and if I am missing something.

Regards,
Pankaj

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
  2022-04-27 16:02     ` [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
  2022-04-29 17:16       ` Adam Manzanares
@ 2022-05-03 16:37       ` Bart Van Assche
  2022-05-03 16:43         ` Damien Le Moal
  2022-05-04  8:35         ` Pankaj Raghav
  2022-05-04 16:52       ` Hannes Reinecke
  2 siblings, 2 replies; 62+ messages in thread
From: Bart Van Assche @ 2022-05-03 16:37 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block

On 4/27/22 09:02, Pankaj Raghav wrote:
> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
> also work for non-power-of-2 zone sizes.
> 
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is kept for those devices.
> 
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   block/blk-zoned.c      | 8 +++++++-
>   include/linux/blkdev.h | 8 +++++++-
>   2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 38cd840d8838..1dff4a8bd51d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
>   unsigned int blkdev_nr_zones(struct gendisk *disk)
>   {
>   	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
> +	sector_t capacity = get_capacity(disk);
>   
>   	if (!blk_queue_is_zoned(disk->queue))
>   		return 0;
> -	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return (capacity + zone_sectors - 1) >>
> +		       ilog2(zone_sectors);
> +
> +	return div64_u64(capacity + zone_sectors - 1, zone_sectors);
>   }
>   EXPORT_SYMBOL_GPL(blkdev_nr_zones);

Does anyone need support for more than 4 billion sectors per zone? If 
not, do_div() should be sufficient.

> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 60d016138997..c4e4c7071b7b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
>   static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>   					     sector_t sector)
>   {
> +	sector_t zone_sectors = blk_queue_zone_sectors(q);
> +
>   	if (!blk_queue_is_zoned(q))
>   		return 0;
> -	return sector >> ilog2(q->limits.chunk_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return sector >> ilog2(zone_sectors);
> +
> +	return div64_u64(sector, zone_sectors);
>   }

Same comment here.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
  2022-05-03 16:37       ` Bart Van Assche
@ 2022-05-03 16:43         ` Damien Le Moal
  2022-05-04  8:35         ` Pankaj Raghav
  1 sibling, 0 replies; 62+ messages in thread
From: Damien Le Moal @ 2022-05-03 16:43 UTC (permalink / raw)
  To: Bart Van Assche, Pankaj Raghav, jaegeuk, axboe, snitzer, hch,
	mcgrof, naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block

On 2022/05/04 1:37, Bart Van Assche wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
>> also work for non-power-of-2 zone sizes.
>>
>> As the existing deployments of zoned devices had power-of-2
>> assumption, power-of-2 optimized calculation is kept for those devices.
>>
>> There are no direct hot paths modified and the changes just
>> introduce one new branch per call.
>>
>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>>   block/blk-zoned.c      | 8 +++++++-
>>   include/linux/blkdev.h | 8 +++++++-
>>   2 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 38cd840d8838..1dff4a8bd51d 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
>>   unsigned int blkdev_nr_zones(struct gendisk *disk)
>>   {
>>   	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
>> +	sector_t capacity = get_capacity(disk);
>>   
>>   	if (!blk_queue_is_zoned(disk->queue))
>>   		return 0;
>> -	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
>> +
>> +	if (is_power_of_2(zone_sectors))
>> +		return (capacity + zone_sectors - 1) >>
>> +		       ilog2(zone_sectors);
>> +
>> +	return div64_u64(capacity + zone_sectors - 1, zone_sectors);
>>   }
>>   EXPORT_SYMBOL_GPL(blkdev_nr_zones);
> 
> Does anyone need support for more than 4 billion sectors per zone? If 
> not, do_div() should be sufficient.
> 
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 60d016138997..c4e4c7071b7b 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
>>   static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>>   					     sector_t sector)
>>   {
>> +	sector_t zone_sectors = blk_queue_zone_sectors(q);
>> +
>>   	if (!blk_queue_is_zoned(q))
>>   		return 0;
>> -	return sector >> ilog2(q->limits.chunk_sectors);
>> +
>> +	if (is_power_of_2(zone_sectors))
>> +		return sector >> ilog2(zone_sectors);
>> +
>> +	return div64_u64(sector, zone_sectors);
>>   }
> 
> Same comment here.

sector_t is 64-bits even on 32-bits arch, no ?
so div64_u64 is needed here I think, which will be a simple regular division for
64-bit arch.

> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-04-27 16:02     ` [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 " Pankaj Raghav
  2022-04-29 17:23       ` Adam Manzanares
@ 2022-05-03 16:50       ` Bart Van Assche
  2022-05-04  8:38         ` Pankaj Raghav
  2022-05-04 17:03       ` Hannes Reinecke
  2 siblings, 1 reply; 62+ messages in thread
From: Bart Van Assche @ 2022-05-03 16:50 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block

On 4/27/22 09:02, Pankaj Raghav wrote:
> -	sector &= ~(ns->zsze - 1);
> +	sector = rounddown(sector, ns->zsze);

The above change breaks 32-bit builds since ns->zsze is 64 bits wide and 
since rounddown() uses the C division operator instead of div64_u64().

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/16] nvmet: use blk_queue_zone_no()
  2022-04-27 16:02     ` [PATCH 06/16] nvmet: use blk_queue_zone_no() Pankaj Raghav
  2022-04-29 17:27       ` Adam Manzanares
@ 2022-05-03 16:54       ` Bart Van Assche
  2022-05-04 17:05       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Bart Van Assche @ 2022-05-03 16:54 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block

On 4/27/22 09:02, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Instead of open coding the number of zones given a sector, use the helper
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I can't parse this. Please rephrase this.

> blk_queue_zone_no(). This let's us make modifications to the math if
> needed in one place and adds now support for npo2 zone devices.

But since the code looks fine:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 13/16] null_blk: allow non power of 2 zoned devices
  2022-04-27 16:02     ` [PATCH 13/16] null_blk: " Pankaj Raghav
  2022-04-29 17:30       ` Adam Manzanares
@ 2022-05-03 17:01       ` Bart Van Assche
  2022-05-04 17:10       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Bart Van Assche @ 2022-05-03 17:01 UTC (permalink / raw)
  To: Pankaj Raghav, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block

On 4/27/22 09:02, Pankaj Raghav wrote:
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index c441a4972064..82a62b543782 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -1931,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
>   		dev->mbps = 0;
>   
>   	if (dev->zoned &&
> -	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
> -		pr_err("zone_size must be power-of-two\n");
> +	    (!dev->zone_size)) {
> +		pr_err("zone_size must not be zero\n");
>   		return -EINVAL;
>   	}

Please combine "if (dev->zoned &&" and "(!dev->zone_size)) {" into a 
single line and leave out the parentheses that became superfluous.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 14/16] f2fs: call bdev_zone_sectors() only once on init_blkz_info()
  2022-04-27 16:02     ` [PATCH 14/16] f2fs: call bdev_zone_sectors() only once on init_blkz_info() Pankaj Raghav
@ 2022-05-03 20:04       ` Jaegeuk Kim
  0 siblings, 0 replies; 62+ messages in thread
From: Jaegeuk Kim @ 2022-05-03 20:04 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, snitzer, hch, mcgrof, naohiro.aota, sagi, damien.lemoal,
	dsterba, johannes.thumshirn, linux-kernel, linux-btrfs, clm,
	gost.dev, chao, linux-f2fs-devel, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, dm-devel, bvanassche, jiangbo.365,
	linux-fsdevel, matias.bjorling, linux-block

Applied to f2fs tree. Thanks,

On 04/27, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Instead of calling bdev_zone_sectors() multiple times, call
> it once and cache the value locally. This will make the
> subsequent change easier to read.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/f2fs/super.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index ea939db18f88..f64761a15df7 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -3678,22 +3678,25 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
>  	struct block_device *bdev = FDEV(devi).bdev;
>  	sector_t nr_sectors = bdev_nr_sectors(bdev);
>  	struct f2fs_report_zones_args rep_zone_arg;
> +	u64 zone_sectors;
>  	int ret;
>  
>  	if (!f2fs_sb_has_blkzoned(sbi))
>  		return 0;
>  
> +	zone_sectors = bdev_zone_sectors(bdev);
> +
>  	if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
> -				SECTOR_TO_BLOCK(bdev_zone_sectors(bdev)))
> +				SECTOR_TO_BLOCK(zone_sectors))
>  		return -EINVAL;
> -	sbi->blocks_per_blkz = SECTOR_TO_BLOCK(bdev_zone_sectors(bdev));
> +	sbi->blocks_per_blkz = SECTOR_TO_BLOCK(zone_sectors);
>  	if (sbi->log_blocks_per_blkz && sbi->log_blocks_per_blkz !=
>  				__ilog2_u32(sbi->blocks_per_blkz))
>  		return -EINVAL;
>  	sbi->log_blocks_per_blkz = __ilog2_u32(sbi->blocks_per_blkz);
>  	FDEV(devi).nr_blkz = SECTOR_TO_BLOCK(nr_sectors) >>
>  					sbi->log_blocks_per_blkz;
> -	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
> +	if (nr_sectors & (zone_sectors - 1))
>  		FDEV(devi).nr_blkz++;
>  
>  	FDEV(devi).blkz_seq = f2fs_kvzalloc(sbi,
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 15/16] f2fs: ensure only power of 2 zone sizes are allowed
  2022-04-27 16:02     ` [PATCH 15/16] f2fs: ensure only power of 2 zone sizes are allowed Pankaj Raghav
@ 2022-05-03 20:05       ` Jaegeuk Kim
  2022-05-04  8:53         ` Pankaj Raghav
  0 siblings, 1 reply; 62+ messages in thread
From: Jaegeuk Kim @ 2022-05-03 20:05 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, snitzer, hch, mcgrof, naohiro.aota, sagi, damien.lemoal,
	dsterba, johannes.thumshirn, linux-kernel, linux-btrfs, clm,
	gost.dev, chao, linux-f2fs-devel, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, dm-devel, bvanassche, jiangbo.365,
	linux-fsdevel, matias.bjorling, linux-block

Applied to f2fs tree. Thanks,

On 04/27, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> F2FS zoned support has power of 2 zone size assumption in many places
> such as in __f2fs_issue_discard_zone, init_blkz_info. As the power of 2
> requirement has been removed from the block layer, explicitly add a
> condition in f2fs to allow only power of 2 zone size devices.
> 
> This condition will be relaxed once those calculation based on power of
> 2 is made generic.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/f2fs/super.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index f64761a15df7..db79abf30002 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -3685,6 +3685,10 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
>  		return 0;
>  
>  	zone_sectors = bdev_zone_sectors(bdev);
> +	if (!is_power_of_2(zone_sectors)) {
> +		f2fs_err(sbi, "F2FS does not support non power of 2 zone sizes\n");
> +		return -EINVAL;
> +	}
>  
>  	if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
>  				SECTOR_TO_BLOCK(zone_sectors))
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
  2022-05-03 16:37       ` Bart Van Assche
  2022-05-03 16:43         ` Damien Le Moal
@ 2022-05-04  8:35         ` Pankaj Raghav
  1 sibling, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-05-04  8:35 UTC (permalink / raw)
  To: Bart Van Assche, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block

On 2022-05-03 18:37, Bart Van Assche wrote:
>>       sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
>> +    sector_t capacity = get_capacity(disk);
>>         if (!blk_queue_is_zoned(disk->queue))
>>           return 0;
>> -    return (get_capacity(disk) + zone_sectors - 1) >>
>> ilog2(zone_sectors);
>> +
>> +    if (is_power_of_2(zone_sectors))
>> +        return (capacity + zone_sectors - 1) >>
>> +               ilog2(zone_sectors);
>> +
>> +    return div64_u64(capacity + zone_sectors - 1, zone_sectors);
>>   }
>>   EXPORT_SYMBOL_GPL(blkdev_nr_zones);
> 
> Does anyone need support for more than 4 billion sectors per zone? If
> not, do_div() should be sufficient.
> 
You are absolutely right but blk_queue_zone_sectors explicitly changed
the return type to uint32_t to sector_t in this commit:
'113ab72  block: Fix potential overflow in blk_report_zones()'.

I initially did have a do_div but later changed to div64_u64 to avoid
any implicit down casting even though zone_sectors will be always
limited to an unsigned int.
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 60d016138997..c4e4c7071b7b 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -665,9 +665,15 @@ static inline unsigned int
>> blk_queue_nr_zones(struct request_queue *q)
>>   static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>>                            sector_t sector)
>>   {
>> +    sector_t zone_sectors = blk_queue_zone_sectors(q);
>> +
>>       if (!blk_queue_is_zoned(q))
>>           return 0;
>> -    return sector >> ilog2(q->limits.chunk_sectors);
>> +
>> +    if (is_power_of_2(zone_sectors))
>> +        return sector >> ilog2(zone_sectors);
>> +
>> +    return div64_u64(sector, zone_sectors);
>>   }
> 
> Same comment here.
> 
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-05-03 16:50       ` Bart Van Assche
@ 2022-05-04  8:38         ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-05-04  8:38 UTC (permalink / raw)
  To: Bart Van Assche, jaegeuk, axboe, snitzer, hch, mcgrof,
	naohiro.aota, sagi, damien.lemoal, dsterba, johannes.thumshirn
  Cc: linux-kernel, linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel,
	josef, jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	jiangbo.365, linux-fsdevel, matias.bjorling, linux-block


On 2022-05-03 18:50, Bart Van Assche wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> -    sector &= ~(ns->zsze - 1);
>> +    sector = rounddown(sector, ns->zsze);
> 
> The above change breaks 32-bit builds since ns->zsze is 64 bits wide and
> since rounddown() uses the C division operator instead of div64_u64().
> 
Good catch. I don't see any generic helper for rounddown that will work
for both 32 bits and 64 bits. Maybe I will open code the rounddown logic
here so that it works for both 32 and 64 bits.
> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 15/16] f2fs: ensure only power of 2 zone sizes are allowed
  2022-05-03 20:05       ` Jaegeuk Kim
@ 2022-05-04  8:53         ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-05-04  8:53 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: axboe, snitzer, hch, mcgrof, naohiro.aota, sagi, damien.lemoal,
	dsterba, johannes.thumshirn, linux-kernel, linux-btrfs, clm,
	gost.dev, chao, linux-f2fs-devel, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, dm-devel, bvanassche, jiangbo.365,
	linux-fsdevel, matias.bjorling, linux-block


On 2022-05-03 22:05, Jaegeuk Kim wrote:
> Applied to f2fs tree. Thanks,
>
Thanks Jaegeuk. I will remove the f2fs patches from my next revision

Regards,
Pankaj
> On 04/27, Pankaj Raghav wrote:
>> From: Luis Chamberlain <mcgrof@kernel.org>
>>
>> F2FS zoned support has power of 2 zone size assumption in many places
>> such as in __f2fs_issue_discard_zone, init_blkz_info. As the power of 2
>> requirement has been removed from the block layer, explicitly add a
>> condition in f2fs to allow only power of 2 zone size devices.
>>
>> This condition will be relaxed once those calculation based on power of
>> 2 is made generic.
>>
>> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>>  fs/f2fs/super.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>> index f64761a15df7..db79abf30002 100644
>> --- a/fs/f2fs/super.c
>> +++ b/fs/f2fs/super.c
>> @@ -3685,6 +3685,10 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
>>  		return 0;
>>  
>>  	zone_sectors = bdev_zone_sectors(bdev);
>> +	if (!is_power_of_2(zone_sectors)) {
>> +		f2fs_err(sbi, "F2FS does not support non power of 2 zone sizes\n");
>> +		return -EINVAL;
>> +	}
>>  
>>  	if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
>>  				SECTOR_TO_BLOCK(zone_sectors))
>> -- 
>> 2.25.1

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
  2022-04-27 16:02     ` [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
  2022-04-29 17:16       ` Adam Manzanares
  2022-05-03 16:37       ` Bart Van Assche
@ 2022-05-04 16:52       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Hannes Reinecke @ 2022-05-04 16:52 UTC (permalink / raw)


On 4/27/22 09:02, Pankaj Raghav wrote:
> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
> also work for non-power-of-2 zone sizes.
> 
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is kept for those devices.
> 
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   block/blk-zoned.c      | 8 +++++++-
>   include/linux/blkdev.h | 8 +++++++-
>   2 files changed, 14 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper
  2022-04-27 16:02     ` [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper Pankaj Raghav
  2022-04-27 23:52       ` Bart Van Assche
@ 2022-05-04 16:55       ` Hannes Reinecke
  1 sibling, 0 replies; 62+ messages in thread
From: Hannes Reinecke @ 2022-05-04 16:55 UTC (permalink / raw)


On 4/27/22 09:02, Pankaj Raghav wrote:
> Checking if a given sector is aligned to a zone is a very common
> operation that is performed for zoned devices. Add
> blk_queue_zone_aligned helper to check for this instead of opencoding it
> everywhere.
> 
> The helper is made to be generic so that it can also check for alignment
> for non non-power-of-2 zone size devices.
> 
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is done for devices with
> power-of-2 zone size
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   include/linux/blkdev.h | 31 +++++++++++++++++++++++++++++++
>   1 file changed, 31 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 03/16] block: add bdev_zone_no helper
  2022-04-27 16:02     ` [PATCH 03/16] block: add bdev_zone_no helper Pankaj Raghav
  2022-04-27 23:31       ` Damien Le Moal
  2022-04-27 23:53       ` Bart Van Assche
@ 2022-05-04 16:55       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Hannes Reinecke @ 2022-05-04 16:55 UTC (permalink / raw)


On 4/27/22 09:02, Pankaj Raghav wrote:
> Many places in the filesystem for zoned devices open code this function
> to find the zone number for a given sector with power of 2 assumption.
> This generic helper can be used to calculate zone number for a given
> sector in a block device
> 
> This helper internally uses blk_queue_zone_no to find the zone number.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   include/linux/blkdev.h | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-04-27 16:02     ` [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
  2022-04-27 23:37       ` Damien Le Moal
@ 2022-05-04 16:59       ` Hannes Reinecke
  2022-05-04 18:46         ` Pankaj Raghav
  1 sibling, 1 reply; 62+ messages in thread
From: Hannes Reinecke @ 2022-05-04 16:59 UTC (permalink / raw)


On 4/27/22 09:02, Pankaj Raghav wrote:
> Convert the calculations on zone size to be generic instead of relying on
> power_of_2 based logic in the block layer using the helpers wherever
> possible.
> 
> The only hot path affected by this change for power_of_2 zoned devices
> is in blk_check_zone_append() but the effects should be negligible as the
> helper blk_queue_zone_aligned() optimizes the calculation for those
> devices. Note that the append path cannot be accessed by direct raw access
> to the block device but only through a filesystem abstraction.
> 
> Finally, remove the check for power_of_2 zone size requirement in
> blk-zoned.c
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   block/blk-core.c  |  3 +--
>   block/blk-zoned.c | 12 ++++++------
>   2 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 937bb6b86331..850caf311064 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>   		return BLK_STS_NOTSUPP;
>   
>   	/* The bio sector must point to the start of a sequential zone */
> -	if (pos & (blk_queue_zone_sectors(q) - 1) ||
> -	    !blk_queue_zone_is_seq(q, pos))
> +	if (!blk_queue_zone_aligned(q, pos) || !blk_queue_zone_is_seq(q, pos))
>   		return BLK_STS_IOERR;
>   
>   	/*
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 1dff4a8bd51d..f7c7c3bd148d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -288,10 +288,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
>   		return -EINVAL;
>   
>   	/* Check alignment (handle eventual smaller last zone) */
> -	if (sector & (zone_sectors - 1))
> +	if (!blk_queue_zone_aligned(q, sector))
>   		return -EINVAL;
>   
> -	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
> +	if (!blk_queue_zone_aligned(q, nr_sectors) && end_sector != capacity)
>   		return -EINVAL;
>   
>   	/*
> @@ -489,14 +489,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>   	 * smaller last zone.
>   	 */
>   	if (zone->start == 0) {
> -		if (zone->len == 0 || !is_power_of_2(zone->len)) {
> -			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
> -				disk->disk_name, zone->len);
> +		if (zone->len == 0) {
> +			pr_warn("%s: Invalid zoned device size",
> +				disk->disk_name);
>   			return -ENODEV;
>   		}
>   
>   		args->zone_sectors = zone->len;
> -		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
> +		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);

This is a different calculation than the one you're using in the first 
patch. Can you please add a helper such that both are using the same 
calculation?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-04-27 16:02     ` [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 " Pankaj Raghav
  2022-04-29 17:23       ` Adam Manzanares
  2022-05-03 16:50       ` Bart Van Assche
@ 2022-05-04 17:03       ` Hannes Reinecke
  2022-05-04 18:55         ` Pankaj Raghav
  2 siblings, 1 reply; 62+ messages in thread
From: Hannes Reinecke @ 2022-05-04 17:03 UTC (permalink / raw)


On 4/27/22 09:02, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
> 
> The power_of_2 calculation has been replaced directly with generic
> calculation without special handling. Both modified functions are not
> used in hot paths, they are only used during initialization &
> revalidation of the ZNS device.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   drivers/nvme/host/zns.c | 11 ++---------
>   1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..2087de0768ee 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>   	}
>   
>   	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> -	if (!is_power_of_2(ns->zsze)) {
> -		dev_warn(ns->ctrl->device,
> -			"invalid zone size:%llu for namespace:%u\n",
> -			ns->zsze, ns->head->ns_id);
> -		status = -ENODEV;
> -		goto free_data;
> -	}
>   
>   	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>   	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>   				   sizeof(struct nvme_zone_descriptor);
>   
>   	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +			 div64_u64(get_capacity(ns->disk), ns->zsze));
>   
Same here; please add a helper calculating the number of zones for a 
given disk.

>   	bufsize = sizeof(struct nvme_zone_report) +
>   		nr_zones * sizeof(struct nvme_zone_descriptor);
> @@ -197,7 +190,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>   	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>   	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>   
> -	sector &= ~(ns->zsze - 1);
> +	sector = rounddown(sector, ns->zsze);
>   	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>   		memset(report, 0, buflen);
>   
Please be a bit more consistent. In the previous patches you always had 
a condition to check if it's a power_of_2 zone size, but here you are 
using the same calculation for each disk.
So please use the check in all locations, or add a comment why the 
generic calculation is okay to use here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 06/16] nvmet: use blk_queue_zone_no()
  2022-04-27 16:02     ` [PATCH 06/16] nvmet: use blk_queue_zone_no() Pankaj Raghav
  2022-04-29 17:27       ` Adam Manzanares
  2022-05-03 16:54       ` Bart Van Assche
@ 2022-05-04 17:05       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Hannes Reinecke @ 2022-05-04 17:05 UTC (permalink / raw)


On 4/27/22 09:02, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Instead of open coding the number of zones given a sector, use the helper
> blk_queue_zone_no(). This let's us make modifications to the math if
> needed in one place and adds now support for npo2 zone devices.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   drivers/nvme/target/zns.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 13/16] null_blk: allow non power of 2 zoned devices
  2022-04-27 16:02     ` [PATCH 13/16] null_blk: " Pankaj Raghav
  2022-04-29 17:30       ` Adam Manzanares
  2022-05-03 17:01       ` Bart Van Assche
@ 2022-05-04 17:10       ` Hannes Reinecke
  2 siblings, 0 replies; 62+ messages in thread
From: Hannes Reinecke @ 2022-05-04 17:10 UTC (permalink / raw)


On 4/27/22 09:02, Pankaj Raghav wrote:
> Convert the power of 2 based calculation with zone size to be generic in
> null_zone_no with optimization for power of 2 based zone sizes.
> 
> The nr_zones calculation in null_init_zoned_dev has been replaced with a
> division without special handling for power of 2 based zone sizes as
> this function is called only during the initialization and will not
> invoked in the hot path.
> 
> Performance Measurement:
> 
> Device:
> zone size = 128M, blocksize=4k
> 
> FIO cmd:
> 
> fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
> --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
> --loops=4
> 
> The following results are an average of 4 runs on AMD Ryzen 5 5600X with
> 32GB of RAM:
> 
> Sequential Write:
> 
> x-----------------x---------------------------------x---------------------------------x
> |     IOdepth     |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Sequential read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Random read:
> 
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth         |            8                    |            16                   |
> x-----------------x---------------------------------x---------------------------------x
> |                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
> x-----------------x---------------------------------x---------------------------------x
> |  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
> x-----------------x---------------------------------x---------------------------------x
> 
> Minor variations are noticed in Sequential write with io depth 8 and
> in random read with io depth 16. But overall no noticeable differences
> were noticed
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   drivers/block/null_blk/main.c  |  4 ++--
>   drivers/block/null_blk/zoned.c | 14 +++++++-------
>   2 files changed, 9 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed
  2022-04-27 16:02     ` [PATCH 16/16] dm-zoned: " Pankaj Raghav
  2022-04-27 23:42       ` Damien Le Moal
@ 2022-05-04 17:11       ` Hannes Reinecke
  1 sibling, 0 replies; 62+ messages in thread
From: Hannes Reinecke @ 2022-05-04 17:11 UTC (permalink / raw)


On 4/27/22 09:02, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Today dm-zoned relies on the assumption that you have a zone size
> with a power of 2. Even though the block layer today enforces this
> requirement, these devices do exist and so provide a stop-gap measure
> to ensure these devices cannot be used by mistake
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   drivers/md/dm-zone.c | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 57daa86c19cf..221e0aa0f1a7 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>   	struct request_queue *q = md->queue;
>   	unsigned int noio_flag;
>   	int ret;
> +	struct block_device *bdev = md->disk->part0;
> +	sector_t zone_sectors;
> +	char bname[BDEVNAME_SIZE];
> +
> +	zone_sectors = bdev_zone_sectors(bdev);
> +
> +	if (!is_power_of_2(zone_sectors)) {
> +		DMWARN("%s: %s only power of two zone size supported\n",
> +		       dm_device_name(md),
> +		       bdevname(bdev, bname));
> +		return 1;
> +	}
>   
>   	/*
>   	 * Check if something changed. If yes, cleanup the current resources

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-05-04 16:59       ` Hannes Reinecke
@ 2022-05-04 18:46         ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-05-04 18:46 UTC (permalink / raw)
  To: Hannes Reinecke, Damien Le Moal, jaegeuk, axboe, snitzer, hch,
	mcgrof, naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, bvanassche, jiangbo.365,
	matias.bjorling, linux-block

Hi Hannes,
	Somehow your message did not go through the mailing list. Maybe you hit
reply instead of reply all?

I have added your reviewed-by tag in the other commits based on your
response. Thanks.

Anyway, my response to this email below:

On 2022-05-04 18:59, Hannes Reinecke wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> zone size (%llu)\n",
>> -                disk->disk_name, zone->len);
>> +        if (zone->len == 0) {
>> +            pr_warn("%s: Invalid zoned device size",
>> +                disk->disk_name);
>>               return -ENODEV;
>>           }
>>             args->zone_sectors = zone->len;
>> -        args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
>> +        args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
> 
> This is a different calculation than the one you're using in the first
> patch. Can you please add a helper such that both are using the same
> calculation?
>
So this calculation is actually doing a roundup to the nearest zone
number operation and not just a division that is done in the block layer
helper such as bdev_zone_no(). Note the `zone->len - 1` added to the
capacity. This is done to take into account also the last unequal zone
size, if present.

Another thing to note is that block layer helpers cannot be used here
because at this point we haven't set the chunk sectors and we are still
in the revalidation callback.

Maybe some comments on top of this will help to avoid any confusion?
What do you think? And, I am not aware of any generic helper in math.h
that does this operation for both 32 and 64 bit architecture.

Regards,
Pankaj

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-05-04 17:03       ` Hannes Reinecke
@ 2022-05-04 18:55         ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-05-04 18:55 UTC (permalink / raw)
  To: Hannes Reinecke, Damien Le Moal, jaegeuk, axboe, snitzer, hch,
	mcgrof, naohiro.aota, sagi, dsterba, johannes.thumshirn
  Cc: linux-kernel, clm, gost.dev, chao, josef, jonathan.derrick, agk,
	kbusch, kch, linux-nvme, bvanassche, jiangbo.365,
	matias.bjorling, linux-block

On 2022-05-04 19:03, Hannes Reinecke wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> Remove the condition which disallows non-power_of_2 zone size ZNS drive
>> to be updated and use generic method to calculate number of zones
>> instead of relying on log and shift based calculation on zone size.
>>
>> The power_of_2 calculation has been replaced directly with generic
>> calculation without special handling. Both modified functions are not
>> used in hot paths, they are only used during initialization &
>> revalidation of the ZNS device.
>>
>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
     }
>>         ns->zsze = nvme_lba_to_sect(ns,
>> le64_to_cpu(id->lbafe[lbaf].zsze));
>> -    if (!is_power_of_2(ns->zsze)) {
>> -        dev_warn(ns->ctrl->device,
>> -            "invalid zone size:%llu for namespace:%u\n",
>> -            ns->zsze, ns->head->ns_id);
>> -        status = -ENODEV;
>> -        goto free_data;
>> -    }
>>         blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>>       blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct
>> nvme_ns *ns,
>>                      sizeof(struct nvme_zone_descriptor);
>>         nr_zones = min_t(unsigned int, nr_zones,
>> -             get_capacity(ns->disk) >> ilog2(ns->zsze));
>> +             div64_u64(get_capacity(ns->disk), ns->zsze));
>>   
> Same here; please add a helper calculating the number of zones for a
> given disk.
> 
I am already  using the div64_u64 helper and this is not done again
anywhere in the nvme zns driver. I am not sure if having a separate
helper for this will add value. And this is not in the hot path, so no
need for special handling.
>>       bufsize = sizeof(struct nvme_zone_report) +
>>           nr_zones * sizeof(struct nvme_zone_descriptor);
>> @@ -197,7 +190,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns,
>> sector_t sector,
>>       c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>       c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>>   -    sector &= ~(ns->zsze - 1);
>> +    sector = rounddown(sector, ns->zsze);
>>       while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>>           memset(report, 0, buflen);
>>   
> Please be a bit more consistent. In the previous patches you always had
> a condition to check if it's a power_of_2 zone size, but here you are
> using the same calculation for each disk.
> So please use the check in all locations, or add a comment why the
> generic calculation is okay to use here.
> 
That is a good point. I have mentioned that in my commit log that I am
not having any special handling because these calculations are not in
the hot path.

Maybe adding comments is better for clarity. I will also do it for your
previous comment.

I will queue this up for my next revision. Thanks.
> Cheers,
> 
> Hannes

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 00/16] support non power of 2 zoned devices
  2022-05-03  9:12     ` Pankaj Raghav
@ 2022-05-04 21:14       ` David Sterba
  2022-05-05  7:28         ` Pankaj Raghav
  0 siblings, 1 reply; 62+ messages in thread
From: David Sterba @ 2022-05-04 21:14 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Johannes Thumshirn, jaegeuk, axboe, snitzer, hch, mcgrof,
	Naohiro Aota, sagi, damien.lemoal, dsterba, linux-kernel,
	linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel, josef,
	jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, Matias Bjørling,
	linux-block

On Tue, May 03, 2022 at 11:12:04AM +0200, Pankaj Raghav wrote:
> Hi Johannes,
> On 2022-05-03 00:07, Johannes Thumshirn wrote:
> >> There was an effort previously [1] to add support to non po2 devices via
> >> device level emulation but that was rejected with a final conclusion
> >> to add support for non po2 zoned device in the complete stack[2].
> > 
> > Hey Pankaj,
> > 
> > One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise 
> > not fs_info->sectorsize aligned zones) we have to check on every allocation if we still 
> > have at least have fs_info->sectorsize bytes left in a zone. If not we need to 
> > explicitly finish the zone, otherwise we'll run out of max active zones. 
> > 
> This commit: `btrfs: zoned: relax the alignment constraint for zoned
> devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
> even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
> which is typically 4k.
> 
> This was one of the comment that came from David Sterba:
> https://lore.kernel.org/all/20220315142740.GU12643@twin.jikos.cz/
> where he suggested to have some sane alignment for the zone sizes.

My idea of 'sane' value would be 1M, that we have 4K for sectors is
because of the 1:1 mapping to pages, but RAM sizes are on a different
scale than storage devices. The 4K is absolute minimum but if the page
size is taken as a basic constraint, ARM has 64K and there are some 256K
arches.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 00/16] support non power of 2 zoned devices
  2022-05-04 21:14       ` David Sterba
@ 2022-05-05  7:28         ` Pankaj Raghav
  0 siblings, 0 replies; 62+ messages in thread
From: Pankaj Raghav @ 2022-05-05  7:28 UTC (permalink / raw)
  To: David Sterba
  Cc: Johannes Thumshirn, jaegeuk, axboe, snitzer, hch, mcgrof,
	Naohiro Aota, sagi, damien.lemoal, dsterba, linux-kernel,
	linux-btrfs, clm, gost.dev, chao, linux-f2fs-devel, josef,
	jonathan.derrick, agk, kbusch, kch, linux-nvme, dm-devel,
	bvanassche, jiangbo.365, linux-fsdevel, Matias Bjørling,
	linux-block

On 2022-05-04 23:14, David Sterba wrote:
>> This commit: `btrfs: zoned: relax the alignment constraint for zoned
>> devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
>> even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
>> which is typically 4k.
>>
>> This was one of the comment that came from David Sterba:
>> https://lore.kernel.org/all/20220315142740.GU12643@twin.jikos.cz/
>> where he suggested to have some sane alignment for the zone sizes.
> 
> My idea of 'sane' value would be 1M, that we have 4K for sectors is
> because of the 1:1 mapping to pages, but RAM sizes are on a different
> scale than storage devices. The 4K is absolute minimum but if the page
> size is taken as a basic constraint, ARM has 64K and there are some 256K
> arches.

That is a good point. I think it is safe to have 1MB as the minimum
alignment so that it covers all architecture's page sizes. Thanks. I
will queue this up.

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2022-05-05  7:29 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20220427160256eucas1p2db2b58792ffc93026d870c260767da14@eucas1p2.samsung.com>
2022-04-27 16:02 ` [PATCH 00/16] support non power of 2 zoned devices Pankaj Raghav
     [not found]   ` <CGME20220427160257eucas1p21fb58d0129376a135fdf0b9c2fe88895@eucas1p2.samsung.com>
2022-04-27 16:02     ` [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
2022-04-29 17:16       ` Adam Manzanares
2022-05-03 16:37       ` Bart Van Assche
2022-05-03 16:43         ` Damien Le Moal
2022-05-04  8:35         ` Pankaj Raghav
2022-05-04 16:52       ` Hannes Reinecke
     [not found]   ` <CGME20220427160258eucas1p19548a7094f67b4c9f340add776f60082@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper Pankaj Raghav
2022-04-27 23:52       ` Bart Van Assche
2022-04-28 15:33         ` Pankaj Raghav
2022-05-04 16:55       ` Hannes Reinecke
     [not found]   ` <CGME20220427160259eucas1p25aab0637fec229cd1140e6aa08678f38@eucas1p2.samsung.com>
2022-04-27 16:02     ` [PATCH 03/16] block: add bdev_zone_no helper Pankaj Raghav
2022-04-27 23:31       ` Damien Le Moal
2022-04-28 15:40         ` Pankaj Raghav
2022-04-27 23:53       ` Bart Van Assche
2022-04-28 15:34         ` Pankaj Raghav
2022-05-04 16:55       ` Hannes Reinecke
     [not found]   ` <CGME20220427160300eucas1p1470fe30535849de6204bb78d7083cb3a@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
2022-04-27 23:37       ` Damien Le Moal
2022-04-28 17:29         ` Luis Chamberlain
2022-04-28 19:04         ` Pankaj Raghav
2022-05-04 16:59       ` Hannes Reinecke
2022-05-04 18:46         ` Pankaj Raghav
     [not found]   ` <CGME20220427160301eucas1p147d0dced70946e20dd2dd046b94b8224@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 " Pankaj Raghav
2022-04-29 17:23       ` Adam Manzanares
2022-05-03 16:50       ` Bart Van Assche
2022-05-04  8:38         ` Pankaj Raghav
2022-05-04 17:03       ` Hannes Reinecke
2022-05-04 18:55         ` Pankaj Raghav
     [not found]   ` <CGME20220427160302eucas1p1aaba7a309778d3440c3315ad899e4035@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 06/16] nvmet: use blk_queue_zone_no() Pankaj Raghav
2022-04-29 17:27       ` Adam Manzanares
2022-05-03 16:54       ` Bart Van Assche
2022-05-04 17:05       ` Hannes Reinecke
     [not found]   ` <CGME20220427160303eucas1p1c7d1b743e9ecf77b4f203bdeccbe382e@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 07/16] btrfs: zoned: Cache superblock location in btrfs_zoned_device_info Pankaj Raghav
     [not found]   ` <CGME20220427160304eucas1p1a0080df82f76c39882c4298c3c3d99fd@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 08/16] btrfs: zoned: add generic btrfs helpers for zoned devices Pankaj Raghav
     [not found]   ` <CGME20220427160305eucas1p26831c19df0b2097e42209edcf73526b7@eucas1p2.samsung.com>
2022-04-27 16:02     ` [PATCH 09/16] btrfs: zoned: Make sb_zone_number function non power of 2 compatible Pankaj Raghav
     [not found]   ` <CGME20220427160306eucas1p10514a8597007ed9d5e269d659df58d35@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 10/16] btrfs: zoned: use btrfs zone helpers to support non po2 zoned devices Pankaj Raghav
     [not found]   ` <CGME20220427160307eucas1p229f9ebae38fcca9974909799e5e63ccf@eucas1p2.samsung.com>
2022-04-27 16:02     ` [PATCH 11/16] btrfs: zoned: relax the alignment constraint for " Pankaj Raghav
     [not found]   ` <CGME20220427160309eucas1p2f677c8db581616f994473f17c4a5bd44@eucas1p2.samsung.com>
2022-04-27 16:02     ` [PATCH 12/16] zonefs: allow non power of 2 " Pankaj Raghav
2022-04-27 23:39       ` Damien Le Moal
2022-04-28 15:54         ` Pankaj Raghav
2022-04-28 21:49           ` Damien Le Moal
2022-04-29  7:55             ` Pankaj Raghav
     [not found]   ` <CGME20220427160310eucas1p28cd3c5ff4fb7a04bc77c4c0b9d96bb74@eucas1p2.samsung.com>
2022-04-27 16:02     ` [PATCH 13/16] null_blk: " Pankaj Raghav
2022-04-29 17:30       ` Adam Manzanares
2022-05-03 17:01       ` Bart Van Assche
2022-05-04 17:10       ` Hannes Reinecke
     [not found]   ` <CGME20220427160311eucas1p151141fc73adc590b40ad6f935b1ac214@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 14/16] f2fs: call bdev_zone_sectors() only once on init_blkz_info() Pankaj Raghav
2022-05-03 20:04       ` Jaegeuk Kim
     [not found]   ` <CGME20220427160312eucas1p279bcffd97ef83bd3617a38b80d979746@eucas1p2.samsung.com>
2022-04-27 16:02     ` [PATCH 15/16] f2fs: ensure only power of 2 zone sizes are allowed Pankaj Raghav
2022-05-03 20:05       ` Jaegeuk Kim
2022-05-04  8:53         ` Pankaj Raghav
     [not found]   ` <CGME20220427160313eucas1p1feecf74ec15c8c3d9250444710fd1676@eucas1p1.samsung.com>
2022-04-27 16:02     ` [PATCH 16/16] dm-zoned: " Pankaj Raghav
2022-04-27 23:42       ` Damien Le Moal
2022-04-28 17:34         ` Luis Chamberlain
2022-04-28 21:43           ` Damien Le Moal
2022-04-28 22:06             ` Luis Chamberlain
2022-05-04 17:11       ` Hannes Reinecke
2022-05-02 22:07   ` [PATCH 00/16] support non power of 2 zoned devices Johannes Thumshirn
2022-05-03  9:12     ` Pankaj Raghav
2022-05-04 21:14       ` David Sterba
2022-05-05  7:28         ` Pankaj Raghav

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).