linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 00/13] support non power of 2 zoned devices
       [not found] <CGME20220516165418eucas1p2be592d9cd4b35f6b71d39ccbe87f3fef@eucas1p2.samsung.com>
@ 2022-05-16 16:54 ` Pankaj Raghav
       [not found]   ` <CGME20220516165419eucas1p104aadda60df323e6154bfc3b92103b7b@eucas1p1.samsung.com>
                     ` (13 more replies)
  0 siblings, 14 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2").

NAND, which is the media used in newer zoned storage devices, does not
naturally align to po2. In these devices, zone cap is not the same as the
po2 zone size. When the zone cap != zone size, then unmapped LBAs are
introduced to cover the space between the zone cap and zone size. po2
requirement does not make sense for these type of zone storage devices.
This patch series aims to remove these unmapped LBAs for zoned devices when
zone cap is npo2. This is done by relaxing the po2 zone size constraint
in the kernel and allowing zoned device with npo2 zone sizes if zone cap
== zone size.

Removing the po2 requirement from zone storage should be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [0].
Additional kernel stop-gap patches are provided in this series for dm-zoned.
Support for npo2 zonefs and btrfs support is addressed in this series.

There was an effort previously [1] to add support to non po2 devices via
device level emulation but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[2].

- Patchset description:
This patchset aims at adding support to non power of 2 zoned devices in
the block layer, nvme layer, null blk and adds support to btrfs and
zonefs.

This round of patches **will not** support DM layer for non
power of 2 zoned devices. More about this in the future work section.

Patches 1-2 deals with removing the po2 constraint from the
block layer.

Patches 3-4 deals with removing the constraint from nvme zns.

Patches 5-9 adds support to btrfs for non po2 zoned devices.

Patch 10 removes the po2 constraint in ZoneFS

Patch 11-12 removes the po2 contraint in null blk

Patches 13 adds conditions to not allow non power of 2 devices in
DM.

The patch series is based on linux-next tag: next-20220502

- Performance:
PO2 zone sizes utilizes log and shifts instead of division when
determing alignment, zone number, etc. The same math cannot be used when
using a zoned device with non po2 zone size. Hence, to avoid any performance
regression on zoned devices with po2 zone sizes, the optimized math in the
hot paths has been retained with branching.

The performance was measured using null blk for regression
and the results have been posted in the appropriate commit log. No
performance regression was noticed.

- Testing
With respect to testing we need to tackle two things: one for regression
on po2 zoned device and progression on non po2 zoned devices.

kdevops (https://github.com/mcgrof/kdevops) was extensively used to
automate the testing for blktests and (x)fstests for btrfs changes. The
known failures were excluded during the test based on the baseline
v5.17.0-rc7

-- regression
Emulated zoned device with zone size =128M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Changes were tested with the following profile in QEMU:
[btrfs_simple_zns]
TEST_DIR=<dir>
SCRATCH_MNT=<mnt>
FSTYP=btrfs
MKFS_OPTIONS="-f -d single -m single"
TEST_DEV=<dev>
SCRATCH_DEV_POOL=<dev-pool>

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests-nullblk.sh and zonefs-tests.sh from zonefs-tools were run
with no failures.

nullblk:
t/zbd/run-tests-against-nullb from fio was run with no failures.

DM:
It was verified if dm-zoned successfully mounts without any
error.

-- progression
Emulated zoned device with zone size = 96M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Same profile as po2 zone size was used.

Many tests in xfstests for btrfs included dm-flakey and some tests
required dm-linear. As they are not supported at the moment for non
po2 devices, those **tests were excluded for non po2 devices**.

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests.sh from zonefs-tools were run with no failures.

nullblk:
A new section was added to cover non po2 devices:

section14()
{
       conv_pcnt=10
       zone_size=3
       zone_capacity=3
       max_open=${set_max_open}
       zbd_test_opts+=("-o ${max_open}")
}
t/zbd/run-tests-against-nullb from fio was run with no failures.

DM:
It was verified that dm-zoned does not mount.

- Tools:
Some tools had to be updated to support non po2 devices. Once these
patches are accepted in the kernel, these tool updates will also be
upstreamed.
* btrfs-prog: https://github.com/Panky-codes/btrfs-progs/tree/remove-po2-btrfs
* blkzone: https://github.com/Panky-codes/util-linux/tree/remove-po2
* zonefs-tools: https://github.com/Panky-codes/zonefs-tools/tree/remove-po2

- Future work
To reduce the amount of changes and testing, support for DM was
excluded in this round of patches. The plan is to add support to F2FS
and DM in the forthcoming future.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[1] https://lore.kernel.org/all/20220310094725.GA28499@lst.de/T/
[2] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Changes since v1:
- Put the function declaration and its usage in the same commit (Bart)
- Remove bdev_zone_aligned function (Bart)
- Change the name from blk_queue_zone_aligned to blk_queue_is_zone_start
  (Damien)
- q is never null in from bdev_get_queue (Damien)
- Add condition during bringup and check for zsze == zcap for npo2
  drives (Damien)
- Rounddown operation should be made generic to work in 32 bits arch
  (bart)
- Add comments where generic calculation is directly used instead having
  special handling for po2 zone sizes (Hannes)
- Make the minimum zone size alignment requirement for btrfs to be 1M
  instead of BTRFS_STRIPE_LEN(David)

Changes since v2:
- Minor formatting changes

Changes since v3:
- Make superblock mirror align with the existing superblock log offsets
  (David)
- DM change return value and remove extra newline
- Optimize null blk zone index lookup with shift for po2 zone size

Luis Chamberlain (1):
  dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
  block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2
    zsze
  block: allow blk-zoned devices to have non-power-of-2 zone size
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
  btrfs: zoned: Make sb_zone_number function non power of 2 compatible
  btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned
    devices
  btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  btrfs: zoned: relax the alignment constraint for zoned devices
  zonefs: allow non power of 2 zoned devices
  null_blk: allow non power of 2 zoned devices
  null_blk: use zone_size_sects_shift for power of 2 zoned devices

 block/blk-core.c                  |   3 +-
 block/blk-zoned.c                 |  40 +++++--
 drivers/block/null_blk/main.c     |   5 +-
 drivers/block/null_blk/null_blk.h |   6 +
 drivers/block/null_blk/zoned.c    |  20 ++--
 drivers/md/dm-zone.c              |  12 ++
 drivers/nvme/host/zns.c           |  24 ++--
 drivers/nvme/target/zns.c         |   2 +-
 fs/btrfs/volumes.c                |  24 ++--
 fs/btrfs/zoned.c                  | 191 +++++++++++++++++++++---------
 fs/btrfs/zoned.h                  |  44 ++++++-
 fs/zonefs/super.c                 |   6 +-
 fs/zonefs/zonefs.h                |   1 -
 include/linux/blkdev.h            |  37 +++++-
 14 files changed, 303 insertions(+), 112 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v4 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
       [not found]   ` <CGME20220516165419eucas1p104aadda60df323e6154bfc3b92103b7b@eucas1p1.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain,
	Hannes Reinecke

Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-zoned.c      | 13 ++++++++++---
 include/linux/blkdev.h |  8 +++++++-
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 38cd840d8..140230134 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -111,16 +111,23 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
  * blkdev_nr_zones - Get number of zones
  * @disk:	Target gendisk
  *
- * Return the total number of zones of a zoned block device.  For a block
- * device without zone capabilities, the number of zones is always 0.
+ * Return the total number of zones of a zoned block device, including the
+ * eventual small last zone if present. For a block device without zone
+ * capabilities, the number of zones is always 0.
  */
 unsigned int blkdev_nr_zones(struct gendisk *disk)
 {
 	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
+	sector_t capacity = get_capacity(disk);
 
 	if (!blk_queue_is_zoned(disk->queue))
 		return 0;
-	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return (capacity + zone_sectors - 1) >>
+		       ilog2(zone_sectors);
+
+	return div64_u64(capacity + zone_sectors - 1, zone_sectors);
 }
 EXPORT_SYMBOL_GPL(blkdev_nr_zones);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1b24c1fb3..22fe512ee 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -675,9 +675,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 					     sector_t sector)
 {
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+
 	if (!blk_queue_is_zoned(q))
 		return 0;
-	return sector >> ilog2(q->limits.chunk_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return sector >> ilog2(zone_sectors);
+
+	return div64_u64(sector, zone_sectors);
 }
 
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size
       [not found]   ` <CGME20220516165421eucas1p2515446ac290987bdb9af24ffb835b287@eucas1p2.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  2022-05-16 19:05       ` Pankaj Raghav
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain

Checking if a given sector is aligned to a zone is a common
operation that is performed for zoned devices. Add
blk_queue_is_zone_start helper to check for this instead of opencoding it
everywhere.

Convert the calculations on zone size to be generic instead of relying on
power_of_2 based logic in the block layer using the helpers wherever
possible.

The only hot path affected by this change for power_of_2 zoned devices
is in blk_check_zone_append() but blk_queue_is_zone_start() helper is
used to optimize the calculation for po2 zone sizes. Note that the append
path cannot be accessed by direct raw access to the block device but only
through a filesystem abstraction.

Finally, allow non power of 2 zoned devices provided that their zone
capacity and zone size are equal. The main motivation to allow non
power_of_2 zoned device is to remove the unmapped LBA between zcap and
zsze for devices that cannot have a power_of_2 zcap.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-core.c       |  3 +--
 block/blk-zoned.c      | 27 +++++++++++++++++++++------
 include/linux/blkdev.h | 22 ++++++++++++++++++++++
 3 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f305cb66c..b7051b7ea 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (pos & (blk_queue_zone_sectors(q) - 1) ||
-	    !blk_queue_zone_is_seq(q, pos))
+	if (!blk_queue_is_zone_start(q, pos) || !blk_queue_zone_is_seq(q, pos))
 		return BLK_STS_IOERR;
 
 	/*
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 140230134..cfc2fb804 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -289,10 +289,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
 		return -EINVAL;
 
 	/* Check alignment (handle eventual smaller last zone) */
-	if (sector & (zone_sectors - 1))
+	if (!blk_queue_is_zone_start(q, sector))
 		return -EINVAL;
 
-	if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+	if (!blk_queue_is_zone_start(q, nr_sectors) && end_sector != capacity)
 		return -EINVAL;
 
 	/*
@@ -490,14 +490,29 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	 * smaller last zone.
 	 */
 	if (zone->start == 0) {
-		if (zone->len == 0 || !is_power_of_2(zone->len)) {
-			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
-				disk->disk_name, zone->len);
+		if (zone->len == 0) {
+			pr_warn("%s: Invalid zone size",
+				disk->disk_name);
+			return -ENODEV;
+		}
+
+		/*
+		 * Don't allow zoned device with non power_of_2 zone size with
+		 * zone capacity less than zone size.
+		 */
+		if (!is_power_of_2(zone->len) &&
+		    zone->capacity < zone->len) {
+			pr_warn("%s: Invalid zoned size with non power of 2 zone size and zone capacity < zone size",
+				disk->disk_name);
 			return -ENODEV;
 		}
 
 		args->zone_sectors = zone->len;
-		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+		/*
+		 * Division is used to calculate nr_zones for both power_of_2
+		 * and non power_of_2 zone sizes as it is not in the hot path.
+		 */
+		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
 	} else if (zone->start + args->zone_sectors < capacity) {
 		if (zone->len != args->zone_sectors) {
 			pr_warn("%s: Invalid zoned device with non constant zone size\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 22fe512ee..32d7bd7b1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -686,6 +686,22 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 	return div64_u64(sector, zone_sectors);
 }
 
+static inline bool blk_queue_is_zone_start(struct request_queue *q, sector_t sec)
+{
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+	u64 remainder = 0;
+
+	if (!blk_queue_is_zoned(q))
+		return false;
+
+	if (is_power_of_2(zone_sectors))
+		return IS_ALIGNED(sec, zone_sectors);
+
+	div64_u64_rem(sec, zone_sectors, &remainder);
+	/* if there is a remainder, then the sector is not aligned */
+	return remainder == 0;
+}
+
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
 					 sector_t sector)
 {
@@ -732,6 +748,12 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 {
 	return 0;
 }
+
+static inline bool blk_queue_is_zone_start(struct request_queue *q, sector_t sec)
+{
+	return false;
+}
+
 static inline unsigned int queue_max_open_zones(const struct request_queue *q)
 {
 	return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 03/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
       [not found]   ` <CGME20220516165422eucas1p174acec28848a9c2178376f092af3fa1c@eucas1p1.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/zns.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 9f81beb4d..65d2aa68a 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	}
 
 	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
-	if (!is_power_of_2(ns->zsze)) {
-		dev_warn(ns->ctrl->device,
-			"invalid zone size:%llu for namespace:%u\n",
-			ns->zsze, ns->head->ns_id);
-		status = -ENODEV;
-		goto free_data;
-	}
 
 	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -128,8 +121,13 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 	const size_t min_bufsize = sizeof(struct nvme_zone_report) +
 				   sizeof(struct nvme_zone_descriptor);
 
+	/*
+	 * Division is used to calculate nr_zones with no special handling
+	 * for power of 2 zone sizes as this function is not invoked in a
+	 * hot path
+	 */
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+			 div64_u64(get_capacity(ns->disk), ns->zsze));
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -182,6 +180,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	int ret, zone_idx = 0;
 	unsigned int nz, i;
 	size_t buflen;
+	u64 remainder = 0;
 
 	if (ns->head->ids.csi != NVME_CSI_ZNS)
 		return -EINVAL;
@@ -197,7 +196,14 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	/*
+	 * rounddown the sector value to the nearest zone size. roundown macro
+	 * provided in math.h will not work for 32 bit architectures.
+	 * Division is used here with no special handling for power of 2
+	 * zone sizes as this function is not invoked in a hot path
+	 */
+	div64_u64_rem(sector, ns->zsze, &remainder);
+	sector -= remainder;
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 04/13] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
       [not found]   ` <CGME20220516165424eucas1p2ee38cd64260539e5cac8d1fa4d0cba38@eucas1p2.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  2022-05-17 14:19       ` Johannes Thumshirn
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Bart Van Assche,
	Hannes Reinecke, Luis Chamberlain

A generic bdev_zone_no helper is added to calculate zone number for a given
sector in a block device. This helper internally uses blk_queue_zone_no to
find the zone number.

Use the helper bdev_zone_no() to calculate nr of zones. This let's us
make modifications to the math if needed in one place and adds now
support for npo2 zone devices.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/target/zns.c | 2 +-
 include/linux/blkdev.h    | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index 82b61acf7..5516dd6cc 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -242,7 +242,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
 	unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
 
 	return blkdev_nr_zones(req->ns->bdev->bd_disk) -
-		(sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
+	       bdev_zone_no(req->ns->bdev, sect);
 }
 
 static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32d7bd7b1..967790f51 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1370,6 +1370,13 @@ static inline sector_t bdev_zone_sectors(struct block_device *bdev)
 	return 0;
 }
 
+static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_queue_zone_no(q, sec);
+}
+
 static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 05/13] btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
       [not found]   ` <CGME20220516165425eucas1p29fcd11d7051d9d3a9a9efc17cd3b6999@eucas1p2.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  2022-05-16 21:58       ` David Sterba
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain

Instead of calculating the superblock location every time, cache the
superblock zone location in btrfs_zoned_device_info struct and use it to
locate the zone index.

The functions such as btrfs_sb_log_location_bdev() and
btrfs_reset_sb_log_zones() which work directly on block_device shall
continue to use the sb_zone_number because btrfs_zoned_device_info
struct might not have been initialized at that point.

This patch will enable non power-of-2 zoned devices to not perform
division to lookup superblock and its mirror location.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.c | 13 +++++++++----
 fs/btrfs/zoned.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 06f22c021..e8c7cebb2 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -511,6 +511,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 			   max_active_zones - nactive);
 	}
 
+	/* Cache the sb zone number */
+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
+		zone_info->sb_zone_location[i] =
+			sb_zone_number(zone_info->zone_size_shift, i);
+	}
 	/* Validate superblock log */
 	nr_zones = BTRFS_NR_SB_LOG_ZONES;
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
@@ -518,7 +523,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 		u64 sb_wp;
 		int sb_pos = BTRFS_NR_SB_LOG_ZONES * i;
 
-		sb_zone = sb_zone_number(zone_info->zone_size_shift, i);
+		sb_zone = zone_info->sb_zone_location[i];
 		if (sb_zone + 1 >= zone_info->nr_zones)
 			continue;
 
@@ -866,7 +871,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
 		return 0;
 	}
 
-	zone_num = sb_zone_number(zinfo->zone_size_shift, mirror);
+	zone_num = zinfo->sb_zone_location[mirror];
 	if (zone_num + 1 >= zinfo->nr_zones)
 		return -ENOENT;
 
@@ -883,7 +888,7 @@ static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
 	if (!zinfo)
 		return false;
 
-	zone_num = sb_zone_number(zinfo->zone_size_shift, mirror);
+	zone_num = zinfo->sb_zone_location[mirror];
 	if (zone_num + 1 >= zinfo->nr_zones)
 		return false;
 
@@ -1011,7 +1016,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 			u32 sb_zone;
 			u64 sb_pos;
 
-			sb_zone = sb_zone_number(shift, i);
+			sb_zone = zinfo->sb_zone_location[i];
 			if (!(end <= sb_zone ||
 			      sb_zone + BTRFS_NR_SB_LOG_ZONES <= begin)) {
 				have_sb = true;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 10f31d1c8..694ab6d1e 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -27,6 +27,7 @@ struct btrfs_zoned_device_info {
 	unsigned long *active_zones;
 	struct blk_zone *zone_cache;
 	struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX];
+	u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
 };
 
 #ifdef CONFIG_BLK_DEV_ZONED
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 06/13] btrfs: zoned: Make sb_zone_number function non power of 2 compatible
       [not found]   ` <CGME20220516165427eucas1p1cfd87ca44ec314ea1d2ddc8ece7259f9@eucas1p1.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  2022-05-17  6:53       ` Johannes Thumshirn
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain

Make the calculation in sb_zone_number function to be generic and work
for both power-of-2 and non power-of-2 zone sizes.

The function signature has been modified to take block device and mirror
as input as this function is only invoked from callers that have access
to the block device. This enables to use the generic bdev_zone_no
function provided by the block layer to calculate the zone number.

Even though division is used to calculate the zone index for non
power-of-2 zone sizes, this function will not be used in the fast path as
the sb_zone_location cache is used for the superblock zone location.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index e8c7cebb2..5be2ef7bb 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -34,9 +34,6 @@
 #define BTRFS_SB_LOG_FIRST_OFFSET	(512ULL * SZ_1G)
 #define BTRFS_SB_LOG_SECOND_OFFSET	(4096ULL * SZ_1G)
 
-#define BTRFS_SB_LOG_FIRST_SHIFT	const_ilog2(BTRFS_SB_LOG_FIRST_OFFSET)
-#define BTRFS_SB_LOG_SECOND_SHIFT	const_ilog2(BTRFS_SB_LOG_SECOND_OFFSET)
-
 /* Number of superblock log zones */
 #define BTRFS_NR_SB_LOG_ZONES 2
 
@@ -153,15 +150,23 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
 /*
  * Get the first zone number of the superblock mirror
  */
-static inline u32 sb_zone_number(int shift, int mirror)
+static inline u32 sb_zone_number(struct block_device *bdev, int mirror)
 {
 	u64 zone;
 
 	ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX);
 	switch (mirror) {
-	case 0: zone = 0; break;
-	case 1: zone = 1ULL << (BTRFS_SB_LOG_FIRST_SHIFT - shift); break;
-	case 2: zone = 1ULL << (BTRFS_SB_LOG_SECOND_SHIFT - shift); break;
+	case 0:
+		zone = 0;
+		break;
+	case 1:
+		zone = bdev_zone_no(bdev,
+				    BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT);
+		break;
+	case 2:
+		zone = bdev_zone_no(bdev,
+				    BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT);
+		break;
 	}
 
 	ASSERT(zone <= U32_MAX);
@@ -514,7 +519,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	/* Cache the sb zone number */
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
 		zone_info->sb_zone_location[i] =
-			sb_zone_number(zone_info->zone_size_shift, i);
+			sb_zone_number(bdev, i);
 	}
 	/* Validate superblock log */
 	nr_zones = BTRFS_NR_SB_LOG_ZONES;
@@ -839,7 +844,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 	nr_sectors = bdev_nr_sectors(bdev);
 	nr_zones = nr_sectors >> zone_sectors_shift;
 
-	sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
+	sb_zone = sb_zone_number(bdev, mirror);
 	if (sb_zone + 1 >= nr_zones)
 		return -ENOENT;
 
@@ -963,7 +968,7 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
 	nr_sectors = bdev_nr_sectors(bdev);
 	nr_zones = nr_sectors >> zone_sectors_shift;
 
-	sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
+	sb_zone = sb_zone_number(bdev, mirror);
 	if (sb_zone + 1 >= nr_zones)
 		return -ENOENT;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 07/13] btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned devices
       [not found]   ` <CGME20220516165428eucas1p1374b5f9592db3ca6a6551aff975537ce@eucas1p1.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  2022-05-17 12:30       ` David Sterba
  2022-05-19  4:13       ` Naohiro Aota
  0 siblings, 2 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain

Add helpers to calculate alignment, round up and round down
for zoned devices. These helpers encapsulates the necessary handling for
power_of_2 and non-power_of_2 zone sizes. Optimized calculations are
performed for zone sizes that are power_of_2 with log and shifts.

btrfs_zoned_is_aligned() is added instead of reusing bdev_zone_aligned()
helper due to some use cases in btrfs where zone alignment is checked
before having access to the underlying block device such as in this
function: btrfs_load_block_group_zone_info().

Use the generic btrfs zone helpers to calculate zone index, check zone
alignment, round up and round down operations.

The zone_size_shift field is not needed anymore as generic helpers are
used for calculation.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/volumes.c | 24 +++++++++-------
 fs/btrfs/zoned.c   | 72 ++++++++++++++++++++++------------------------
 fs/btrfs/zoned.h   | 43 +++++++++++++++++++++++----
 3 files changed, 85 insertions(+), 54 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 94f851592..3d6b9a25a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1408,7 +1408,7 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
 		 * allocator, because we anyway use/reserve the first two zones
 		 * for superblock logging.
 		 */
-		return ALIGN(start, device->zone_info->zone_size);
+		return btrfs_zoned_roundup(start, device->zone_info->zone_size);
 	default:
 		BUG();
 	}
@@ -1423,7 +1423,7 @@ static bool dev_extent_hole_check_zoned(struct btrfs_device *device,
 	int ret;
 	bool changed = false;
 
-	ASSERT(IS_ALIGNED(*hole_start, zone_size));
+	ASSERT(btrfs_zoned_is_aligned(*hole_start, zone_size));
 
 	while (*hole_size > 0) {
 		pos = btrfs_find_allocatable_zones(device, *hole_start,
@@ -1560,7 +1560,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	search_start = dev_extent_search_start(device, search_start);
 
 	WARN_ON(device->zone_info &&
-		!IS_ALIGNED(num_bytes, device->zone_info->zone_size));
+		!btrfs_zoned_is_aligned(num_bytes, device->zone_info->zone_size));
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -5111,8 +5111,8 @@ static void init_alloc_chunk_ctl_policy_zoned(
 
 	ctl->max_stripe_size = zone_size;
 	if (type & BTRFS_BLOCK_GROUP_DATA) {
-		ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
-						 zone_size);
+		ctl->max_chunk_size = btrfs_zoned_rounddown(
+			BTRFS_MAX_DATA_CHUNK_SIZE, zone_size);
 	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
 		ctl->max_chunk_size = ctl->max_stripe_size;
 	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
@@ -5124,9 +5124,10 @@ static void init_alloc_chunk_ctl_policy_zoned(
 	}
 
 	/* We don't want a chunk larger than 10% of writable space */
-	limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1),
-			       zone_size),
-		    min_chunk_size);
+	limit = max(
+		btrfs_zoned_rounddown(div_factor(fs_devices->total_rw_bytes, 1),
+				      zone_size),
+		min_chunk_size);
 	ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
 	ctl->dev_extent_min = zone_size * ctl->dev_stripes;
 }
@@ -6729,7 +6730,8 @@ static void submit_stripe_bio(struct btrfs_io_context *bioc,
 	 */
 	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 		if (btrfs_dev_is_sequential(dev, physical)) {
-			u64 zone_start = round_down(physical, fs_info->zone_size);
+			u64 zone_start = btrfs_zoned_rounddown(physical,
+							fs_info->zone_size);
 
 			bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
 		} else {
@@ -8051,8 +8053,8 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	if (dev->zone_info) {
 		u64 zone_size = dev->zone_info->zone_size;
 
-		if (!IS_ALIGNED(physical_offset, zone_size) ||
-		    !IS_ALIGNED(physical_len, zone_size)) {
+		if (!btrfs_zoned_is_aligned(physical_offset, zone_size) ||
+		    !btrfs_zoned_is_aligned(physical_len, zone_size)) {
 			btrfs_err(fs_info,
 "zoned: dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
 				  devid, physical_offset, physical_len);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 5be2ef7bb..3023c871e 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -177,13 +177,13 @@ static inline u32 sb_zone_number(struct block_device *bdev, int mirror)
 static inline sector_t zone_start_sector(u32 zone_number,
 					 struct block_device *bdev)
 {
-	return (sector_t)zone_number << ilog2(bdev_zone_sectors(bdev));
+	return zone_number * bdev_zone_sectors(bdev);
 }
 
 static inline u64 zone_start_physical(u32 zone_number,
 				      struct btrfs_zoned_device_info *zone_info)
 {
-	return (u64)zone_number << zone_info->zone_size_shift;
+	return zone_number * zone_info->zone_size;
 }
 
 /*
@@ -236,8 +236,8 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 	if (zinfo->zone_cache) {
 		unsigned int i;
 
-		ASSERT(IS_ALIGNED(pos, zinfo->zone_size));
-		zno = pos >> zinfo->zone_size_shift;
+		ASSERT(btrfs_zoned_is_aligned(pos, zinfo->zone_size));
+		zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 		/*
 		 * We cannot report zones beyond the zone end. So, it is OK to
 		 * cap *nr_zones to at the end.
@@ -409,9 +409,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	}
 
 	nr_sectors = bdev_nr_sectors(bdev);
-	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
-	zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
-	if (!IS_ALIGNED(nr_sectors, zone_sectors))
+	zone_info->nr_zones = bdev_zone_no(bdev, nr_sectors);
+	if (!btrfs_zoned_is_aligned(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
 
 	max_active_zones = bdev_max_active_zones(bdev);
@@ -823,10 +822,8 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 			       u64 *bytenr_ret)
 {
 	struct blk_zone zones[BTRFS_NR_SB_LOG_ZONES];
-	sector_t zone_sectors;
 	u32 sb_zone;
 	int ret;
-	u8 zone_sectors_shift;
 	sector_t nr_sectors;
 	u32 nr_zones;
 
@@ -837,12 +834,10 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 
 	ASSERT(rw == READ || rw == WRITE);
 
-	zone_sectors = bdev_zone_sectors(bdev);
-	if (!is_power_of_2(zone_sectors))
+	if (!is_power_of_2(bdev_zone_sectors(bdev)))
 		return -EINVAL;
-	zone_sectors_shift = ilog2(zone_sectors);
 	nr_sectors = bdev_nr_sectors(bdev);
-	nr_zones = nr_sectors >> zone_sectors_shift;
+	nr_zones = bdev_zone_no(bdev, nr_sectors);
 
 	sb_zone = sb_zone_number(bdev, mirror);
 	if (sb_zone + 1 >= nr_zones)
@@ -959,14 +954,12 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
 {
 	sector_t zone_sectors;
 	sector_t nr_sectors;
-	u8 zone_sectors_shift;
 	u32 sb_zone;
 	u32 nr_zones;
 
 	zone_sectors = bdev_zone_sectors(bdev);
-	zone_sectors_shift = ilog2(zone_sectors);
 	nr_sectors = bdev_nr_sectors(bdev);
-	nr_zones = nr_sectors >> zone_sectors_shift;
+	nr_zones = bdev_zone_no(bdev, nr_sectors);
 
 	sb_zone = sb_zone_number(bdev, mirror);
 	if (sb_zone + 1 >= nr_zones)
@@ -992,18 +985,17 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 				 u64 hole_end, u64 num_bytes)
 {
 	struct btrfs_zoned_device_info *zinfo = device->zone_info;
-	const u8 shift = zinfo->zone_size_shift;
-	u64 nzones = num_bytes >> shift;
+	u64 nzones = bdev_zone_no(device->bdev, num_bytes >> SECTOR_SHIFT);
 	u64 pos = hole_start;
 	u64 begin, end;
 	bool have_sb;
 	int i;
 
-	ASSERT(IS_ALIGNED(hole_start, zinfo->zone_size));
-	ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+	ASSERT(btrfs_zoned_is_aligned(hole_start, zinfo->zone_size));
+	ASSERT(btrfs_zoned_is_aligned(num_bytes, zinfo->zone_size));
 
 	while (pos < hole_end) {
-		begin = pos >> shift;
+		begin = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 		end = begin + nzones;
 
 		if (end > zinfo->nr_zones)
@@ -1035,8 +1027,9 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 			if (!(pos + num_bytes <= sb_pos ||
 			      sb_pos + BTRFS_SUPER_INFO_SIZE <= pos)) {
 				have_sb = true;
-				pos = ALIGN(sb_pos + BTRFS_SUPER_INFO_SIZE,
-					    zinfo->zone_size);
+				pos = btrfs_zoned_roundup(
+					sb_pos + BTRFS_SUPER_INFO_SIZE,
+					zinfo->zone_size);
 				break;
 			}
 		}
@@ -1050,7 +1043,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
 static bool btrfs_dev_set_active_zone(struct btrfs_device *device, u64 pos)
 {
 	struct btrfs_zoned_device_info *zone_info = device->zone_info;
-	unsigned int zno = (pos >> zone_info->zone_size_shift);
+	unsigned int zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 
 	/* We can use any number of zones */
 	if (zone_info->max_active_zones == 0)
@@ -1072,7 +1065,7 @@ static bool btrfs_dev_set_active_zone(struct btrfs_device *device, u64 pos)
 static void btrfs_dev_clear_active_zone(struct btrfs_device *device, u64 pos)
 {
 	struct btrfs_zoned_device_info *zone_info = device->zone_info;
-	unsigned int zno = (pos >> zone_info->zone_size_shift);
+	unsigned int zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 
 	/* We can use any number of zones */
 	if (zone_info->max_active_zones == 0)
@@ -1108,14 +1101,14 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
 {
 	struct btrfs_zoned_device_info *zinfo = device->zone_info;
-	const u8 shift = zinfo->zone_size_shift;
-	unsigned long begin = start >> shift;
-	unsigned long end = (start + size) >> shift;
+	unsigned long begin = bdev_zone_no(device->bdev, start >> SECTOR_SHIFT);
+	unsigned long end =
+		bdev_zone_no(device->bdev, (start + size) >> SECTOR_SHIFT);
 	u64 pos;
 	int ret;
 
-	ASSERT(IS_ALIGNED(start, zinfo->zone_size));
-	ASSERT(IS_ALIGNED(size, zinfo->zone_size));
+	ASSERT(btrfs_zoned_is_aligned(start, zinfo->zone_size));
+	ASSERT(btrfs_zoned_is_aligned(size, zinfo->zone_size));
 
 	if (end > zinfo->nr_zones)
 		return -ERANGE;
@@ -1139,8 +1132,9 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
 		/* Free regions should be empty */
 		btrfs_warn_in_rcu(
 			device->fs_info,
-		"zoned: resetting device %s (devid %llu) zone %llu for allocation",
-			rcu_str_deref(device->name), device->devid, pos >> shift);
+			"zoned: resetting device %s (devid %llu) zone %u for allocation",
+			rcu_str_deref(device->name), device->devid,
+			bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT));
 		WARN_ON_ONCE(1);
 
 		ret = btrfs_reset_device_zone(device, pos, zinfo->zone_size,
@@ -1237,7 +1231,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		return 0;
 
 	/* Sanity check */
-	if (!IS_ALIGNED(length, fs_info->zone_size)) {
+	if (!btrfs_zoned_is_aligned(length, fs_info->zone_size)) {
 		btrfs_err(fs_info,
 		"zoned: block group %llu len %llu unaligned to zone size %llu",
 			  logical, length, fs_info->zone_size);
@@ -1325,7 +1319,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		 * The group is mapped to a sequential zone. Get the zone write
 		 * pointer to determine the allocation offset within the zone.
 		 */
-		WARN_ON(!IS_ALIGNED(physical[i], fs_info->zone_size));
+		WARN_ON(!btrfs_zoned_is_aligned(physical[i], fs_info->zone_size));
 		nofs_flag = memalloc_nofs_save();
 		ret = btrfs_get_dev_zone(device, physical[i], &zone);
 		memalloc_nofs_restore(nofs_flag);
@@ -1351,10 +1345,12 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
 		switch (zone.cond) {
 		case BLK_ZONE_COND_OFFLINE:
 		case BLK_ZONE_COND_READONLY:
-			btrfs_err(fs_info,
-		"zoned: offline/readonly zone %llu on device %s (devid %llu)",
-				  physical[i] >> device->zone_info->zone_size_shift,
-				  rcu_str_deref(device->name), device->devid);
+			btrfs_err(
+				fs_info,
+				"zoned: offline/readonly zone %u on device %s (devid %llu)",
+				bdev_zone_no(device->bdev,
+					     physical[i] >> SECTOR_SHIFT),
+				rcu_str_deref(device->name), device->devid);
 			alloc_offsets[i] = WP_MISSING_DEV;
 			break;
 		case BLK_ZONE_COND_EMPTY:
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 694ab6d1e..b94ce4d1f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -9,6 +9,7 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "btrfs_inode.h"
+#include "misc.h"
 
 #define BTRFS_DEFAULT_RECLAIM_THRESH           			(75)
 
@@ -18,7 +19,6 @@ struct btrfs_zoned_device_info {
 	 * zoned block device.
 	 */
 	u64 zone_size;
-	u8  zone_size_shift;
 	u32 nr_zones;
 	unsigned int max_active_zones;
 	atomic_t active_zones_left;
@@ -30,6 +30,36 @@ struct btrfs_zoned_device_info {
 	u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
 };
 
+static inline bool btrfs_zoned_is_aligned(u64 pos, u64 zone_size)
+{
+	u64 remainder = 0;
+
+	if (is_power_of_two_u64(zone_size))
+		return IS_ALIGNED(pos, zone_size);
+
+	div64_u64_rem(pos, zone_size, &remainder);
+	return remainder == 0;
+}
+
+static inline u64 btrfs_zoned_roundup(u64 pos, u64 zone_size)
+{
+	if (is_power_of_two_u64(zone_size))
+		return ALIGN(pos, zone_size);
+
+	return div64_u64(pos + zone_size - 1, zone_size) * zone_size;
+}
+
+static inline u64 btrfs_zoned_rounddown(u64 pos, u64 zone_size)
+{
+	u64 remainder = 0;
+	if (is_power_of_two_u64(zone_size))
+		return round_down(pos, zone_size);
+
+	div64_u64_rem(pos, zone_size, &remainder);
+	pos -= remainder;
+	return pos;
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
@@ -253,7 +283,8 @@ static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 	if (!zone_info)
 		return false;
 
-	return test_bit(pos >> zone_info->zone_size_shift, zone_info->seq_zones);
+	return test_bit(bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT),
+			zone_info->seq_zones);
 }
 
 static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
@@ -263,7 +294,8 @@ static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
 	if (!zone_info)
 		return true;
 
-	return test_bit(pos >> zone_info->zone_size_shift, zone_info->empty_zones);
+	return test_bit(bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT),
+			zone_info->empty_zones);
 }
 
 static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
@@ -275,7 +307,7 @@ static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
 	if (!zone_info)
 		return;
 
-	zno = pos >> zone_info->zone_size_shift;
+	zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
 	if (set)
 		set_bit(zno, zone_info->empty_zones);
 	else
@@ -329,7 +361,8 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
 		return false;
 
 	zone_size = device->zone_info->zone_size;
-	if (!IS_ALIGNED(physical, zone_size) || !IS_ALIGNED(length, zone_size))
+	if (!btrfs_zoned_is_aligned(physical, zone_size) ||
+	    !btrfs_zoned_is_aligned(length, zone_size))
 		return false;
 
 	return true;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
       [not found]   ` <CGME20220516165429eucas1p272c8b4325a488675f08f2d7016aa6230@eucas1p2.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  2022-05-17  6:50       ` Johannes Thumshirn
                         ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel

Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
These are fixed at these locations so that recovery tools can reliably
retrieve the superblocks even if one of the mirror gets corrupted.

power of 2 zone sizes align at these offsets irrespective of their
value but non power of 2 zone sizes will not align.

To make sure the first zone at mirror 1 and mirror 2 align, write zero
operation is performed to move the write pointer of the first zone to
the expected offset. This operation is performed only after a zone reset
of the first zone, i.e., when the second zone that contains the sb is FULL.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 63 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 3023c871e..805aeaa76 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 	return 0;
 }
 
+static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
+			     int mirror, u64 *wp_ret)
+{
+	u64 offset = 0;
+	int ret = 0;
+
+	ASSERT(!is_power_of_two_u64(zone->len));
+	ASSERT(zone->wp == zone->start);
+	ASSERT(mirror != 0);
+
+	switch (mirror) {
+	case 1:
+		div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
+			      zone->len, &offset);
+		break;
+	case 2:
+		div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
+			      zone->len, &offset);
+		break;
+	}
+
+	ret =  blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
+	if (ret)
+		return ret;
+
+	zone->wp += offset;
+	zone->cond = BLK_ZONE_COND_IMP_OPEN;
+	*wp_ret = zone->wp << SECTOR_SHIFT;
+
+	return 0;
+}
+
 static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
-			   int rw, u64 *bytenr_ret)
+			   int rw, int mirror, u64 *bytenr_ret)
 {
 	u64 wp;
 	int ret;
+	bool zones_empty = false;
 
 	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
 		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
@@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
 	if (ret != -ENOENT && ret < 0)
 		return ret;
 
+	if (ret == -ENOENT)
+		zones_empty = true;
+
 	if (rw == WRITE) {
 		struct blk_zone *reset = NULL;
+		bool is_sb_offset_write_req = false;
+		u32 reset_zone_nr = -1;
 
-		if (wp == zones[0].start << SECTOR_SHIFT)
+		if (wp == zones[0].start << SECTOR_SHIFT) {
 			reset = &zones[0];
-		else if (wp == zones[1].start << SECTOR_SHIFT)
+			reset_zone_nr = 0;
+		} else if (wp == zones[1].start << SECTOR_SHIFT) {
 			reset = &zones[1];
+			reset_zone_nr = 1;
+		}
+
+		/*
+		 * Non po2 zone sizes will not align naturally at
+		 * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
+		 * 1st zone in those superblock mirrors need to be
+		 * moved to align at those offsets.
+		 */
+		is_sb_offset_write_req =
+			(zones_empty || (reset_zone_nr == 0)) && mirror &&
+			!is_power_of_2(zones[0].len);
 
 		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
 			ASSERT(sb_zone_is_full(reset));
@@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
 			reset->cond = BLK_ZONE_COND_EMPTY;
 			reset->wp = reset->start;
 		}
+
+		if (is_sb_offset_write_req) {
+			ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
+			if (ret)
+				return ret;
+		}
+
 	} else if (ret != -ENOENT) {
 		/*
 		 * For READ, we want the previous one. Move write pointer to
@@ -851,7 +909,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 	if (ret != BTRFS_NR_SB_LOG_ZONES)
 		return -EIO;
 
-	return sb_log_location(bdev, zones, rw, bytenr_ret);
+	return sb_log_location(bdev, zones, rw, mirror, bytenr_ret);
 }
 
 int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
@@ -877,7 +935,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
 
 	return sb_log_location(device->bdev,
 			       &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror],
-			       rw, bytenr_ret);
+			       rw, mirror, bytenr_ret);
 }
 
 static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 09/13] btrfs: zoned: relax the alignment constraint for zoned devices
       [not found]   ` <CGME20220516165430eucas1p214cca8eaba1db2c98d947444cad4f18f@eucas1p2.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain

Checks were in place to return error when a non power-of-2 zoned devices
is detected. Remove those checks as non power-of-2 zoned devices are
now supported.

Relax the zone size constraint to align with a sane default of 1MB.
This 1M default has been chosen as the minimum alignment requirement
for zone sizes to make sure zones align with sectorsize in different
architectures.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/btrfs/zoned.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 805aeaa76..4f3687c54 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -54,6 +54,13 @@
  */
 #define BTRFS_MAX_ZONE_SIZE		SZ_8G
 
+/*
+ * A minimum alignment of 1MB is chosen for zoned devices as their zone sizes
+ * can be non power of 2. This is to make sure the zones correctly align to the
+ * sectorsize.
+ */
+#define BTRFS_ZONED_MIN_ALIGN_SECTORS       ((u64)SZ_1M >> SECTOR_SHIFT)
+
 #define SUPER_INFO_SECTORS	((u64)BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT)
 
 static inline bool sb_zone_is_full(const struct blk_zone *zone)
@@ -394,8 +401,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 		zone_sectors = bdev_zone_sectors(bdev);
 	}
 
-	/* Check if it's power of 2 (see is_power_of_2) */
-	ASSERT(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
+	ASSERT(zone_sectors != 0 &&
+	       IS_ALIGNED(zone_sectors, BTRFS_ZONED_MIN_ALIGN_SECTORS));
 	zone_info->zone_size = zone_sectors << SECTOR_SHIFT;
 
 	/* We reject devices with a zone size larger than 8GB */
@@ -892,9 +899,11 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 
 	ASSERT(rw == READ || rw == WRITE);
 
-	if (!is_power_of_2(bdev_zone_sectors(bdev)))
-		return -EINVAL;
 	nr_sectors = bdev_nr_sectors(bdev);
+
+	if (!IS_ALIGNED(nr_sectors, BTRFS_ZONED_MIN_ALIGN_SECTORS))
+		return -EINVAL;
+
 	nr_zones = bdev_zone_no(bdev, nr_sectors);
 
 	sb_zone = sb_zone_number(bdev, mirror);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 10/13] zonefs: allow non power of 2 zoned devices
       [not found]   ` <CGME20220516165432eucas1p2e1ea74d44738e44745f49e37b6b9e503@eucas1p2.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain

The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/zonefs/super.c  | 6 ++----
 fs/zonefs/zonefs.h | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 57ca775f5..e302c889a 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -451,10 +451,9 @@ static void __zonefs_io_error(struct inode *inode, bool write)
 {
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	struct super_block *sb = inode->i_sb;
-	struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
 	unsigned int noio_flag;
 	unsigned int nr_zones =
-		zi->i_zone_size >> (sbi->s_zone_sectors_shift + SECTOR_SHIFT);
+		bdev_zone_no(sb->s_bdev, zi->i_zone_size >> SECTOR_SHIFT);
 	struct zonefs_ioerr_data err = {
 		.inode = inode,
 		.write = write,
@@ -1375,7 +1374,7 @@ static int zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone,
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
 	int ret;
 
-	inode->i_ino = zone->start >> sbi->s_zone_sectors_shift;
+	inode->i_ino = bdev_zone_no(sb->s_bdev, zone->start);
 	inode->i_mode = S_IFREG | sbi->s_perm;
 
 	zi->i_ztype = type;
@@ -1752,7 +1751,6 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
 	 * interface constraints.
 	 */
 	sb_set_blocksize(sb, bdev_zone_write_granularity(sb->s_bdev));
-	sbi->s_zone_sectors_shift = ilog2(bdev_zone_sectors(sb->s_bdev));
 	sbi->s_uid = GLOBAL_ROOT_UID;
 	sbi->s_gid = GLOBAL_ROOT_GID;
 	sbi->s_perm = 0640;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3..39895195c 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -177,7 +177,6 @@ struct zonefs_sb_info {
 	kgid_t			s_gid;
 	umode_t			s_perm;
 	uuid_t			s_uuid;
-	unsigned int		s_zone_sectors_shift;
 
 	unsigned int		s_nr_files[ZONEFS_ZTYPE_MAX];
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 11/13] null_blk: allow non power of 2 zoned devices
       [not found]   ` <CGME20220516165434eucas1p12b178fb83cc93470933e3d72c40e9004@eucas1p1.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  2022-05-17  4:12       ` kernel test robot
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain,
	Hannes Reinecke

Convert the power of 2 based calculation with zone size to be generic in
null_zone_no with optimization for power of 2 based zone sizes.

The nr_zones calculation in null_init_zoned_dev has been replaced with a
division without special handling for power of 2 based zone sizes as
this function is called only during the initialization and will not
invoked in the hot path.

Performance Measurement:

Device:
zone size = 128M, blocksize=4k

FIO cmd:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd  --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  578     |  2257    |   12.80   |  576     |  2248    |   25.78   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  581     |  2268    |   12.74   |  576     |  2248    |   25.85   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  667     |  2605    |   11.79   |  675     |  2637    |   23.49   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  667     |  2605    |   11.79   |  675     |  2638    |   23.48   |
x-----------------x---------------------------------x---------------------------------x

Random read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch   |  522     |  2038    |   15.05   |  514     |  2006    |   30.87   |
x-----------------x---------------------------------x---------------------------------x
|  With patch     |  522     |  2039    |   15.04   |  523     |  2042    |   30.33   |
x-----------------x---------------------------------x---------------------------------x

Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c  |  5 ++---
 drivers/block/null_blk/zoned.c | 14 +++++++-------
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 5cb4c92cd..53557e014 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1929,9 +1929,8 @@ static int null_validate_conf(struct nullb_device *dev)
 	if (dev->queue_mode == NULL_Q_BIO)
 		dev->mbps = 0;
 
-	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	if (dev->zoned && !dev->zone_size) {
+		pr_err("Invalid zero zone size\n");
 		return -EINVAL;
 	}
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index dae54dd1a..00c34e65e 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -13,7 +13,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
 
 static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 {
-	return sect >> ilog2(dev->zone_size_sects);
+	if (is_power_of_2(dev->zone_size_sects))
+		return sect >> ilog2(dev->zone_size_sects);
+
+	return div64_u64(sect, dev->zone_size_sects);
 }
 
 static inline void null_lock_zone_res(struct nullb_device *dev)
@@ -62,10 +65,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	sector_t sector = 0;
 	unsigned int i;
 
-	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("zone_size must be power-of-two\n");
-		return -EINVAL;
-	}
 	if (dev->zone_size > dev->size) {
 		pr_err("Zone size larger than device capacity\n");
 		return -EINVAL;
@@ -83,8 +82,9 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
-	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
-		>> ilog2(dev->zone_size_sects);
+	dev->nr_zones =
+		div64_u64(roundup(dev_capacity_sects, dev->zone_size_sects),
+			  dev->zone_size_sects);
 
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 12/13] null_blk: use zone_size_sects_shift for power of 2 zoned devices
       [not found]   ` <CGME20220516165435eucas1p1dff8d9d039a76278ef1c09dba4b4e1fe@eucas1p1.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Damien Le Moal

Instead of doing is_power_of_2 and ilog2 operation for every IO, cache
the zone_size_sects_shift variable and use it for power of 2 zoned
devices.

This variable will be set to zero for non power of 2 zoned devices.

Suggested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/null_blk.h |  6 ++++++
 drivers/block/null_blk/zoned.c    | 10 ++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 4525a65e1..3bbae8be4 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -76,6 +76,12 @@ struct nullb_device {
 	unsigned int imp_close_zone_no;
 	struct nullb_zone *zones;
 	sector_t zone_size_sects;
+	/*
+	 * zone_size_sects_shift is only useful when the zone size is
+	 * power of 2. This variable is set to zero when zone size is non
+	 * power of 2.
+	 */
+	unsigned int zone_size_sects_shift;
 	bool need_zone_res_mgmt;
 	spinlock_t zone_res_lock;
 
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 00c34e65e..806bef98a 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -13,8 +13,8 @@ static inline sector_t mb_to_sects(unsigned long mb)
 
 static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 {
-	if (is_power_of_2(dev->zone_size_sects))
-		return sect >> ilog2(dev->zone_size_sects);
+	if (dev->zone_size_sects_shift)
+		return sect >> dev->zone_size_sects_shift;
 
 	return div64_u64(sect, dev->zone_size_sects);
 }
@@ -82,6 +82,12 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
+
+	if (is_power_of_2(dev->zone_size_sects))
+		dev->zone_size_sects_shift = ilog2(dev->zone_size_sects);
+	else
+		dev->zone_size_sects_shift = 0;
+
 	dev->nr_zones =
 		div64_u64(roundup(dev_capacity_sects, dev->zone_size_sects),
 			  dev->zone_size_sects);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 13/13] dm-zoned: ensure only power of 2 zone sizes are allowed
       [not found]   ` <CGME20220516165436eucas1p178d079302dae3a9fca696b13b0390deb@eucas1p1.samsung.com>
@ 2022-05-16 16:54     ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 16:54 UTC (permalink / raw)
  To: axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, p.raghav, linux-kernel, dm-devel, Luis Chamberlain,
	Hannes Reinecke

From: Luis Chamberlain <mcgrof@kernel.org>

Today dm-zoned relies on the assumption that you have a zone size
with a power of 2. Even though the block layer today enforces this
requirement, these devices do exist and so provide a stop-gap measure
to ensure these devices cannot be used by mistake

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/md/dm-zone.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 3e7b1fe15..f0c588c02 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
 	struct request_queue *q = md->queue;
 	unsigned int noio_flag;
 	int ret;
+	struct block_device *bdev = md->disk->part0;
+	sector_t zone_sectors;
+	char bname[BDEVNAME_SIZE];
+
+	zone_sectors = bdev_zone_sectors(bdev);
+
+	if (!is_power_of_2(zone_sectors)) {
+		DMWARN("%s: %s only power of two zone size supported",
+		       dm_device_name(md),
+		       bdevname(bdev, bname));
+		return -EINVAL;
+	}
 
 	/*
 	 * Check if something changed. If yes, cleanup the current resources
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size
  2022-05-16 16:54     ` [PATCH v4 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
@ 2022-05-16 19:05       ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 19:05 UTC (permalink / raw)
  To: damien.lemoal
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, linux-kernel, dm-devel, Luis Chamberlain, axboe,
	pankydev8, dsterba, hch

Hi Damien,
I copied your comments from the previous thread to avoid confusion.


On 2022-05-16 16:00, Damien Le Moal wrote:
> On 2022/05/16 15:39, Pankaj Raghav wrote:
>> Checking if a given sector is aligned to a zone is a common
>> operation that is performed for zoned devices. Add
>> blk_queue_is_zone_start helper to check for this instead of opencoding it
>> everywhere.
>>
>> Convert the calculations on zone size to be generic instead of relying on
>> power_of_2 based logic in the block layer using the helpers wherever
>> possible.
>>
>> @@ -490,14 +490,29 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>>  	 * smaller last zone.
>>  	 */
>>  	if (zone->start == 0) {
>> -		if (zone->len == 0 || !is_power_of_2(zone->len)) {
>> -			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
>> -				disk->disk_name, zone->len);
>> +		if (zone->len == 0) {
>> +			pr_warn("%s: Invalid zone size",
>> +				disk->disk_name);
>
> This fits on one line, no ?
>
Yeah. I don't know why my formatter decided to do that. I will fix it. Thanks.
>> +			return -ENODEV;
>> +		}
>> +
>> +		/*
>> +		 * Don't allow zoned device with non power_of_2 zone size with
>> +		 * zone capacity less than zone size.
>> +		 */
>> +		if (!is_power_of_2(zone->len) &&
>> +		    zone->capacity < zone->len) {
>> +			pr_warn("%s: Invalid zoned size with non power of 2 zone size and zone capacity < zone size",
>> +				disk->disk_name);
>
> Very long... What about:
>
> pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
> 	disk->disk_name);
>
This looks better. I will fix it up!
>>  			return -ENODEV;
>>  		}
>>
>>  		args->zone_sectors = zone->len;
>> -		args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
>> +		/*
>> +		 * Division is used to calculate nr_zones for both power_of_2
>> +		 * and non power_of_2 zone sizes as it is not in the hot path.
>> +		 */
>
> This comment is not very useful.
>

I also saw you mentioning the comment was not useful in nvme code for
a similar scenario. Hannes brought up a point about making it
explicit when we are not using any special path for power of 2 zone sizes as in
most cases we do branching if the zone size is power of 2 to avoid division.

>> +		args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
>>  	} else if (zone->start + args->zone_sectors < capacity) {
>>  		if (zone->len != args->zone_sectors) {
>>  			pr_warn("%s: Invalid zoned device with non constant zone size\n",
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 22fe512ee..32d7bd7b1 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -686,6 +686,22 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>>  	return div64_u64(sector, zone_sectors);
>>  }
>>
>> +static inline bool blk_queue_is_zone_start(struct request_queue *q, sector_t sec)
>> +{
>> +	sector_t zone_sectors = blk_queue_zone_sectors(q);
>> +	u64 remainder = 0;
>> +
>> +	if (!blk_queue_is_zoned(q))
>> +		return false;
>> +
>> +	if (is_power_of_2(zone_sectors))
>> +		return IS_ALIGNED(sec, zone_sectors);
>> +
>> +	div64_u64_rem(sec, zone_sectors, &remainder);
>> +	/* if there is a remainder, then the sector is not aligned */
>
> Hmmm... Fairly obvious. Not sure this comment is useful.
>
True. I will remove it.

>> +	return remainder == 0;
>> +}
>> +
>>  static inline bool blk_queue_zone_is_seq(struct request_queue *q,
>>  					 sector_t sector)
>>  {
>> @@ -732,6 +748,12 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>>  {
>>  	return 0;
>>  }
>> +
>> +static inline bool blk_queue_is_zone_start(struct request_queue *q, sector_t sec)
>> +{
>> +	return false;
>> +}
>> +
>>  static inline unsigned int queue_max_open_zones(const struct request_queue *q)
>>  {
>>  	return 0;
>
>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 05/13] btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
  2022-05-16 16:54     ` [PATCH v4 05/13] btrfs: zoned: Cache superblock location in btrfs_zoned_device_info Pankaj Raghav
@ 2022-05-16 21:58       ` David Sterba
  2022-05-17  7:55         ` Pankaj Raghav
  0 siblings, 1 reply; 56+ messages in thread
From: David Sterba @ 2022-05-16 21:58 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel, Luis Chamberlain

On Mon, May 16, 2022 at 06:54:08PM +0200, Pankaj Raghav wrote:
> Instead of calculating the superblock location every time, cache the
> superblock zone location in btrfs_zoned_device_info struct and use it to
> locate the zone index.
> 
> The functions such as btrfs_sb_log_location_bdev() and
> btrfs_reset_sb_log_zones() which work directly on block_device shall
> continue to use the sb_zone_number because btrfs_zoned_device_info
> struct might not have been initialized at that point.
> 
> This patch will enable non power-of-2 zoned devices to not perform
> division to lookup superblock and its mirror location.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/btrfs/zoned.c | 13 +++++++++----
>  fs/btrfs/zoned.h |  1 +
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 06f22c021..e8c7cebb2 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -511,6 +511,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
>  			   max_active_zones - nactive);
>  	}
>  
> +	/* Cache the sb zone number */
> +	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
> +		zone_info->sb_zone_location[i] =
> +			sb_zone_number(zone_info->zone_size_shift, i);
> +	}

I don't think we need to cache the value right now, it's not in any hot
path and call to bdev_zone_no is relatively cheap (only dereferencing a
few pointers, all in-memory values).

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 11/13] null_blk: allow non power of 2 zoned devices
  2022-05-16 16:54     ` [PATCH v4 11/13] null_blk: " Pankaj Raghav
@ 2022-05-17  4:12       ` kernel test robot
  0 siblings, 0 replies; 56+ messages in thread
From: kernel test robot @ 2022-05-17  4:12 UTC (permalink / raw)
  To: Pankaj Raghav, axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: kbuild-all, linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365,
	linux-block, gost.dev, p.raghav, linux-kernel, dm-devel,
	Luis Chamberlain, Hannes Reinecke

Hi Pankaj,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20220516]
[also build test ERROR on v5.18-rc7]
[cannot apply to axboe-block/for-next kdave/for-next device-mapper-dm/for-next linus/master v5.18-rc7 v5.18-rc6 v5.18-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Pankaj-Raghav/block-make-blkdev_nr_zones-and-blk_queue_zone_no-generic-for-npo2-zsze/20220517-015435
base:    3f7bdc402fb06f897ff1f492a2d42e1f7c2efedb
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20220517/202205171251.o9RZslnw-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/1fc6f61076425c29e51cfa376f8dee4dbf588ed1
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Pankaj-Raghav/block-make-blkdev_nr_zones-and-blk_queue_zone_no-generic-for-npo2-zsze/20220517-015435
        git checkout 1fc6f61076425c29e51cfa376f8dee4dbf588ed1
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   m68k-linux-ld: drivers/block/null_blk/zoned.o: in function `null_init_zoned_dev':
>> zoned.c:(.text+0x9c6): undefined reference to `__udivdi3'
   `.exit.text' referenced in section `.data' of sound/soc/codecs/tlv320adc3xxx.o: defined in discarded section `.exit.text' of sound/soc/codecs/tlv320adc3xxx.o

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-16 16:54     ` [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets Pankaj Raghav
@ 2022-05-17  6:50       ` Johannes Thumshirn
  2022-05-17  8:00         ` Pankaj Raghav
  2022-05-17 12:42       ` David Sterba
  2022-05-19  7:59       ` Naohiro Aota
  2 siblings, 1 reply; 56+ messages in thread
From: Johannes Thumshirn @ 2022-05-17  6:50 UTC (permalink / raw)
  To: Pankaj Raghav, axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, linux-kernel, dm-devel

On 16/05/2022 18:55, Pankaj Raghav wrote:
> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
> These are fixed at these locations so that recovery tools can reliably
> retrieve the superblocks even if one of the mirror gets corrupted.
> 
> power of 2 zone sizes align at these offsets irrespective of their
> value but non power of 2 zone sizes will not align.
> 
> To make sure the first zone at mirror 1 and mirror 2 align, write zero
> operation is performed to move the write pointer of the first zone to
> the expected offset. This operation is performed only after a zone reset
> of the first zone, i.e., when the second zone that contains the sb is FULL.

Hi Pankaj, stupid question. Npo2 devices still have a zone size being a 
multiple of 4k don't they?

If not, we'd need to also have a tail padding of the superblock zones, in order
to move the WP of these zones to the end, so the sb-log states match up.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 06/13] btrfs: zoned: Make sb_zone_number function non power of 2 compatible
  2022-05-16 16:54     ` [PATCH v4 06/13] btrfs: zoned: Make sb_zone_number function non power of 2 compatible Pankaj Raghav
@ 2022-05-17  6:53       ` Johannes Thumshirn
  2022-05-17 11:51         ` David Sterba
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Thumshirn @ 2022-05-17  6:53 UTC (permalink / raw)
  To: Pankaj Raghav, axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, linux-kernel, dm-devel, Luis Chamberlain

On 16/05/2022 18:54, Pankaj Raghav wrote:
>  	/* Cache the sb zone number */
>  	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
>  		zone_info->sb_zone_location[i] =
> -			sb_zone_number(zone_info->zone_size_shift, i);
> +			sb_zone_number(bdev, i);

I think this easily fits on one line now, doesn't it? But given David's
statement, it'll probably can go away anyways.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 05/13] btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
  2022-05-16 21:58       ` David Sterba
@ 2022-05-17  7:55         ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-17  7:55 UTC (permalink / raw)
  To: dsterba
  Cc: pankydev8, dsterba, hch, linux-nvme, linux-fsdevel, linux-btrfs,
	jiangbo.365, linux-block, gost.dev, linux-kernel, dm-devel,
	Luis Chamberlain

On 2022-05-16 23:58, David Sterba wrote:
>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>>  fs/btrfs/zoned.c | 13 +++++++++----
>>  fs/btrfs/zoned.h |  1 +
>>  2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>> index 06f22c021..e8c7cebb2 100644
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -511,6 +511,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
>>  			   max_active_zones - nactive);
>>  	}
>>  
>> +	/* Cache the sb zone number */
>> +	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
>> +		zone_info->sb_zone_location[i] =
>> +			sb_zone_number(zone_info->zone_size_shift, i);
>> +	}
> 
> I don't think we need to cache the value right now, it's not in any hot
> path and call to bdev_zone_no is relatively cheap (only dereferencing a
> few pointers, all in-memory values).
Ok. I will fix it up in the next revision! Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-17  6:50       ` Johannes Thumshirn
@ 2022-05-17  8:00         ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-17  8:00 UTC (permalink / raw)
  To: Johannes Thumshirn, axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, linux-kernel, dm-devel

On 2022-05-17 08:50, Johannes Thumshirn wrote:
> On 16/05/2022 18:55, Pankaj Raghav wrote:
>> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
>> These are fixed at these locations so that recovery tools can reliably
>> retrieve the superblocks even if one of the mirror gets corrupted.
>>
>> power of 2 zone sizes align at these offsets irrespective of their
>> value but non power of 2 zone sizes will not align.
>>
>> To make sure the first zone at mirror 1 and mirror 2 align, write zero
>> operation is performed to move the write pointer of the first zone to
>> the expected offset. This operation is performed only after a zone reset
>> of the first zone, i.e., when the second zone that contains the sb is FULL.
> 
> Hi Pankaj, stupid question. Npo2 devices still have a zone size being a 
> multiple of 4k don't they?
> 
> If not, we'd need to also have a tail padding of the superblock zones, in order
> to move the WP of these zones to the end, so the sb-log states match up.
Hi Johannes,
NPO2 zoned devices has a minimum alignment requirement of 1MB. See
commit: `btrfs: zoned: relax the alignment constraint for zoned devices`

It will naturally align to 4k. So I don't think we need special handling
there with tail padding.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-16 16:54 ` [PATCH v4 00/13] support non power of 2 zoned devices Pankaj Raghav
                     ` (12 preceding siblings ...)
       [not found]   ` <CGME20220516165436eucas1p178d079302dae3a9fca696b13b0390deb@eucas1p1.samsung.com>
@ 2022-05-17  8:10   ` Christoph Hellwig
  2022-05-17  9:18     ` Javier González
  2022-05-17 15:34     ` [dm-devel] " Theodore Ts'o
  13 siblings, 2 replies; 56+ messages in thread
From: Christoph Hellwig @ 2022-05-17  8:10 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel

I'm a little surprised about all this activity.

I though the conclusion at LSF/MM was that for Linux itself there
is very little benefit in supporting this scheme.  It will massively
fragment the supported based of devices and applications, while only
having the benefit of supporting some Samsung legacy devices.

So my impression was that this work, while technically feasible, is
rather useless.  So unless I missed something important I have no
interest in supporting this in NVMe.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-17  8:10   ` [PATCH v4 00/13] support non power of 2 zoned devices Christoph Hellwig
@ 2022-05-17  9:18     ` Javier González
  2022-05-18  8:00       ` Christoph Hellwig
  2022-05-17 15:34     ` [dm-devel] " Theodore Ts'o
  1 sibling, 1 reply; 56+ messages in thread
From: Javier González @ 2022-05-17  9:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, axboe, damien.lemoal, pankydev8, dsterba,
	linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, linux-kernel, dm-devel

On 17.05.2022 10:10, Christoph Hellwig wrote:
>I'm a little surprised about all this activity.
>
>I though the conclusion at LSF/MM was that for Linux itself there
>is very little benefit in supporting this scheme.  It will massively
>fragment the supported based of devices and applications, while only
>having the benefit of supporting some Samsung legacy devices.

I believed we had agreed that non-power-of-2 zoned devices was something
to explore. Let me summarize the 3 main points we covered at different
times at LSF/MM:

   - This is not for legacy Samsung ZNS devices. At least 4 other
     vendors have reported building non-power-of-2 ZNS devices to meet
     customer demands on removing holes in the address space. It seems
     like there will be more ZNS devices with size=capacity out there
     than with PO2 sizes. Block device and FS support is very desirable
     for these.

   - We also talked about how the capacity not being a PO2 is the one
     introducing the fragmentation, as applications that already worked
     with SMR HDDs will have to change their data placement policy. The
     size is just a construction, but the real work is adopting the
     capacity.

   - Besides the previous poit, the fragmentation will happen from the
     moment we have available devices. This is not a kernel-only issue.
     We have SMR, ZNS, and soon another spec for zone devices. I
     understood that as long as we do not break any existing support, we
     would be able to expend the zoned ecosystem in Linux.

>So my impression was that this work, while technically feasible, is
>rather useless.  So unless I missed something important I have no
>interest in supporting this in NVMe.

Does the above help you reconsidering your interest in supporting this
in NVMe?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 06/13] btrfs: zoned: Make sb_zone_number function non power of 2 compatible
  2022-05-17  6:53       ` Johannes Thumshirn
@ 2022-05-17 11:51         ` David Sterba
  0 siblings, 0 replies; 56+ messages in thread
From: David Sterba @ 2022-05-17 11:51 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Pankaj Raghav, axboe, damien.lemoal, pankydev8, dsterba, hch,
	linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, linux-kernel, dm-devel, Luis Chamberlain

On Tue, May 17, 2022 at 06:53:40AM +0000, Johannes Thumshirn wrote:
> On 16/05/2022 18:54, Pankaj Raghav wrote:
> >  	/* Cache the sb zone number */
> >  	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
> >  		zone_info->sb_zone_location[i] =
> > -			sb_zone_number(zone_info->zone_size_shift, i);
> > +			sb_zone_number(bdev, i);
> 
> I think this easily fits on one line now, doesn't it? But given David's
> statement, it'll probably can go away anyways.

I agree that the formatting can be adjusted, but I'm never sure if I
should point it out during the phase of functional changes so it's fixed
in the next iteration as well, or not to point it out to avoid fixups
that would go away anyway. I think I've seen more, a styling pass will
be done anyway once we're close to the final version..

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 07/13] btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned devices
  2022-05-16 16:54     ` [PATCH v4 07/13] btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned devices Pankaj Raghav
@ 2022-05-17 12:30       ` David Sterba
  2022-05-18  9:40         ` Pankaj Raghav
  2022-05-19  4:13       ` Naohiro Aota
  1 sibling, 1 reply; 56+ messages in thread
From: David Sterba @ 2022-05-17 12:30 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel, Luis Chamberlain

On Mon, May 16, 2022 at 06:54:10PM +0200, Pankaj Raghav wrote:
> Add helpers to calculate alignment, round up and round down
> for zoned devices. These helpers encapsulates the necessary handling for
> power_of_2 and non-power_of_2 zone sizes. Optimized calculations are
> performed for zone sizes that are power_of_2 with log and shifts.
> 
> btrfs_zoned_is_aligned() is added instead of reusing bdev_zone_aligned()
> helper due to some use cases in btrfs where zone alignment is checked
> before having access to the underlying block device such as in this
> function: btrfs_load_block_group_zone_info().
> 
> Use the generic btrfs zone helpers to calculate zone index, check zone
> alignment, round up and round down operations.
> 
> The zone_size_shift field is not needed anymore as generic helpers are
> used for calculation.

Overall this looks reasonable to me.

> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/btrfs/volumes.c | 24 +++++++++-------
>  fs/btrfs/zoned.c   | 72 ++++++++++++++++++++++------------------------
>  fs/btrfs/zoned.h   | 43 +++++++++++++++++++++++----
>  3 files changed, 85 insertions(+), 54 deletions(-)
> 
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1108,14 +1101,14 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
>  int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
>  {
>  	struct btrfs_zoned_device_info *zinfo = device->zone_info;
> -	const u8 shift = zinfo->zone_size_shift;
> -	unsigned long begin = start >> shift;
> -	unsigned long end = (start + size) >> shift;
> +	unsigned long begin = bdev_zone_no(device->bdev, start >> SECTOR_SHIFT);
> +	unsigned long end =
> +		bdev_zone_no(device->bdev, (start + size) >> SECTOR_SHIFT);

There are unsinged long types here though I'd rather see u64, better for
a separate patch. Fixed width types are cleaner here and in the zoned
code as there's always some conversion to/from sectors.

>  	u64 pos;
>  	int ret;
>  
> -	ASSERT(IS_ALIGNED(start, zinfo->zone_size));
> -	ASSERT(IS_ALIGNED(size, zinfo->zone_size));
> +	ASSERT(btrfs_zoned_is_aligned(start, zinfo->zone_size));
> +	ASSERT(btrfs_zoned_is_aligned(size, zinfo->zone_size));
>  
>  	if (end > zinfo->nr_zones)
>  		return -ERANGE;
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -30,6 +30,36 @@ struct btrfs_zoned_device_info {
>  	u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
>  };
>  
> +static inline bool btrfs_zoned_is_aligned(u64 pos, u64 zone_size)
> +{
> +	u64 remainder = 0;
> +
> +	if (is_power_of_two_u64(zone_size))
> +		return IS_ALIGNED(pos, zone_size);
> +
> +	div64_u64_rem(pos, zone_size, &remainder);
> +	return remainder == 0;
> +}
> +
> +static inline u64 btrfs_zoned_roundup(u64 pos, u64 zone_size)
> +{
> +	if (is_power_of_two_u64(zone_size))
> +		return ALIGN(pos, zone_size);

Please use round_up as the rounddown helper uses round_down

> +
> +	return div64_u64(pos + zone_size - 1, zone_size) * zone_size;
> +}
> +
> +static inline u64 btrfs_zoned_rounddown(u64 pos, u64 zone_size)
> +{
> +	u64 remainder = 0;
> +	if (is_power_of_two_u64(zone_size))
> +		return round_down(pos, zone_size);
> +
> +	div64_u64_rem(pos, zone_size, &remainder);
> +	pos -= remainder;
> +	return pos;
> +}
> +
>  #ifdef CONFIG_BLK_DEV_ZONED
>  int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>  		       struct blk_zone *zone);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-16 16:54     ` [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets Pankaj Raghav
  2022-05-17  6:50       ` Johannes Thumshirn
@ 2022-05-17 12:42       ` David Sterba
  2022-05-18  9:15         ` Pankaj Raghav
  2022-05-19  7:59       ` Naohiro Aota
  2 siblings, 1 reply; 56+ messages in thread
From: David Sterba @ 2022-05-17 12:42 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel

On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
> These are fixed at these locations so that recovery tools can reliably
> retrieve the superblocks even if one of the mirror gets corrupted.
> 
> power of 2 zone sizes align at these offsets irrespective of their
> value but non power of 2 zone sizes will not align.
> 
> To make sure the first zone at mirror 1 and mirror 2 align, write zero
> operation is performed to move the write pointer of the first zone to
> the expected offset. This operation is performed only after a zone reset
> of the first zone, i.e., when the second zone that contains the sb is FULL.

Is it a good idea to do the "write zeros", instead of a plain "set write
pointer"? I assume setting write pointer is instant, while writing
potentially hundreds of megabytes may take significiant time. As the
functions may be called from random contexts, the increased time may
become a problem.

> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 63 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 3023c871e..805aeaa76 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>  	return 0;
>  }
>  
> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
> +			     int mirror, u64 *wp_ret)
> +{
> +	u64 offset = 0;
> +	int ret = 0;
> +
> +	ASSERT(!is_power_of_two_u64(zone->len));
> +	ASSERT(zone->wp == zone->start);
> +	ASSERT(mirror != 0);

This could simply accept 0 as the mirror offset too, the calculation is
trivial.

> +
> +	switch (mirror) {
> +	case 1:
> +		div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
> +			      zone->len, &offset);
> +		break;
> +	case 2:
> +		div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
> +			      zone->len, &offset);
> +		break;
> +	}
> +
> +	ret =  blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
> +	if (ret)
> +		return ret;
> +
> +	zone->wp += offset;
> +	zone->cond = BLK_ZONE_COND_IMP_OPEN;
> +	*wp_ret = zone->wp << SECTOR_SHIFT;
> +
> +	return 0;
> +}
> +
>  static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
> -			   int rw, u64 *bytenr_ret)
> +			   int rw, int mirror, u64 *bytenr_ret)
>  {
>  	u64 wp;
>  	int ret;
> +	bool zones_empty = false;
>  
>  	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
>  		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
> @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>  	if (ret != -ENOENT && ret < 0)
>  		return ret;
>  
> +	if (ret == -ENOENT)
> +		zones_empty = true;
> +
>  	if (rw == WRITE) {
>  		struct blk_zone *reset = NULL;
> +		bool is_sb_offset_write_req = false;
> +		u32 reset_zone_nr = -1;
>  
> -		if (wp == zones[0].start << SECTOR_SHIFT)
> +		if (wp == zones[0].start << SECTOR_SHIFT) {
>  			reset = &zones[0];
> -		else if (wp == zones[1].start << SECTOR_SHIFT)
> +			reset_zone_nr = 0;
> +		} else if (wp == zones[1].start << SECTOR_SHIFT) {
>  			reset = &zones[1];
> +			reset_zone_nr = 1;
> +		}
> +
> +		/*
> +		 * Non po2 zone sizes will not align naturally at
> +		 * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
> +		 * 1st zone in those superblock mirrors need to be
> +		 * moved to align at those offsets.
> +		 */

Please move this comment to the helper fill_sb_wp_offset itself, there
it's more discoverable.

> +		is_sb_offset_write_req =
> +			(zones_empty || (reset_zone_nr == 0)) && mirror &&
> +			!is_power_of_2(zones[0].len);

Accepting 0 as the mirror number would also get rid of this wild
expression substituting and 'if'.

>  
>  		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
>  			ASSERT(sb_zone_is_full(reset));
> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>  			reset->cond = BLK_ZONE_COND_EMPTY;
>  			reset->wp = reset->start;
>  		}
> +
> +		if (is_sb_offset_write_req) {

And get rid of the conditional. The point of supporting both po2 and
nonpo2 is to hide any implementation details to wrappers as much as
possible.

> +			ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
> +			if (ret)
> +				return ret;
> +		}
> +
>  	} else if (ret != -ENOENT) {
>  		/*
>  		 * For READ, we want the previous one. Move write pointer to

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 04/13] nvmet: Allow ZNS target to support non-power_of_2 zone sizes
  2022-05-16 16:54     ` [PATCH v4 04/13] nvmet: Allow ZNS target to support non-power_of_2 zone sizes Pankaj Raghav
@ 2022-05-17 14:19       ` Johannes Thumshirn
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Thumshirn @ 2022-05-17 14:19 UTC (permalink / raw)
  To: Pankaj Raghav, axboe, damien.lemoal, pankydev8, dsterba, hch
  Cc: linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, linux-kernel, dm-devel, Bart Van Assche,
	Hannes Reinecke, Luis Chamberlain

I think this is useful even without the rest of the npo2 patches.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-17  8:10   ` [PATCH v4 00/13] support non power of 2 zoned devices Christoph Hellwig
  2022-05-17  9:18     ` Javier González
@ 2022-05-17 15:34     ` Theodore Ts'o
  2022-05-18 23:06       ` Luis Chamberlain
  2022-05-19  3:08       ` Damien Le Moal
  1 sibling, 2 replies; 56+ messages in thread
From: Theodore Ts'o @ 2022-05-17 15:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, axboe, pankydev8, gost.dev, damien.lemoal,
	jiangbo.365, linux-nvme, linux-kernel, linux-block,
	linux-fsdevel, dm-devel, dsterba, linux-btrfs

On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
> I'm a little surprised about all this activity.
> 
> I though the conclusion at LSF/MM was that for Linux itself there
> is very little benefit in supporting this scheme.  It will massively
> fragment the supported based of devices and applications, while only
> having the benefit of supporting some Samsung legacy devices.

FWIW,

That wasn't my impression from that LSF/MM session, but once the
videos become available, folks can decide for themselves.

       	      		       	   	  - Ted

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-17  9:18     ` Javier González
@ 2022-05-18  8:00       ` Christoph Hellwig
  2022-05-19 15:25         ` Javier González
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2022-05-18  8:00 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Pankaj Raghav, axboe, damien.lemoal,
	pankydev8, dsterba, linux-nvme, linux-fsdevel, linux-btrfs,
	jiangbo.365, linux-block, gost.dev, linux-kernel, dm-devel

On Tue, May 17, 2022 at 11:18:34AM +0200, Javier González wrote:
> Does the above help you reconsidering your interest in supporting this
> in NVMe?

Very little.  It just seems like a really bad idea.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-17 12:42       ` David Sterba
@ 2022-05-18  9:15         ` Pankaj Raghav
  2022-05-19  7:57           ` Johannes Thumshirn
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-18  9:15 UTC (permalink / raw)
  To: dsterba
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel

On 2022-05-17 14:42, David Sterba wrote:
> On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
>> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
>> These are fixed at these locations so that recovery tools can reliably
>> retrieve the superblocks even if one of the mirror gets corrupted.
>>
>> power of 2 zone sizes align at these offsets irrespective of their
>> value but non power of 2 zone sizes will not align.
>>
>> To make sure the first zone at mirror 1 and mirror 2 align, write zero
>> operation is performed to move the write pointer of the first zone to
>> the expected offset. This operation is performed only after a zone reset
>> of the first zone, i.e., when the second zone that contains the sb is FULL.
> 
> Is it a good idea to do the "write zeros", instead of a plain "set write
> pointer"? I assume setting write pointer is instant, while writing
> potentially hundreds of megabytes may take significiant time. As the
> functions may be called from random contexts, the increased time may
> become a problem.
> 
Unfortunately it is not possible to just move the WP in zoned devices.
The only alternative that I could use is to do write zeroes which are
natively supported by some devices such as ZNS. It would be nice to know
if someone had a better solution to this instead of doing write zeroes
in zoned devices.

>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>>  fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 63 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
>> index 3023c871e..805aeaa76 100644
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>>  	return 0;
>>  }
>>  
>> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
>> +			     int mirror, u64 *wp_ret)
>> +{
>> +	u64 offset = 0;
>> +	int ret = 0;
>> +
>> +	ASSERT(!is_power_of_two_u64(zone->len));
>> +	ASSERT(zone->wp == zone->start);
>> +	ASSERT(mirror != 0);
> 
> This could simply accept 0 as the mirror offset too, the calculation is
> trivial.
> 
Ok. I will fix it up!
>> +
>> +	switch (mirror) {
>> +	case 1:
>> +		div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
>> +			      zone->len, &offset);
>> +		break;
>> +	case 2:
>> +		div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
>> +			      zone->len, &offset);
>> +		break;
>> +	}
>> +
>> +	ret =  blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
>> +	if (ret)
>> +		return ret;
>> +
>> +		/*
>> +		 * Non po2 zone sizes will not align naturally at
>> +		 * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
>> +		 * 1st zone in those superblock mirrors need to be
>> +		 * moved to align at those offsets.
>> +		 */
> 
> Please move this comment to the helper fill_sb_wp_offset itself, there
> it's more discoverable.
> 
Ok.
>> +		is_sb_offset_write_req =
>> +			(zones_empty || (reset_zone_nr == 0)) && mirror &&
>> +			!is_power_of_2(zones[0].len);
> 
> Accepting 0 as the mirror number would also get rid of this wild
> expression substituting and 'if'.
> 
>>  
>>  		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
>>  			ASSERT(sb_zone_is_full(reset));
>> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>>  			reset->cond = BLK_ZONE_COND_EMPTY;
>>  			reset->wp = reset->start;
>>  		}
>> +
>> +		if (is_sb_offset_write_req) {
> 
> And get rid of the conditional. The point of supporting both po2 and
> nonpo2 is to hide any implementation details to wrappers as much as
> possible.
> 
Alright. I will move the logic to the wrapper instead of having the
conditional in this function.
>> +			ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +
>>  	} else if (ret != -ENOENT) {
>>  		/*
>>  		 * For READ, we want the previous one. Move write pointer to
Thanks for your comments.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 07/13] btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned devices
  2022-05-17 12:30       ` David Sterba
@ 2022-05-18  9:40         ` Pankaj Raghav
  2022-05-18 11:21           ` David Sterba
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-18  9:40 UTC (permalink / raw)
  To: dsterba
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel, Luis Chamberlain

On 2022-05-17 14:30, David Sterba wrote:
> On Mon, May 16, 2022 at 06:54:10PM +0200, Pankaj Raghav wrote:
>> Add helpers to calculate alignment, round up and round down
>> for zoned devices. These helpers encapsulates the necessary handling for
>> power_of_2 and non-power_of_2 zone sizes. Optimized calculations are
>> performed for zone sizes that are power_of_2 with log and shifts.
>>
>> btrfs_zoned_is_aligned() is added instead of reusing bdev_zone_aligned()
>> helper due to some use cases in btrfs where zone alignment is checked
>> before having access to the underlying block device such as in this
>> function: btrfs_load_block_group_zone_info().
>>
>> Use the generic btrfs zone helpers to calculate zone index, check zone
>> alignment, round up and round down operations.
>>
>> The zone_size_shift field is not needed anymore as generic helpers are
>> used for calculation.
> 
> Overall this looks reasonable to me.
> 
>> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>>  fs/btrfs/volumes.c | 24 +++++++++-------
>>  fs/btrfs/zoned.c   | 72 ++++++++++++++++++++++------------------------
>>  fs/btrfs/zoned.h   | 43 +++++++++++++++++++++++----
>>  3 files changed, 85 insertions(+), 54 deletions(-)
>>
>> --- a/fs/btrfs/zoned.c
>> +++ b/fs/btrfs/zoned.c
>> @@ -1108,14 +1101,14 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
>>  int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
>>  {
>>  	struct btrfs_zoned_device_info *zinfo = device->zone_info;
>> -	const u8 shift = zinfo->zone_size_shift;
>> -	unsigned long begin = start >> shift;
>> -	unsigned long end = (start + size) >> shift;
>> +	unsigned long begin = bdev_zone_no(device->bdev, start >> SECTOR_SHIFT);
>> +	unsigned long end =
>> +		bdev_zone_no(device->bdev, (start + size) >> SECTOR_SHIFT);
> 
> There are unsinged long types here though I'd rather see u64, better for
> a separate patch. Fixed width types are cleaner here and in the zoned
> code as there's always some conversion to/from sectors.
> 
Ok. I will probably send a separate patch to convert them to fix width
types. Is it ok if I do it as a separate patch instead of including it
in this series?
>>  	u64 pos;
>>  	int ret;
>>  
>> -	ASSERT(IS_ALIGNED(start, zinfo->zone_size));
>> -	ASSERT(IS_ALIGNED(size, zinfo->zone_size));
>> +	ASSERT(btrfs_zoned_is_aligned(start, zinfo->zone_size));
>> +	ASSERT(btrfs_zoned_is_aligned(size, zinfo->zone_size));
>>  
>>  	if (end > zinfo->nr_zones)
>>  		return -ERANGE;
>> --- a/fs/btrfs/zoned.h
>> +++ b/fs/btrfs/zoned.h
>> @@ -30,6 +30,36 @@ struct btrfs_zoned_device_info {
>>  	u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
>>  };
>>  
>> +static inline bool btrfs_zoned_is_aligned(u64 pos, u64 zone_size)
>> +{
>> +	u64 remainder = 0;
>> +
>> +	if (is_power_of_two_u64(zone_size))
>> +		return IS_ALIGNED(pos, zone_size);
>> +
>> +	div64_u64_rem(pos, zone_size, &remainder);
>> +	return remainder == 0;
>> +}
>> +
>> +static inline u64 btrfs_zoned_roundup(u64 pos, u64 zone_size)
>> +{
>> +	if (is_power_of_two_u64(zone_size))
>> +		return ALIGN(pos, zone_size);
> 
> Please use round_up as the rounddown helper uses round_down
> 
Ah, good catch. I will use it instead. Thanks.
>> +
>> +	return div64_u64(pos + zone_size - 1, zone_size) * zone_size;
>> +}
>> +
>> +static inline u64 btrfs_zoned_rounddown(u64 pos, u64 zone_size)
>> +{
>> +	u64 remainder = 0;
>> +	if (is_power_of_two_u64(zone_size))
>> +		return round_down(pos, zone_size);
>> +
>> +	div64_u64_rem(pos, zone_size, &remainder);
>> +	pos -= remainder;
>> +	return pos;
>> +}
>> +
>>  #ifdef CONFIG_BLK_DEV_ZONED
>>  int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>  		       struct blk_zone *zone);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 07/13] btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned devices
  2022-05-18  9:40         ` Pankaj Raghav
@ 2022-05-18 11:21           ` David Sterba
  0 siblings, 0 replies; 56+ messages in thread
From: David Sterba @ 2022-05-18 11:21 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: dsterba, axboe, damien.lemoal, pankydev8, dsterba, hch,
	linux-nvme, linux-fsdevel, linux-btrfs, jiangbo.365, linux-block,
	gost.dev, linux-kernel, dm-devel, Luis Chamberlain

On Wed, May 18, 2022 at 11:40:22AM +0200, Pankaj Raghav wrote:
> On 2022-05-17 14:30, David Sterba wrote:
> > On Mon, May 16, 2022 at 06:54:10PM +0200, Pankaj Raghav wrote:
> >> @@ -1108,14 +1101,14 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
> >>  int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
> >>  {
> >>  	struct btrfs_zoned_device_info *zinfo = device->zone_info;
> >> -	const u8 shift = zinfo->zone_size_shift;
> >> -	unsigned long begin = start >> shift;
> >> -	unsigned long end = (start + size) >> shift;
> >> +	unsigned long begin = bdev_zone_no(device->bdev, start >> SECTOR_SHIFT);
> >> +	unsigned long end =
> >> +		bdev_zone_no(device->bdev, (start + size) >> SECTOR_SHIFT);
> > 
> > There are unsinged long types here though I'd rather see u64, better for
> > a separate patch. Fixed width types are cleaner here and in the zoned
> > code as there's always some conversion to/from sectors.
> > 
> Ok. I will probably send a separate patch to convert them to fix width
> types. Is it ok if I do it as a separate patch instead of including it
> in this series?

Yes, it's a cleanup for later, not directly introduced or affecting this
patchset. I've checked zoned.c, in btrfs_ensure_empty_zones it's the
only instance so it's not some widespread problem.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-17 15:34     ` [dm-devel] " Theodore Ts'o
@ 2022-05-18 23:06       ` Luis Chamberlain
  2022-05-19  3:08       ` Damien Le Moal
  1 sibling, 0 replies; 56+ messages in thread
From: Luis Chamberlain @ 2022-05-18 23:06 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christoph Hellwig, Pankaj Raghav, axboe, pankydev8, gost.dev,
	damien.lemoal, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs

On Tue, May 17, 2022 at 11:34:54AM -0400, Theodore Ts'o wrote:
> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
> > I'm a little surprised about all this activity.
> > 
> > I though the conclusion at LSF/MM was that for Linux itself there
> > is very little benefit in supporting this scheme.  It will massively
> > fragment the supported based of devices and applications, while only
> > having the benefit of supporting some Samsung legacy devices.
> 
> FWIW,
> 
> That wasn't my impression from that LSF/MM session, but once the
> videos become available, folks can decide for themselves.

Agreed, contrary to conventional storage devices, with the zone storage
ecosystem we simply have a requirement of zone drive replacements matching
zone size. That requirement exists for po2 or npo2. The work in this patch
set proves that supporting npo2 was in the end straight forward. As the one
putting together the BoF I can say that there were no sticking points raised
to move forward with this when the topic came up. So I am very surprised to
hear about any other perceived conclusion.

  Luis

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-17 15:34     ` [dm-devel] " Theodore Ts'o
  2022-05-18 23:06       ` Luis Chamberlain
@ 2022-05-19  3:08       ` Damien Le Moal
  2022-05-19  3:12         ` Luis Chamberlain
  1 sibling, 1 reply; 56+ messages in thread
From: Damien Le Moal @ 2022-05-19  3:08 UTC (permalink / raw)
  To: Theodore Ts'o, Christoph Hellwig
  Cc: Pankaj Raghav, axboe, pankydev8, gost.dev, jiangbo.365,
	linux-nvme, linux-kernel, linux-block, linux-fsdevel, dm-devel,
	dsterba, linux-btrfs

On 5/18/22 00:34, Theodore Ts'o wrote:
> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>> I'm a little surprised about all this activity.
>>
>> I though the conclusion at LSF/MM was that for Linux itself there
>> is very little benefit in supporting this scheme.  It will massively
>> fragment the supported based of devices and applications, while only
>> having the benefit of supporting some Samsung legacy devices.
> 
> FWIW,
> 
> That wasn't my impression from that LSF/MM session, but once the
> videos become available, folks can decide for themselves.

There was no real discussion about zone size constraint on the zone
storage BoF. Many discussions happened in the hallway track though.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-19  3:08       ` Damien Le Moal
@ 2022-05-19  3:12         ` Luis Chamberlain
  2022-05-19  3:19           ` Damien Le Moal
  0 siblings, 1 reply; 56+ messages in thread
From: Luis Chamberlain @ 2022-05-19  3:12 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Theodore Ts'o, Christoph Hellwig, Pankaj Raghav, axboe,
	pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs

On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
> On 5/18/22 00:34, Theodore Ts'o wrote:
> > On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
> >> I'm a little surprised about all this activity.
> >>
> >> I though the conclusion at LSF/MM was that for Linux itself there
> >> is very little benefit in supporting this scheme.  It will massively
> >> fragment the supported based of devices and applications, while only
> >> having the benefit of supporting some Samsung legacy devices.
> > 
> > FWIW,
> > 
> > That wasn't my impression from that LSF/MM session, but once the
> > videos become available, folks can decide for themselves.
> 
> There was no real discussion about zone size constraint on the zone
> storage BoF. Many discussions happened in the hallway track though.

Right so no direct clear blockers mentioned at all during the BoF.

  Luis

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-19  3:12         ` Luis Chamberlain
@ 2022-05-19  3:19           ` Damien Le Moal
  2022-05-19  7:34             ` Johannes Thumshirn
  0 siblings, 1 reply; 56+ messages in thread
From: Damien Le Moal @ 2022-05-19  3:19 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Theodore Ts'o, Christoph Hellwig, Pankaj Raghav, axboe,
	pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs

On 5/19/22 12:12, Luis Chamberlain wrote:
> On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
>> On 5/18/22 00:34, Theodore Ts'o wrote:
>>> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>>>> I'm a little surprised about all this activity.
>>>>
>>>> I though the conclusion at LSF/MM was that for Linux itself there
>>>> is very little benefit in supporting this scheme.  It will massively
>>>> fragment the supported based of devices and applications, while only
>>>> having the benefit of supporting some Samsung legacy devices.
>>>
>>> FWIW,
>>>
>>> That wasn't my impression from that LSF/MM session, but once the
>>> videos become available, folks can decide for themselves.
>>
>> There was no real discussion about zone size constraint on the zone
>> storage BoF. Many discussions happened in the hallway track though.
> 
> Right so no direct clear blockers mentioned at all during the BoF.

Nor any clear OK.

> 
>   Luis


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 07/13] btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned devices
  2022-05-16 16:54     ` [PATCH v4 07/13] btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned devices Pankaj Raghav
  2022-05-17 12:30       ` David Sterba
@ 2022-05-19  4:13       ` Naohiro Aota
  1 sibling, 0 replies; 56+ messages in thread
From: Naohiro Aota @ 2022-05-19  4:13 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel, Luis Chamberlain

On Mon, May 16, 2022 at 06:54:10PM +0200, Pankaj Raghav wrote:
> Add helpers to calculate alignment, round up and round down
> for zoned devices. These helpers encapsulates the necessary handling for
> power_of_2 and non-power_of_2 zone sizes. Optimized calculations are
> performed for zone sizes that are power_of_2 with log and shifts.
> 
> btrfs_zoned_is_aligned() is added instead of reusing bdev_zone_aligned()
> helper due to some use cases in btrfs where zone alignment is checked
> before having access to the underlying block device such as in this
> function: btrfs_load_block_group_zone_info().
> 
> Use the generic btrfs zone helpers to calculate zone index, check zone
> alignment, round up and round down operations.
> 
> The zone_size_shift field is not needed anymore as generic helpers are
> used for calculation.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/btrfs/volumes.c | 24 +++++++++-------
>  fs/btrfs/zoned.c   | 72 ++++++++++++++++++++++------------------------
>  fs/btrfs/zoned.h   | 43 +++++++++++++++++++++++----
>  3 files changed, 85 insertions(+), 54 deletions(-)
> 
><snip>
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index 694ab6d1e..b94ce4d1f 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -9,6 +9,7 @@
>  #include "disk-io.h"
>  #include "block-group.h"
>  #include "btrfs_inode.h"
> +#include "misc.h"
>  
>  #define BTRFS_DEFAULT_RECLAIM_THRESH           			(75)
>  
> @@ -18,7 +19,6 @@ struct btrfs_zoned_device_info {
>  	 * zoned block device.
>  	 */
>  	u64 zone_size;
> -	u8  zone_size_shift;
>  	u32 nr_zones;
>  	unsigned int max_active_zones;
>  	atomic_t active_zones_left;
> @@ -30,6 +30,36 @@ struct btrfs_zoned_device_info {
>  	u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
>  };
>  
> +static inline bool btrfs_zoned_is_aligned(u64 pos, u64 zone_size)
> +{
> +	u64 remainder = 0;
> +
> +	if (is_power_of_two_u64(zone_size))
> +		return IS_ALIGNED(pos, zone_size);
> +
> +	div64_u64_rem(pos, zone_size, &remainder);
> +	return remainder == 0;
> +}
> +
> +static inline u64 btrfs_zoned_roundup(u64 pos, u64 zone_size)
> +{
> +	if (is_power_of_two_u64(zone_size))
> +		return ALIGN(pos, zone_size);
> +
> +	return div64_u64(pos + zone_size - 1, zone_size) * zone_size;
> +}
> +
> +static inline u64 btrfs_zoned_rounddown(u64 pos, u64 zone_size)
> +{
> +	u64 remainder = 0;
> +	if (is_power_of_two_u64(zone_size))
> +		return round_down(pos, zone_size);
> +
> +	div64_u64_rem(pos, zone_size, &remainder);
> +	pos -= remainder;
> +	return pos;
> +}
> +

This is just a preference, but how about naming these helpers not related
to "zoned"? While they take "zone_size" as an argument, it does not do
anything special on zoned things. For my preference, I would take
btrfs_device or btrfs_zoned_device_info for "btrfs_zoned_*" function.

Actually, I was a bit confused seeing the part
below. btrfs_zoned_is_aligned() takes "sector_t" values while the arguments
are often byte granularity address. Yeah, actually sector_t == u64 and the
code does not relies on the unit ... so, it is OK as long as the values are
in the same unit.

    @@ -409,9 +409,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
            }
    
            nr_sectors = bdev_nr_sectors(bdev);
    -       zone_info->zone_size_shift = ilog2(zone_info->zone_size);
    -       zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
    -       if (!IS_ALIGNED(nr_sectors, zone_sectors))
    +       zone_info->nr_zones = bdev_zone_no(bdev, nr_sectors);
    +       if (!btrfs_zoned_is_aligned(nr_sectors, zone_sectors))
                    zone_info->nr_zones++;
    
            max_active_zones = bdev_max_active_zones(bdev);

BTW, aren't these helpers used in other subsystems? Maybe adding these in
include/linux/math64.h or so?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-19  3:19           ` Damien Le Moal
@ 2022-05-19  7:34             ` Johannes Thumshirn
  2022-05-20  3:47               ` Damien Le Moal
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Thumshirn @ 2022-05-19  7:34 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain
  Cc: Theodore Ts'o, Christoph Hellwig, Pankaj Raghav, axboe,
	pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs

On 19/05/2022 05:19, Damien Le Moal wrote:
> On 5/19/22 12:12, Luis Chamberlain wrote:
>> On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
>>> On 5/18/22 00:34, Theodore Ts'o wrote:
>>>> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>>>>> I'm a little surprised about all this activity.
>>>>>
>>>>> I though the conclusion at LSF/MM was that for Linux itself there
>>>>> is very little benefit in supporting this scheme.  It will massively
>>>>> fragment the supported based of devices and applications, while only
>>>>> having the benefit of supporting some Samsung legacy devices.
>>>>
>>>> FWIW,
>>>>
>>>> That wasn't my impression from that LSF/MM session, but once the
>>>> videos become available, folks can decide for themselves.
>>>
>>> There was no real discussion about zone size constraint on the zone
>>> storage BoF. Many discussions happened in the hallway track though.
>>
>> Right so no direct clear blockers mentioned at all during the BoF.
> 
> Nor any clear OK.

So what about creating a device-mapper target, that's taking npo2 drives and
makes them po2 drives for the FS layers? It will be very similar code to 
dm-linear.

After all zoned support for FSes started with a device-mapper (dm-zoned) and 
as the need for a more integrated solution arose, it changed into natiive
support.

And all that is there is simple arithmetic and a bio_clone(), if this is the
slowest part of the stack involving a FS like f2fs or btrfs I'm throwing a
round of anyone's favorite beverage at next year's LSFMM.

Byte,
	Johannes


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-18  9:15         ` Pankaj Raghav
@ 2022-05-19  7:57           ` Johannes Thumshirn
  2022-05-20  9:06             ` Pankaj Raghav
  0 siblings, 1 reply; 56+ messages in thread
From: Johannes Thumshirn @ 2022-05-19  7:57 UTC (permalink / raw)
  To: Pankaj Raghav, dsterba
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel

On 18/05/2022 11:17, Pankaj Raghav wrote:
> On 2022-05-17 14:42, David Sterba wrote:
>> On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
>>> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
>>> These are fixed at these locations so that recovery tools can reliably
>>> retrieve the superblocks even if one of the mirror gets corrupted.
>>>
>>> power of 2 zone sizes align at these offsets irrespective of their
>>> value but non power of 2 zone sizes will not align.
>>>
>>> To make sure the first zone at mirror 1 and mirror 2 align, write zero
>>> operation is performed to move the write pointer of the first zone to
>>> the expected offset. This operation is performed only after a zone reset
>>> of the first zone, i.e., when the second zone that contains the sb is FULL.
>> Is it a good idea to do the "write zeros", instead of a plain "set write
>> pointer"? I assume setting write pointer is instant, while writing
>> potentially hundreds of megabytes may take significiant time. As the
>> functions may be called from random contexts, the increased time may
>> become a problem.
>>
> Unfortunately it is not possible to just move the WP in zoned devices.
> The only alternative that I could use is to do write zeroes which are
> natively supported by some devices such as ZNS. It would be nice to know
> if someone had a better solution to this instead of doing write zeroes
> in zoned devices.
> 

I have another question. In case we need to pad the sb zone with a write
zeros and have a power fail between the write-zeros and the regular 
super-block write, what happens? I know this padding is only done for the
backup super blocks, never the less it can happen and it can happen when
the primary super block is also corrupted.

AFAIU we're then trying to reach out for a backup super block, look at the
write pointer and it only contains zeros but no super block, as only the 
write-zeros has reached the device and not the super block write.

How is this situation handled?

Thanks,
	Johannes

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-16 16:54     ` [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets Pankaj Raghav
  2022-05-17  6:50       ` Johannes Thumshirn
  2022-05-17 12:42       ` David Sterba
@ 2022-05-19  7:59       ` Naohiro Aota
  2022-05-20  9:09         ` Pankaj Raghav
  2 siblings, 1 reply; 56+ messages in thread
From: Naohiro Aota @ 2022-05-19  7:59 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel

On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
> These are fixed at these locations so that recovery tools can reliably
> retrieve the superblocks even if one of the mirror gets corrupted.
> 
> power of 2 zone sizes align at these offsets irrespective of their
> value but non power of 2 zone sizes will not align.
> 
> To make sure the first zone at mirror 1 and mirror 2 align, write zero
> operation is performed to move the write pointer of the first zone to
> the expected offset. This operation is performed only after a zone reset
> of the first zone, i.e., when the second zone that contains the sb is FULL.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 63 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 3023c871e..805aeaa76 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
>  	return 0;
>  }
>  
> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
> +			     int mirror, u64 *wp_ret)
> +{
> +	u64 offset = 0;
> +	int ret = 0;
> +
> +	ASSERT(!is_power_of_two_u64(zone->len));
> +	ASSERT(zone->wp == zone->start);
> +	ASSERT(mirror != 0);
> +
> +	switch (mirror) {
> +	case 1:
> +		div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
> +			      zone->len, &offset);
> +		break;
> +	case 2:
> +		div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
> +			      zone->len, &offset);
> +		break;
> +	}
> +
> +	ret =  blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
> +	if (ret)
> +		return ret;
> +
> +	zone->wp += offset;
> +	zone->cond = BLK_ZONE_COND_IMP_OPEN;
> +	*wp_ret = zone->wp << SECTOR_SHIFT;
> +
> +	return 0;
> +}
> +
>  static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
> -			   int rw, u64 *bytenr_ret)
> +			   int rw, int mirror, u64 *bytenr_ret)
>  {
>  	u64 wp;
>  	int ret;
> +	bool zones_empty = false;
>  
>  	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
>  		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
> @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>  	if (ret != -ENOENT && ret < 0)
>  		return ret;
>  
> +	if (ret == -ENOENT)
> +		zones_empty = true;
> +

I think, we don't need this. We need to issue the zeroout when
zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) after sending
ZONE_RESET if necessary. No?

>  	if (rw == WRITE) {
>  		struct blk_zone *reset = NULL;
> +		bool is_sb_offset_write_req = false;
> +		u32 reset_zone_nr = -1;
>  
> -		if (wp == zones[0].start << SECTOR_SHIFT)
> +		if (wp == zones[0].start << SECTOR_SHIFT) {
>  			reset = &zones[0];
> -		else if (wp == zones[1].start << SECTOR_SHIFT)
> +			reset_zone_nr = 0;
> +		} else if (wp == zones[1].start << SECTOR_SHIFT) {
>  			reset = &zones[1];
> +			reset_zone_nr = 1;
> +		}
> +
> +		/*
> +		 * Non po2 zone sizes will not align naturally at
> +		 * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
> +		 * 1st zone in those superblock mirrors need to be
> +		 * moved to align at those offsets.
> +		 */
> +		is_sb_offset_write_req =
> +			(zones_empty || (reset_zone_nr == 0)) && mirror &&
> +			!is_power_of_2(zones[0].len);
>  
>  		if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
>  			ASSERT(sb_zone_is_full(reset));
> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>  			reset->cond = BLK_ZONE_COND_EMPTY;
>  			reset->wp = reset->start;
>  		}
> +
> +		if (is_sb_offset_write_req) {
> +			ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
> +			if (ret)
> +				return ret;
> +		}
> +
>  	} else if (ret != -ENOENT) {
>  		/*
>  		 * For READ, we want the previous one. Move write pointer to
> @@ -851,7 +909,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
>  	if (ret != BTRFS_NR_SB_LOG_ZONES)
>  		return -EIO;
>  
> -	return sb_log_location(bdev, zones, rw, bytenr_ret);
> +	return sb_log_location(bdev, zones, rw, mirror, bytenr_ret);
>  }
>  
>  int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
> @@ -877,7 +935,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
>  
>  	return sb_log_location(device->bdev,
>  			       &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror],
> -			       rw, bytenr_ret);
> +			       rw, mirror, bytenr_ret);
>  }
>  
>  static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-18  8:00       ` Christoph Hellwig
@ 2022-05-19 15:25         ` Javier González
  0 siblings, 0 replies; 56+ messages in thread
From: Javier González @ 2022-05-19 15:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Javier González, Pankaj Raghav, axboe, damien.lemoal,
	pankydev8, dsterba, linux-nvme, linux-fsdevel, linux-btrfs,
	jiangbo.365, linux-block, gost.dev, linux-kernel, dm-devel


> On 18 May 2022, at 10.16, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Tue, May 17, 2022 at 11:18:34AM +0200, Javier González wrote:
>> Does the above help you reconsidering your interest in supporting this
>> in NVMe?
> 
> Very little.  It just seems like a really bad idea.

I understand you don’t like this, but I still hope you see value in supporting it. We are getting close to a very minimal patchset, which is also helping to fix bugs in the zoned stack.

If you take a look at the last version abs give some feedback, I’m sure we can end up with a good solution. 

Can you help?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-19  7:34             ` Johannes Thumshirn
@ 2022-05-20  3:47               ` Damien Le Moal
  2022-05-20  6:07                 ` Hannes Reinecke
  0 siblings, 1 reply; 56+ messages in thread
From: Damien Le Moal @ 2022-05-20  3:47 UTC (permalink / raw)
  To: Johannes Thumshirn, Luis Chamberlain
  Cc: Theodore Ts'o, Christoph Hellwig, Pankaj Raghav, axboe,
	pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs

On 5/19/22 16:34, Johannes Thumshirn wrote:
> On 19/05/2022 05:19, Damien Le Moal wrote:
>> On 5/19/22 12:12, Luis Chamberlain wrote:
>>> On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
>>>> On 5/18/22 00:34, Theodore Ts'o wrote:
>>>>> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>>>>>> I'm a little surprised about all this activity.
>>>>>>
>>>>>> I though the conclusion at LSF/MM was that for Linux itself there
>>>>>> is very little benefit in supporting this scheme.  It will massively
>>>>>> fragment the supported based of devices and applications, while only
>>>>>> having the benefit of supporting some Samsung legacy devices.
>>>>>
>>>>> FWIW,
>>>>>
>>>>> That wasn't my impression from that LSF/MM session, but once the
>>>>> videos become available, folks can decide for themselves.
>>>>
>>>> There was no real discussion about zone size constraint on the zone
>>>> storage BoF. Many discussions happened in the hallway track though.
>>>
>>> Right so no direct clear blockers mentioned at all during the BoF.
>>
>> Nor any clear OK.
> 
> So what about creating a device-mapper target, that's taking npo2 drives and
> makes them po2 drives for the FS layers? It will be very similar code to 
> dm-linear.

+1

This will simplify the support for FSes, at least for the initial drop (if
accepted).

And more importantly, this will also allow addressing any potential
problem with user space breaking because of the non power of 2 zone size.

> 
> After all zoned support for FSes started with a device-mapper (dm-zoned) and 
> as the need for a more integrated solution arose, it changed into natiive
> support.
> 
> And all that is there is simple arithmetic and a bio_clone(), if this is the
> slowest part of the stack involving a FS like f2fs or btrfs I'm throwing a
> round of anyone's favorite beverage at next year's LSFMM.
> 
> Byte,
> 	Johannes
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-20  3:47               ` Damien Le Moal
@ 2022-05-20  6:07                 ` Hannes Reinecke
  2022-05-20  6:27                   ` Javier González
  0 siblings, 1 reply; 56+ messages in thread
From: Hannes Reinecke @ 2022-05-20  6:07 UTC (permalink / raw)
  To: Damien Le Moal, Johannes Thumshirn, Luis Chamberlain
  Cc: Theodore Ts'o, Christoph Hellwig, Pankaj Raghav, axboe,
	pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs

On 5/19/22 20:47, Damien Le Moal wrote:
> On 5/19/22 16:34, Johannes Thumshirn wrote:
>> On 19/05/2022 05:19, Damien Le Moal wrote:
>>> On 5/19/22 12:12, Luis Chamberlain wrote:
>>>> On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
>>>>> On 5/18/22 00:34, Theodore Ts'o wrote:
>>>>>> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>>>>>>> I'm a little surprised about all this activity.
>>>>>>>
>>>>>>> I though the conclusion at LSF/MM was that for Linux itself there
>>>>>>> is very little benefit in supporting this scheme.  It will massively
>>>>>>> fragment the supported based of devices and applications, while only
>>>>>>> having the benefit of supporting some Samsung legacy devices.
>>>>>>
>>>>>> FWIW,
>>>>>>
>>>>>> That wasn't my impression from that LSF/MM session, but once the
>>>>>> videos become available, folks can decide for themselves.
>>>>>
>>>>> There was no real discussion about zone size constraint on the zone
>>>>> storage BoF. Many discussions happened in the hallway track though.
>>>>
>>>> Right so no direct clear blockers mentioned at all during the BoF.
>>>
>>> Nor any clear OK.
>>
>> So what about creating a device-mapper target, that's taking npo2 drives and
>> makes them po2 drives for the FS layers? It will be very similar code to
>> dm-linear.
> 
> +1
> 
> This will simplify the support for FSes, at least for the initial drop (if
> accepted).
> 
> And more importantly, this will also allow addressing any potential
> problem with user space breaking because of the non power of 2 zone size.
> 
Seconded (or maybe thirded).

The changes to support npo2 in the block layer are pretty simple, and 
really I don't have an issue with those.
Then adding a device-mapper target transforming npo2 drives in po2 block 
devices should be pretty trivial.

And once that is in you can start arguing with the the FS folks on 
whether to implement it natively.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-20  6:07                 ` Hannes Reinecke
@ 2022-05-20  6:27                   ` Javier González
  2022-05-20  6:41                     ` Damien Le Moal
  2022-05-20  9:30                     ` Johannes Thumshirn
  0 siblings, 2 replies; 56+ messages in thread
From: Javier González @ 2022-05-20  6:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Damien Le Moal, Johannes Thumshirn, Luis Chamberlain,
	Theodore Ts'o, Christoph Hellwig, Pankaj Raghav, axboe,
	pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs,
	Jaegeuk Kim

On 20.05.2022 08:07, Hannes Reinecke wrote:
>On 5/19/22 20:47, Damien Le Moal wrote:
>>On 5/19/22 16:34, Johannes Thumshirn wrote:
>>>On 19/05/2022 05:19, Damien Le Moal wrote:
>>>>On 5/19/22 12:12, Luis Chamberlain wrote:
>>>>>On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
>>>>>>On 5/18/22 00:34, Theodore Ts'o wrote:
>>>>>>>On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>>>>>>>>I'm a little surprised about all this activity.
>>>>>>>>
>>>>>>>>I though the conclusion at LSF/MM was that for Linux itself there
>>>>>>>>is very little benefit in supporting this scheme.  It will massively
>>>>>>>>fragment the supported based of devices and applications, while only
>>>>>>>>having the benefit of supporting some Samsung legacy devices.
>>>>>>>
>>>>>>>FWIW,
>>>>>>>
>>>>>>>That wasn't my impression from that LSF/MM session, but once the
>>>>>>>videos become available, folks can decide for themselves.
>>>>>>
>>>>>>There was no real discussion about zone size constraint on the zone
>>>>>>storage BoF. Many discussions happened in the hallway track though.
>>>>>
>>>>>Right so no direct clear blockers mentioned at all during the BoF.
>>>>
>>>>Nor any clear OK.
>>>
>>>So what about creating a device-mapper target, that's taking npo2 drives and
>>>makes them po2 drives for the FS layers? It will be very similar code to
>>>dm-linear.
>>
>>+1
>>
>>This will simplify the support for FSes, at least for the initial drop (if
>>accepted).
>>
>>And more importantly, this will also allow addressing any potential
>>problem with user space breaking because of the non power of 2 zone size.
>>
>Seconded (or maybe thirded).
>
>The changes to support npo2 in the block layer are pretty simple, and 
>really I don't have an issue with those.
>Then adding a device-mapper target transforming npo2 drives in po2 
>block devices should be pretty trivial.
>
>And once that is in you can start arguing with the the FS folks on 
>whether to implement it natively.
>

So you are suggesting adding support for !PO2 in the block layer and
then a dm to present the device as a PO2 to the FS? This at least
addresses the hole issue for raw zoned block devices, so it can be a
first step.

This said, it seems to me that the changes to the FS are not being a
real issue. In fact, we are exposing some bugs while we generalize the
zone size support.

Could you point out what the challenges in btrfs are in the current
patches, that it makes sense to add an extra dm layer?

Note that for F2FS there is no blocker. Jaegeuk picked the initial
patches, and he agreed to add native support.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-20  6:27                   ` Javier González
@ 2022-05-20  6:41                     ` Damien Le Moal
       [not found]                       ` <CGME20220520065941eucas1p105cf273ede995dc4bf92f3245fad09b1@eucas1p1.samsung.com>
  2022-05-20  9:30                       ` Pankaj Raghav
  2022-05-20  9:30                     ` Johannes Thumshirn
  1 sibling, 2 replies; 56+ messages in thread
From: Damien Le Moal @ 2022-05-20  6:41 UTC (permalink / raw)
  To: Javier González, Hannes Reinecke
  Cc: Johannes Thumshirn, Luis Chamberlain, Theodore Ts'o,
	Christoph Hellwig, Pankaj Raghav, axboe, pankydev8, gost.dev,
	jiangbo.365, linux-nvme, linux-kernel, linux-block,
	linux-fsdevel, dm-devel, dsterba, linux-btrfs, Jaegeuk Kim

On 5/20/22 15:27, Javier González wrote:
> On 20.05.2022 08:07, Hannes Reinecke wrote:
>> On 5/19/22 20:47, Damien Le Moal wrote:
>>> On 5/19/22 16:34, Johannes Thumshirn wrote:
>>>> On 19/05/2022 05:19, Damien Le Moal wrote:
>>>>> On 5/19/22 12:12, Luis Chamberlain wrote:
>>>>>> On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
>>>>>>> On 5/18/22 00:34, Theodore Ts'o wrote:
>>>>>>>> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>>>>>>>>> I'm a little surprised about all this activity.
>>>>>>>>>
>>>>>>>>> I though the conclusion at LSF/MM was that for Linux itself there
>>>>>>>>> is very little benefit in supporting this scheme.  It will massively
>>>>>>>>> fragment the supported based of devices and applications, while only
>>>>>>>>> having the benefit of supporting some Samsung legacy devices.
>>>>>>>>
>>>>>>>> FWIW,
>>>>>>>>
>>>>>>>> That wasn't my impression from that LSF/MM session, but once the
>>>>>>>> videos become available, folks can decide for themselves.
>>>>>>>
>>>>>>> There was no real discussion about zone size constraint on the zone
>>>>>>> storage BoF. Many discussions happened in the hallway track though.
>>>>>>
>>>>>> Right so no direct clear blockers mentioned at all during the BoF.
>>>>>
>>>>> Nor any clear OK.
>>>>
>>>> So what about creating a device-mapper target, that's taking npo2 drives and
>>>> makes them po2 drives for the FS layers? It will be very similar code to
>>>> dm-linear.
>>>
>>> +1
>>>
>>> This will simplify the support for FSes, at least for the initial drop (if
>>> accepted).
>>>
>>> And more importantly, this will also allow addressing any potential
>>> problem with user space breaking because of the non power of 2 zone size.
>>>
>> Seconded (or maybe thirded).
>>
>> The changes to support npo2 in the block layer are pretty simple, and 
>> really I don't have an issue with those.
>> Then adding a device-mapper target transforming npo2 drives in po2 
>> block devices should be pretty trivial.
>>
>> And once that is in you can start arguing with the the FS folks on 
>> whether to implement it natively.
>>
> 
> So you are suggesting adding support for !PO2 in the block layer and
> then a dm to present the device as a PO2 to the FS? This at least
> addresses the hole issue for raw zoned block devices, so it can be a
> first step.

Yes, and it also allows supporting these new !po2 devices without
regressions (read lack of) in the support at FS level.

> 
> This said, it seems to me that the changes to the FS are not being a
> real issue. In fact, we are exposing some bugs while we generalize the
> zone size support.

Not arguing with that. But since we are still stabilizing btrfs ZNS
support, adding more code right now is a little painful.

> 
> Could you point out what the challenges in btrfs are in the current
> patches, that it makes sense to add an extra dm layer?

See above. No real challenge, just needs to be done if a clear agreement
can be reached on zone size alignment constraints. As mentioned above, the
btrfs changes timing is not ideal right now though.

Also please do not forget applications that may expect a power of 2 zone
size. A dm-zsp2 would be a nice solution for these. So regardless of the
FS work, that new DM target will be *very* nice to have.

> 
> Note that for F2FS there is no blocker. Jaegeuk picked the initial
> patches, and he agreed to add native support.

And until that is done, f2fs will not work with these new !po2 devices...
Having the new dm will avoid that support fragmentation which I personally
really dislike. With the new dm, we can keep support for *all* zoned block
devices, albeit needing a different setup depending on the device. That is
not nice at all but at least there is a way to make things work continuously.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
       [not found]                       ` <CGME20220520065941eucas1p105cf273ede995dc4bf92f3245fad09b1@eucas1p1.samsung.com>
@ 2022-05-20  6:59                         ` Javier González
  0 siblings, 0 replies; 56+ messages in thread
From: Javier González @ 2022-05-20  6:59 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Hannes Reinecke, Johannes Thumshirn, Luis Chamberlain,
	Theodore Ts'o, Christoph Hellwig, Pankaj Raghav, axboe,
	pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs,
	Jaegeuk Kim

On 20.05.2022 15:41, Damien Le Moal wrote:
>On 5/20/22 15:27, Javier González wrote:
>> On 20.05.2022 08:07, Hannes Reinecke wrote:
>>> On 5/19/22 20:47, Damien Le Moal wrote:
>>>> On 5/19/22 16:34, Johannes Thumshirn wrote:
>>>>> On 19/05/2022 05:19, Damien Le Moal wrote:
>>>>>> On 5/19/22 12:12, Luis Chamberlain wrote:
>>>>>>> On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
>>>>>>>> On 5/18/22 00:34, Theodore Ts'o wrote:
>>>>>>>>> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>>>>>>>>>> I'm a little surprised about all this activity.
>>>>>>>>>>
>>>>>>>>>> I though the conclusion at LSF/MM was that for Linux itself there
>>>>>>>>>> is very little benefit in supporting this scheme.  It will massively
>>>>>>>>>> fragment the supported based of devices and applications, while only
>>>>>>>>>> having the benefit of supporting some Samsung legacy devices.
>>>>>>>>>
>>>>>>>>> FWIW,
>>>>>>>>>
>>>>>>>>> That wasn't my impression from that LSF/MM session, but once the
>>>>>>>>> videos become available, folks can decide for themselves.
>>>>>>>>
>>>>>>>> There was no real discussion about zone size constraint on the zone
>>>>>>>> storage BoF. Many discussions happened in the hallway track though.
>>>>>>>
>>>>>>> Right so no direct clear blockers mentioned at all during the BoF.
>>>>>>
>>>>>> Nor any clear OK.
>>>>>
>>>>> So what about creating a device-mapper target, that's taking npo2 drives and
>>>>> makes them po2 drives for the FS layers? It will be very similar code to
>>>>> dm-linear.
>>>>
>>>> +1
>>>>
>>>> This will simplify the support for FSes, at least for the initial drop (if
>>>> accepted).
>>>>
>>>> And more importantly, this will also allow addressing any potential
>>>> problem with user space breaking because of the non power of 2 zone size.
>>>>
>>> Seconded (or maybe thirded).
>>>
>>> The changes to support npo2 in the block layer are pretty simple, and
>>> really I don't have an issue with those.
>>> Then adding a device-mapper target transforming npo2 drives in po2
>>> block devices should be pretty trivial.
>>>
>>> And once that is in you can start arguing with the the FS folks on
>>> whether to implement it natively.
>>>
>>
>> So you are suggesting adding support for !PO2 in the block layer and
>> then a dm to present the device as a PO2 to the FS? This at least
>> addresses the hole issue for raw zoned block devices, so it can be a
>> first step.
>
>Yes, and it also allows supporting these new !po2 devices without
>regressions (read lack of) in the support at FS level.
>
>>
>> This said, it seems to me that the changes to the FS are not being a
>> real issue. In fact, we are exposing some bugs while we generalize the
>> zone size support.
>
>Not arguing with that. But since we are still stabilizing btrfs ZNS
>support, adding more code right now is a little painful.
>
>>
>> Could you point out what the challenges in btrfs are in the current
>> patches, that it makes sense to add an extra dm layer?
>
>See above. No real challenge, just needs to be done if a clear agreement
>can be reached on zone size alignment constraints. As mentioned above, the
>btrfs changes timing is not ideal right now though.
>
>Also please do not forget applications that may expect a power of 2 zone
>size. A dm-zsp2 would be a nice solution for these. So regardless of the
>FS work, that new DM target will be *very* nice to have.
>
>>
>> Note that for F2FS there is no blocker. Jaegeuk picked the initial
>> patches, and he agreed to add native support.
>
>And until that is done, f2fs will not work with these new !po2 devices...
>Having the new dm will avoid that support fragmentation which I personally
>really dislike. With the new dm, we can keep support for *all* zoned block
>devices, albeit needing a different setup depending on the device. That is
>not nice at all but at least there is a way to make things work continuously.

All the above sounds very reasonable. Thanks Damien.

If we all can agree, we can address this in the next version and come
maintain the native FS support off-tree until you see that general btrfs
support for zoned devicse is stable. We will be happy to help with this
too.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-19  7:57           ` Johannes Thumshirn
@ 2022-05-20  9:06             ` Pankaj Raghav
  2022-05-20  9:15               ` Johannes Thumshirn
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-20  9:06 UTC (permalink / raw)
  To: Johannes Thumshirn, dsterba
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel

On 5/19/22 09:57, Johannes Thumshirn wrote:
>> Unfortunately it is not possible to just move the WP in zoned devices.
>> The only alternative that I could use is to do write zeroes which are
>> natively supported by some devices such as ZNS. It would be nice to know
>> if someone had a better solution to this instead of doing write zeroes
>> in zoned devices.
>>
> 
> I have another question. In case we need to pad the sb zone with a write
> zeros and have a power fail between the write-zeros and the regular 
> super-block write, what happens? I know this padding is only done for the
> backup super blocks, never the less it can happen and it can happen when
> the primary super block is also corrupted.
> 
> AFAIU we're then trying to reach out for a backup super block, look at the
> write pointer and it only contains zeros but no super block, as only the 
> write-zeros has reached the device and not the super block write.
> 
> How is this situation handled?
> 
That is a very good point. I did think about this situation while adding
padding to the mirror superblock with write zeroes. If the drive is
**less than 4TB** and with the **primary superblock corrupted**, then it
will be an issue with the situation you have described for npo2 drives.
That situation is not handled here. Ofc this is not an issue when we
have the second mirror at 4TB for bigger drives. Do you have some ideas
in mind for this failure mode?
> Thanks,
> 	Johannes

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-19  7:59       ` Naohiro Aota
@ 2022-05-20  9:09         ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-20  9:09 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel

On 5/19/22 09:59, Naohiro Aota wrote:
>> +
>>  static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>> -			   int rw, u64 *bytenr_ret)
>> +			   int rw, int mirror, u64 *bytenr_ret)
>>  {
>>  	u64 wp;
>>  	int ret;
>> +	bool zones_empty = false;
>>  
>>  	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
>>  		*bytenr_ret = zones[0].start << SECTOR_SHIFT;
>> @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
>>  	if (ret != -ENOENT && ret < 0)
>>  		return ret;
>>  
>> +	if (ret == -ENOENT)
>> +		zones_empty = true;
>> +
> 
> I think, we don't need this. We need to issue the zeroout when
> zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) after sending
> ZONE_RESET if necessary. No?
> 
Yeah. I added this extra check to cover all the cases possible. But you
are right that it is enough to issue zeroout only for this condition:
zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) as both the
zones empty is not likely to happen.
>>  	if (rw == WRITE) {
>>  		struct blk_zone *reset = NULL;
>> +		bool is_sb_offset_write_req = false;
>> +		u32 reset_zone_nr = -1;
>>  
>> -		if (wp == zones[0].start << SECTOR_SHIFT)
>> +		if (wp == zones[0].start << SECTOR_SHIFT) {
>>  			reset = &zones[0];
>> -		else if (wp == zones[1].start << SECTOR_SHIFT)
>> +			reset_zone_nr = 0;

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
  2022-05-20  9:06             ` Pankaj Raghav
@ 2022-05-20  9:15               ` Johannes Thumshirn
  0 siblings, 0 replies; 56+ messages in thread
From: Johannes Thumshirn @ 2022-05-20  9:15 UTC (permalink / raw)
  To: Pankaj Raghav, dsterba
  Cc: axboe, damien.lemoal, pankydev8, dsterba, hch, linux-nvme,
	linux-fsdevel, linux-btrfs, jiangbo.365, linux-block, gost.dev,
	linux-kernel, dm-devel

On 20/05/2022 11:07, Pankaj Raghav wrote:
> On 5/19/22 09:57, Johannes Thumshirn wrote:
>>> Unfortunately it is not possible to just move the WP in zoned devices.
>>> The only alternative that I could use is to do write zeroes which are
>>> natively supported by some devices such as ZNS. It would be nice to know
>>> if someone had a better solution to this instead of doing write zeroes
>>> in zoned devices.
>>>
>>
>> I have another question. In case we need to pad the sb zone with a write
>> zeros and have a power fail between the write-zeros and the regular 
>> super-block write, what happens? I know this padding is only done for the
>> backup super blocks, never the less it can happen and it can happen when
>> the primary super block is also corrupted.
>>
>> AFAIU we're then trying to reach out for a backup super block, look at the
>> write pointer and it only contains zeros but no super block, as only the 
>> write-zeros has reached the device and not the super block write.
>>
>> How is this situation handled?
>>
> That is a very good point. I did think about this situation while adding
> padding to the mirror superblock with write zeroes. If the drive is
> **less than 4TB** and with the **primary superblock corrupted**, then it
> will be an issue with the situation you have described for npo2 drives.
> That situation is not handled here. Ofc this is not an issue when we
> have the second mirror at 4TB for bigger drives. Do you have some ideas
> in mind for this failure mode?

The only idea I have for this is creating a bounce buffer, write the padding
and the super-block into the buffer and then submit it. But that's too ugly
to live.

And it would involve changing non-zoned super-block writing code, which I think
is way to risky.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-20  6:41                     ` Damien Le Moal
       [not found]                       ` <CGME20220520065941eucas1p105cf273ede995dc4bf92f3245fad09b1@eucas1p1.samsung.com>
@ 2022-05-20  9:30                       ` Pankaj Raghav
  2022-05-20 17:18                         ` David Sterba
  1 sibling, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-20  9:30 UTC (permalink / raw)
  To: Damien Le Moal, Javier González, Hannes Reinecke,
	Johannes Thumshirn, Mike Snitzer, Christoph Hellwig
  Cc: Luis Chamberlain, Theodore Ts'o, axboe, pankydev8, gost.dev,
	jiangbo.365, linux-nvme, linux-kernel, linux-block,
	linux-fsdevel, dm-devel, dsterba, linux-btrfs, Jaegeuk Kim,
	Keith Busch, Adam Manzanares

On 5/20/22 08:41, Damien Le Moal wrote:
>>>>>
>>>>> So what about creating a device-mapper target, that's taking npo2 drives and
>>>>> makes them po2 drives for the FS layers? It will be very similar code to
>>>>> dm-linear.
>>>>
Keith and Adam had a similar suggestion to go create a device mapper
(dm-unholy) when we tried the po2 emulation[1].
>>>> +1
>>>>
>>>> This will simplify the support for FSes, at least for the initial drop (if
>>>> accepted).
>>>>
>>>> And more importantly, this will also allow addressing any potential
>>>> problem with user space breaking because of the non power of 2 zone size.
>>>>
>>> Seconded (or maybe thirded).
>>>
>>> The changes to support npo2 in the block layer are pretty simple, and 
>>> really I don't have an issue with those.
>>> Then adding a device-mapper target transforming npo2 drives in po2 
>>> block devices should be pretty trivial.
>>>
>>> And once that is in you can start arguing with the the FS folks on 
>>> whether to implement it natively.
>>>
>>
>> So you are suggesting adding support for !PO2 in the block layer and
>> then a dm to present the device as a PO2 to the FS? This at least
>> addresses the hole issue for raw zoned block devices, so it can be a
>> first step.
> 
> Yes, and it also allows supporting these new !po2 devices without
> regressions (read lack of) in the support at FS level.
> 
>>
>> This said, it seems to me that the changes to the FS are not being a
>> real issue. In fact, we are exposing some bugs while we generalize the
>> zone size support.
> 
> Not arguing with that. But since we are still stabilizing btrfs ZNS
> support, adding more code right now is a little painful.
> 
>>
>> Could you point out what the challenges in btrfs are in the current
>> patches, that it makes sense to add an extra dm layer?
> 
> See above. No real challenge, just needs to be done if a clear agreement
> can be reached on zone size alignment constraints. As mentioned above, the
> btrfs changes timing is not ideal right now though.
> 
> Also please do not forget applications that may expect a power of 2 zone
> size. A dm-zsp2 would be a nice solution for these. So regardless of the
> FS work, that new DM target will be *very* nice to have.
> 
>>
>> Note that for F2FS there is no blocker. Jaegeuk picked the initial
>> patches, and he agreed to add native support.
> 
> And until that is done, f2fs will not work with these new !po2 devices...
> Having the new dm will avoid that support fragmentation which I personally
> really dislike. With the new dm, we can keep support for *all* zoned block
> devices, albeit needing a different setup depending on the device. That is
> not nice at all but at least there is a way to make things work continuously.
> 

I see that many people in the community feel it is better to target the
dm layer for the initial support of npo2 devices. I can give it a shot
and maintain a native out-of-tree support for FSs for npo2 devices and
merge it upstream as we see fit later.

[1]
https://lore.kernel.org/all/20220311223032.GA2439@dhcp-10-100-145-180.wdc.com/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-20  6:27                   ` Javier González
  2022-05-20  6:41                     ` Damien Le Moal
@ 2022-05-20  9:30                     ` Johannes Thumshirn
       [not found]                       ` <CGME20220520101610eucas1p1822ca6014e2a1d55ae74476f83c4de1d@eucas1p1.samsung.com>
  1 sibling, 1 reply; 56+ messages in thread
From: Johannes Thumshirn @ 2022-05-20  9:30 UTC (permalink / raw)
  To: Javier González, Hannes Reinecke
  Cc: Damien Le Moal, Luis Chamberlain, Theodore Ts'o,
	Christoph Hellwig, Pankaj Raghav, axboe, pankydev8, gost.dev,
	jiangbo.365, linux-nvme, linux-kernel, linux-block,
	linux-fsdevel, dm-devel, dsterba, linux-btrfs, Jaegeuk Kim

On 20/05/2022 08:27, Javier González wrote:
> So you are suggesting adding support for !PO2 in the block layer and
> then a dm to present the device as a PO2 to the FS? This at least
> addresses the hole issue for raw zoned block devices, so it can be a
> first step.
> 
> This said, it seems to me that the changes to the FS are not being a
> real issue. In fact, we are exposing some bugs while we generalize the
> zone size support.
> 
> Could you point out what the challenges in btrfs are in the current
> patches, that it makes sense to add an extra dm layer?

I personally don't like the padding we need to do for the super block.

As I've pointed out to Pankaj already, I don't think it is 100% powerfail
safe as of now. It could probably be made, but that would also involve
changing non-zoned btrfs code which we try to avoid as much as we can.

As Damien already said, we still have issues with the general zoned 
support in btrfs, just have a look at the list of open issues [1] we
have. 

[1] https://github.com/naota/linux/issues/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
       [not found]                       ` <CGME20220520101610eucas1p1822ca6014e2a1d55ae74476f83c4de1d@eucas1p1.samsung.com>
@ 2022-05-20 10:16                         ` Javier González
  0 siblings, 0 replies; 56+ messages in thread
From: Javier González @ 2022-05-20 10:16 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Hannes Reinecke, Damien Le Moal, Luis Chamberlain,
	Theodore Ts'o, Christoph Hellwig, Pankaj Raghav, axboe,
	pankydev8, gost.dev, jiangbo.365, linux-nvme, linux-kernel,
	linux-block, linux-fsdevel, dm-devel, dsterba, linux-btrfs,
	Jaegeuk Kim

On 20.05.2022 09:30, Johannes Thumshirn wrote:
>On 20/05/2022 08:27, Javier González wrote:
>> So you are suggesting adding support for !PO2 in the block layer and
>> then a dm to present the device as a PO2 to the FS? This at least
>> addresses the hole issue for raw zoned block devices, so it can be a
>> first step.
>>
>> This said, it seems to me that the changes to the FS are not being a
>> real issue. In fact, we are exposing some bugs while we generalize the
>> zone size support.
>>
>> Could you point out what the challenges in btrfs are in the current
>> patches, that it makes sense to add an extra dm layer?
>
>I personally don't like the padding we need to do for the super block.
>
>As I've pointed out to Pankaj already, I don't think it is 100% powerfail
>safe as of now. It could probably be made, but that would also involve
>changing non-zoned btrfs code which we try to avoid as much as we can.
>
>As Damien already said, we still have issues with the general zoned
>support in btrfs, just have a look at the list of open issues [1] we
>have.
>
Sounds good Johannes. I understand that the priority is to make btrfs
stable now, before introducing more variables. Let's stick to this and
then we can bring it back as the list of open issues becomes more
manageable.

>[1] https://protect2.fireeye.com/v1/url?k=f14a1d6f-90c10859-f14b9620-74fe485fffe0-3f1861e7739d8cc7&q=1&e=213fcc28-3f9d-41a1-b653-0dc0e203c718&u=https%3A%2F%2Fgithub.com%2Fnaota%2Flinux%2Fissues%2F

Thanks for sharing this too. It is a good way to where to help

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-20  9:30                       ` Pankaj Raghav
@ 2022-05-20 17:18                         ` David Sterba
  2022-05-23  8:25                           ` Pankaj Raghav
  0 siblings, 1 reply; 56+ messages in thread
From: David Sterba @ 2022-05-20 17:18 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Damien Le Moal, Javier González, Hannes Reinecke,
	Johannes Thumshirn, Mike Snitzer, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, axboe, pankydev8, gost.dev,
	jiangbo.365, linux-nvme, linux-kernel, linux-block,
	linux-fsdevel, dm-devel, dsterba, linux-btrfs, Jaegeuk Kim,
	Keith Busch, Adam Manzanares

On Fri, May 20, 2022 at 11:30:09AM +0200, Pankaj Raghav wrote:
> On 5/20/22 08:41, Damien Le Moal wrote:
> >> Note that for F2FS there is no blocker. Jaegeuk picked the initial
> >> patches, and he agreed to add native support.
> > 
> > And until that is done, f2fs will not work with these new !po2 devices...
> > Having the new dm will avoid that support fragmentation which I personally
> > really dislike. With the new dm, we can keep support for *all* zoned block
> > devices, albeit needing a different setup depending on the device. That is
> > not nice at all but at least there is a way to make things work continuously.
> 
> I see that many people in the community feel it is better to target the
> dm layer for the initial support of npo2 devices. I can give it a shot
> and maintain a native out-of-tree support for FSs for npo2 devices and
> merge it upstream as we see fit later.

Some of the changes from your patchset are cleanups or abstracting the
alignment and zone calculations, so this can be merged to minimize the
out of tree code.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
  2022-05-20 17:18                         ` David Sterba
@ 2022-05-23  8:25                           ` Pankaj Raghav
  0 siblings, 0 replies; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-23  8:25 UTC (permalink / raw)
  To: dsterba, Damien Le Moal, Javier González, Hannes Reinecke,
	Johannes Thumshirn, Mike Snitzer, Christoph Hellwig,
	Luis Chamberlain, Theodore Ts'o, axboe, pankydev8, gost.dev,
	jiangbo.365, linux-nvme, linux-kernel, linux-block,
	linux-fsdevel, dm-devel, dsterba, linux-btrfs, Jaegeuk Kim,
	Keith Busch, Adam Manzanares

On 5/20/22 19:18, David Sterba wrote:
>> I see that many people in the community feel it is better to target the
>> dm layer for the initial support of npo2 devices. I can give it a shot
>> and maintain a native out-of-tree support for FSs for npo2 devices and
>> merge it upstream as we see fit later.
> 
> Some of the changes from your patchset are cleanups or abstracting the
> alignment and zone calculations, so this can be merged to minimize the
> out of tree code.
Sounds good. I will send it separately. Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
  2022-05-16 13:39   ` [PATCH v4 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
@ 2022-05-16 13:54     ` Damien Le Moal
  0 siblings, 0 replies; 56+ messages in thread
From: Damien Le Moal @ 2022-05-16 13:54 UTC (permalink / raw)
  To: Pankaj Raghav, axboe, naohiro.aota, Johannes.Thumshirn, snitzer,
	dsterba, jaegeuk, hch
  Cc: linux-btrfs, linux-kernel, jonathan.derrick, bvanassche,
	Keith Busch, gost.dev, linux-nvme, Johannes Thumshirn,
	Josef Bacik, linux-block, Alasdair Kergon, matias.bjorling,
	Jens Axboe, Sagi Grimberg, dm-devel, jiangbo.365,
	Chaitanya Kulkarni, linux-fsdevel, Chris Mason, Luis Chamberlain,
	Hannes Reinecke

On 2022/05/16 15:39, Pankaj Raghav wrote:
> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
> also work for non-power-of-2 zone sizes.
> 
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is kept for those devices.
> 
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  block/blk-zoned.c      | 13 ++++++++++---
>  include/linux/blkdev.h |  8 +++++++-
>  2 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 38cd840d8..140230134 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -111,16 +111,23 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
>   * blkdev_nr_zones - Get number of zones
>   * @disk:	Target gendisk
>   *
> - * Return the total number of zones of a zoned block device.  For a block
> - * device without zone capabilities, the number of zones is always 0.
> + * Return the total number of zones of a zoned block device, including the
> + * eventual small last zone if present. For a block device without zone
> + * capabilities, the number of zones is always 0.
>   */
>  unsigned int blkdev_nr_zones(struct gendisk *disk)
>  {
>  	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
> +	sector_t capacity = get_capacity(disk);
>  
>  	if (!blk_queue_is_zoned(disk->queue))
>  		return 0;
> -	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return (capacity + zone_sectors - 1) >>
> +		       ilog2(zone_sectors);

Why the line break here ? This fits on one line, no ?

> +
> +	return div64_u64(capacity + zone_sectors - 1, zone_sectors);
>  }
>  EXPORT_SYMBOL_GPL(blkdev_nr_zones);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1b24c1fb3..22fe512ee 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -675,9 +675,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
>  static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>  					     sector_t sector)
>  {
> +	sector_t zone_sectors = blk_queue_zone_sectors(q);
> +
>  	if (!blk_queue_is_zoned(q))
>  		return 0;
> -	return sector >> ilog2(q->limits.chunk_sectors);
> +
> +	if (is_power_of_2(zone_sectors))
> +		return sector >> ilog2(zone_sectors);
> +
> +	return div64_u64(sector, zone_sectors);
>  }
>  
>  static inline bool blk_queue_zone_is_seq(struct request_queue *q,


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v4 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
       [not found] ` <CGME20220516133924eucas1p1817f306e3f2442088bf49ab513657cbe@eucas1p1.samsung.com>
@ 2022-05-16 13:39   ` Pankaj Raghav
  2022-05-16 13:54     ` Damien Le Moal
  0 siblings, 1 reply; 56+ messages in thread
From: Pankaj Raghav @ 2022-05-16 13:39 UTC (permalink / raw)
  To: axboe, naohiro.aota, damien.lemoal, Johannes.Thumshirn, snitzer,
	dsterba, jaegeuk, hch
  Cc: linux-btrfs, linux-kernel, jonathan.derrick, bvanassche,
	Keith Busch, gost.dev, linux-nvme, Johannes Thumshirn,
	Josef Bacik, linux-block, Alasdair Kergon, matias.bjorling,
	Jens Axboe, Sagi Grimberg, dm-devel, jiangbo.365,
	Chaitanya Kulkarni, linux-fsdevel, Chris Mason, Pankaj Raghav,
	Luis Chamberlain, Hannes Reinecke

Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-zoned.c      | 13 ++++++++++---
 include/linux/blkdev.h |  8 +++++++-
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 38cd840d8..140230134 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -111,16 +111,23 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
  * blkdev_nr_zones - Get number of zones
  * @disk:	Target gendisk
  *
- * Return the total number of zones of a zoned block device.  For a block
- * device without zone capabilities, the number of zones is always 0.
+ * Return the total number of zones of a zoned block device, including the
+ * eventual small last zone if present. For a block device without zone
+ * capabilities, the number of zones is always 0.
  */
 unsigned int blkdev_nr_zones(struct gendisk *disk)
 {
 	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
+	sector_t capacity = get_capacity(disk);
 
 	if (!blk_queue_is_zoned(disk->queue))
 		return 0;
-	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return (capacity + zone_sectors - 1) >>
+		       ilog2(zone_sectors);
+
+	return div64_u64(capacity + zone_sectors - 1, zone_sectors);
 }
 EXPORT_SYMBOL_GPL(blkdev_nr_zones);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1b24c1fb3..22fe512ee 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -675,9 +675,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 					     sector_t sector)
 {
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+
 	if (!blk_queue_is_zoned(q))
 		return 0;
-	return sector >> ilog2(q->limits.chunk_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return sector >> ilog2(zone_sectors);
+
+	return div64_u64(sector, zone_sectors);
 }
 
 static inline bool blk_queue_zone_is_seq(struct request_queue *q,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2022-05-23  8:26 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20220516165418eucas1p2be592d9cd4b35f6b71d39ccbe87f3fef@eucas1p2.samsung.com>
2022-05-16 16:54 ` [PATCH v4 00/13] support non power of 2 zoned devices Pankaj Raghav
     [not found]   ` <CGME20220516165419eucas1p104aadda60df323e6154bfc3b92103b7b@eucas1p1.samsung.com>
2022-05-16 16:54     ` [PATCH v4 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
     [not found]   ` <CGME20220516165421eucas1p2515446ac290987bdb9af24ffb835b287@eucas1p2.samsung.com>
2022-05-16 16:54     ` [PATCH v4 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size Pankaj Raghav
2022-05-16 19:05       ` Pankaj Raghav
     [not found]   ` <CGME20220516165422eucas1p174acec28848a9c2178376f092af3fa1c@eucas1p1.samsung.com>
2022-05-16 16:54     ` [PATCH v4 03/13] nvme: zns: Allow ZNS drives that have non-power_of_2 " Pankaj Raghav
     [not found]   ` <CGME20220516165424eucas1p2ee38cd64260539e5cac8d1fa4d0cba38@eucas1p2.samsung.com>
2022-05-16 16:54     ` [PATCH v4 04/13] nvmet: Allow ZNS target to support non-power_of_2 zone sizes Pankaj Raghav
2022-05-17 14:19       ` Johannes Thumshirn
     [not found]   ` <CGME20220516165425eucas1p29fcd11d7051d9d3a9a9efc17cd3b6999@eucas1p2.samsung.com>
2022-05-16 16:54     ` [PATCH v4 05/13] btrfs: zoned: Cache superblock location in btrfs_zoned_device_info Pankaj Raghav
2022-05-16 21:58       ` David Sterba
2022-05-17  7:55         ` Pankaj Raghav
     [not found]   ` <CGME20220516165427eucas1p1cfd87ca44ec314ea1d2ddc8ece7259f9@eucas1p1.samsung.com>
2022-05-16 16:54     ` [PATCH v4 06/13] btrfs: zoned: Make sb_zone_number function non power of 2 compatible Pankaj Raghav
2022-05-17  6:53       ` Johannes Thumshirn
2022-05-17 11:51         ` David Sterba
     [not found]   ` <CGME20220516165428eucas1p1374b5f9592db3ca6a6551aff975537ce@eucas1p1.samsung.com>
2022-05-16 16:54     ` [PATCH v4 07/13] btrfs: zoned: use generic btrfs zone helpers to support npo2 zoned devices Pankaj Raghav
2022-05-17 12:30       ` David Sterba
2022-05-18  9:40         ` Pankaj Raghav
2022-05-18 11:21           ` David Sterba
2022-05-19  4:13       ` Naohiro Aota
     [not found]   ` <CGME20220516165429eucas1p272c8b4325a488675f08f2d7016aa6230@eucas1p2.samsung.com>
2022-05-16 16:54     ` [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets Pankaj Raghav
2022-05-17  6:50       ` Johannes Thumshirn
2022-05-17  8:00         ` Pankaj Raghav
2022-05-17 12:42       ` David Sterba
2022-05-18  9:15         ` Pankaj Raghav
2022-05-19  7:57           ` Johannes Thumshirn
2022-05-20  9:06             ` Pankaj Raghav
2022-05-20  9:15               ` Johannes Thumshirn
2022-05-19  7:59       ` Naohiro Aota
2022-05-20  9:09         ` Pankaj Raghav
     [not found]   ` <CGME20220516165430eucas1p214cca8eaba1db2c98d947444cad4f18f@eucas1p2.samsung.com>
2022-05-16 16:54     ` [PATCH v4 09/13] btrfs: zoned: relax the alignment constraint for zoned devices Pankaj Raghav
     [not found]   ` <CGME20220516165432eucas1p2e1ea74d44738e44745f49e37b6b9e503@eucas1p2.samsung.com>
2022-05-16 16:54     ` [PATCH v4 10/13] zonefs: allow non power of 2 " Pankaj Raghav
     [not found]   ` <CGME20220516165434eucas1p12b178fb83cc93470933e3d72c40e9004@eucas1p1.samsung.com>
2022-05-16 16:54     ` [PATCH v4 11/13] null_blk: " Pankaj Raghav
2022-05-17  4:12       ` kernel test robot
     [not found]   ` <CGME20220516165435eucas1p1dff8d9d039a76278ef1c09dba4b4e1fe@eucas1p1.samsung.com>
2022-05-16 16:54     ` [PATCH v4 12/13] null_blk: use zone_size_sects_shift for " Pankaj Raghav
     [not found]   ` <CGME20220516165436eucas1p178d079302dae3a9fca696b13b0390deb@eucas1p1.samsung.com>
2022-05-16 16:54     ` [PATCH v4 13/13] dm-zoned: ensure only power of 2 zone sizes are allowed Pankaj Raghav
2022-05-17  8:10   ` [PATCH v4 00/13] support non power of 2 zoned devices Christoph Hellwig
2022-05-17  9:18     ` Javier González
2022-05-18  8:00       ` Christoph Hellwig
2022-05-19 15:25         ` Javier González
2022-05-17 15:34     ` [dm-devel] " Theodore Ts'o
2022-05-18 23:06       ` Luis Chamberlain
2022-05-19  3:08       ` Damien Le Moal
2022-05-19  3:12         ` Luis Chamberlain
2022-05-19  3:19           ` Damien Le Moal
2022-05-19  7:34             ` Johannes Thumshirn
2022-05-20  3:47               ` Damien Le Moal
2022-05-20  6:07                 ` Hannes Reinecke
2022-05-20  6:27                   ` Javier González
2022-05-20  6:41                     ` Damien Le Moal
     [not found]                       ` <CGME20220520065941eucas1p105cf273ede995dc4bf92f3245fad09b1@eucas1p1.samsung.com>
2022-05-20  6:59                         ` Javier González
2022-05-20  9:30                       ` Pankaj Raghav
2022-05-20 17:18                         ` David Sterba
2022-05-23  8:25                           ` Pankaj Raghav
2022-05-20  9:30                     ` Johannes Thumshirn
     [not found]                       ` <CGME20220520101610eucas1p1822ca6014e2a1d55ae74476f83c4de1d@eucas1p1.samsung.com>
2022-05-20 10:16                         ` Javier González
2022-05-16 13:39 Pankaj Raghav
     [not found] ` <CGME20220516133924eucas1p1817f306e3f2442088bf49ab513657cbe@eucas1p1.samsung.com>
2022-05-16 13:39   ` [PATCH v4 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze Pankaj Raghav
2022-05-16 13:54     ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).