* [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
       [not found] <CGME20220308165414eucas1p106df0bd6a901931215cfab81660a4564@eucas1p1.samsung.com>
@ 2022-03-08 16:53 ` Pankaj Raghav
       [not found]   ` <CGME20220308165421eucas1p20575444f59702cd5478cb35fce8b72cd@eucas1p2.samsung.com>
                     ` (6 more replies)
  0 siblings, 7 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav


#Motivation:
There are ZNS drives in production and deployment today that do not
have a power_of_2 (PO2) zone size. The NVMe ZNS specification does not
mandate a PO2 zone size, but the Linux block layer currently requires
zoned devices to have PO2 zone sizes.

As a result, many in-kernel users such as F2FS and BTRFS, as well as
userspace applications, are designed around the assumption that zone
sizes are PO2.

This patchset aims to support non-power_of_2 zoned devices by adding an
emulation layer for NVMe ZNS devices, without affecting existing
applications and without regressing the current upstream implementation.

#Implementation:
A new callback is added to the block device operations (fops) which is
called when the driver needs special handling for a discovered
non-power_of_2 zoned device. This patchset wires up the callback only
for NVMe ZNS and for the null block driver (the latter to measure
performance). The SCSI ZAC/ZBC implementation is untouched.

Emulation is enabled by statically remapping the zones in the host
only; whenever a request is sent to the device via the block layer, the
sector is transformed to the actual device sector, as sketched below.
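
The remapping arithmetic boils down to the following (a simplified,
self-contained sketch of what patches 4 and 6 do; the struct and
function names are illustrative and kernel-style types are assumed, this
is not the actual driver code):

struct po2_emu {
	u64 zsze;	/* real device zone size in sectors            */
	u64 zsze_po2;	/* emulated (rounded-up) zone size in sectors  */
	u64 zsze_diff;	/* zsze_po2 - zsze, the per-zone gap           */
};

/* host (emulated) sector -> real device sector, applied on submission */
static u64 emu_to_device_sector(struct po2_emu *e, u64 sector)
{
	u64 zone_idx = sector / e->zsze_po2;	/* an ilog2() shift in the driver */

	return sector - zone_idx * e->zsze_diff;
}

/* real device sector -> host sector, applied e.g. to zone append results */
static u64 device_to_emu_sector(struct po2_emu *e, u64 sector)
{
	u64 zone_idx = sector / e->zsze;

	return sector + zone_idx * e->zsze_diff;
}

With the test configuration below (96M real zones emulated as 128M
zones), the start of emulated zone 2 (two 128M zones into the emulated
layout) maps to the start of device zone 2 (two 96M zones into the
device). I/O that falls into the emulated gap between the real zone end
and the power_of_2 boundary is handled separately: reads are zero-filled
and writes are failed (see patch 4).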

#Testing:
Two things need to be tested: that the upstream implementation for PO2
zone sizes does not regress, and the emulation implementation itself.

To do an apples-to-apples comparison, the following device specs were chosen
for testing (both on null_blk and QEMU):
PO2 device:  zone.size=128M zone.cap=96M
NPO2 device: zone.size=96M zone.cap=96M

##Regression:
These tests were run on a **PO2 device**.
PO2 device used:  zone.size=128M zone.cap=96M

###blktests:
Blktests were executed with the following config:

TEST_DEVS=(/dev/nvme0n2)
TIMEOUT=100
RUN_ZONED_TESTS=1

The block and zbd test groups were run and no regressions were found.

###Performance:
Performance tests were performed on a null_blk device. The following fio
command was used to measure performance:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G \
    --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k --loops=4

No regressions were found with the patches on a **PO2 device** compared
to the existing upstream implementation.

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  155     |  604     |   6.00    |  426     |  1663    |   8.77    |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  157     |  613     |   5.92    |  425     |  1741    |   8.79    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  607     |  2370    |   12.06   |  622     |  2431    |   23.61   |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  621     |  2425    |   11.80   |  633     |  2472    |   23.24   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  165     |  643     |   5.72    |  485     |  1896    |   8.03    |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  167     |  654     |   5.62    |  483     |  1888    |   8.06    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  696     |  2718    |   11.29   |  692     |  2701    |   22.92   |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  696     |  2718    |   11.29   |  730     |  2835    |   21.70   |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  159     |  623     |   5.86    |  451     |  1760    |   8.58    |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  163     |  635     |   5.75    |  462     |  1806    |   8.36    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  544     |  2124    |   14.44   |  553     |  2162    |   28.64   |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  554     |  2165    |   14.15   |  556     |  2171    |   28.52   |
x-----------------x---------------------------------x---------------------------------x

##Emulated device
NPO2 device: zone.size=96M zone.cap=96M

###blktests:
Blktests were executed with the following config:

TEST_DEVS=(/dev/nvme0n2)
TIMEOUT=100
RUN_ZONED_TESTS=1

The block and zbd test groups were run and all tests pass.

###Performance:
Performance tests were performed on a null_blk device. The following fio
command was used to measure performance:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G \
    --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k --loops=4

On average, the NPO2 device showed a performance degradation of less than 1%
compared to the PO2 device.

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential write:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  155     |  606     |   5.99    |  424     |  1655    |   8.83    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  609     |  2378    |   12.04   |  620     |  2421    |   23.75   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  160     |  623     |   5.91    |  481     |  1878    |   8.11    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  696     |  2720    |   11.28   |  722     |  2819    |   21.96   |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  155     |  607     |   6.03    |  465     |  1817    |   8.31    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  552     |  2158    |   14.21   |  561     |  2190    |   28.27   |
x-----------------x---------------------------------x---------------------------------x

#TODO:
- The current implementation works only with the NVMe PCI transport, to
  limit the scope and impact.
  Support for the NVMe target will follow soon.

Pankaj Raghav (6):
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  block: Add npo2_zone_setup callback to block device fops
  block: add a bool member to request_queue for power_of_2 emulation
  nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  null_blk: forward the sector value from null_handle_memory_backed
  null_blk: Add support for power_of_2 emulation to the null blk device

 block/blk-zoned.c                 |   3 +
 drivers/block/null_blk/main.c     |  18 +--
 drivers/block/null_blk/null_blk.h |  12 ++
 drivers/block/null_blk/zoned.c    | 203 ++++++++++++++++++++++++++----
 drivers/nvme/host/core.c          |  28 +++--
 drivers/nvme/host/nvme.h          | 100 ++++++++++++++-
 drivers/nvme/host/pci.c           |   4 +
 drivers/nvme/host/zns.c           |  86 +++++++++++--
 include/linux/blk-mq.h            |   2 +
 include/linux/blkdev.h            |  25 ++++
 10 files changed, 428 insertions(+), 53 deletions(-)

-- 
2.25.1



* [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
       [not found]   ` <CGME20220308165421eucas1p20575444f59702cd5478cb35fce8b72cd@eucas1p2.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  2022-03-08 17:14       ` Keith Busch
                         ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

Remove the condition that disallows ZNS drives with a non-power_of_2
zone size from being updated, and use a generic division to calculate
the number of zones instead of relying on a log-and-shift calculation
on the zone size.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/zns.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 9f81beb4df4e..ad02c61c0b52 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	}
 
 	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
-	if (!is_power_of_2(ns->zsze)) {
-		dev_warn(ns->ctrl->device,
-			"invalid zone size:%llu for namespace:%u\n",
-			ns->zsze, ns->head->ns_id);
-		status = -ENODEV;
-		goto free_data;
-	}
 
 	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 				   sizeof(struct nvme_zone_descriptor);
 
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+			 get_capacity(ns->disk) / ns->zsze);
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
-- 
2.25.1



* [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops
       [not found]   ` <CGME20220308165428eucas1p14ea0a38eef47055c4fa41d695c5a249d@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  2022-03-09  3:46       ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

A new callback is added to the block device operations (fops) which is
used to set up the necessary hooks when a non-power_of_2 zone size is
detected on a zoned device.

This fops is called as part of blk_revalidate_disk_zones.

The primary use case for this callback is zoned devices that do not have
a power_of_2 zone size, such as ZNS drives. For example, the current
NVMe ZNS specification does not require the zone size to be a
power_of_2, but the Linux block layer still expects all zoned devices to
have a power_of_2 zone size.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-zoned.c      | 3 +++
 include/linux/blkdev.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 602bef54c813..d3d821797559 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -575,6 +575,9 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	if (!get_capacity(disk))
 		return -EIO;
 
+	if (disk->fops->npo2_zone_setup)
+		disk->fops->npo2_zone_setup(disk);
+
 	/*
 	 * Ensure that all memory allocations in this context are done as if
 	 * GFP_NOIO was specified.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a12c031af887..08cf039c1622 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1472,6 +1472,7 @@ struct block_device_operations {
 	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
 	int (*report_zones)(struct gendisk *, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
+	void (*npo2_zone_setup)(struct gendisk *disk);
 	char *(*devnode)(struct gendisk *disk, umode_t *mode);
 	/* returns the length of the identifier or a negative errno: */
 	int (*get_unique_id)(struct gendisk *disk, u8 id[16],
-- 
2.25.1



* [PATCH 3/6] block: add a bool member to request_queue for power_of_2 emulation
       [not found]   ` <CGME20220308165432eucas1p18b36a238ef3f5a812ee7f9b0e52599a5@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

A new member is added to the request_queue struct to indicate whether
power_of_2 emulation is enabled. Helpers are also added to get and set
that member.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/blkdev.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08cf039c1622..3a5d5ddc779c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -463,6 +463,7 @@ struct request_queue {
 	unsigned long		*seq_zones_wlock;
 	unsigned int		max_open_zones;
 	unsigned int		max_active_zones;
+	bool			po2_zone_emu;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 	int			node;
@@ -705,6 +706,18 @@ static inline unsigned int queue_max_active_zones(const struct request_queue *q)
 {
 	return q->max_active_zones;
 }
+
+static inline void blk_queue_po2_zone_emu(struct request_queue *q,
+					  bool po2_zone_emu)
+{
+	q->po2_zone_emu = po2_zone_emu;
+}
+
+static inline bool blk_queue_is_po2_zone_emu(struct request_queue *q)
+{
+	return q->po2_zone_emu;
+}
+
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 {
@@ -728,6 +741,17 @@ static inline unsigned int queue_max_active_zones(const struct request_queue *q)
 {
 	return 0;
 }
+
+static inline bool blk_queue_is_po2_zone_emu(struct request_queue *q)
+{
+	return false;
+}
+
+static inline void blk_queue_po2_zone_emu(struct request_queue *q,
+					  unsigned int po2_zone_emu)
+{
+}
+
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline unsigned int blk_queue_depth(struct request_queue *q)
-- 
2.25.1



* [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
       [not found]   ` <CGME20220308165436eucas1p1b76f3cb5b4fa1f7d78b51a3b1b44d160@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  2022-03-09  4:04       ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

A power_of_2 (PO2) zone size is not a requirement of the NVMe ZNS
specification; however, there are still many PO2 assumptions in place in
the block layer, in filesystems such as F2FS and btrfs, and in userspace
applications.

To keep those assumptions intact, provide an emulation layer for
non-power_of_2 zoned devices that does not create a performance
regression for existing zoned storage devices with PO2 zone sizes.
Callbacks are provided where needed in the hot paths to reduce the
performance impact.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/core.c |  28 +++++++----
 drivers/nvme/host/nvme.h | 100 ++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/pci.c  |   4 ++
 drivers/nvme/host/zns.c  |  79 +++++++++++++++++++++++++++++--
 include/linux/blk-mq.h   |   2 +
 5 files changed, 200 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fd4720d37cc0..c7180d512b08 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -327,14 +327,6 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
 	return RETRY;
 }
 
-static inline void nvme_end_req_zoned(struct request *req)
-{
-	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-	    req_op(req) == REQ_OP_ZONE_APPEND)
-		req->__sector = nvme_lba_to_sect(req->q->queuedata,
-			le64_to_cpu(nvme_req(req)->result.u64));
-}
-
 static inline void nvme_end_req(struct request *req)
 {
 	blk_status_t status = nvme_error_status(nvme_req(req)->status);
@@ -676,6 +668,12 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
 }
 EXPORT_SYMBOL_GPL(nvme_fail_nonready_command);
 
+blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req)
+{
+	return nvme_zone_handle_po2_emu_violation(req);
+}
+EXPORT_SYMBOL_GPL(nvme_fail_po2_zone_emu_violation);
+
 bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 		bool queue_live)
 {
@@ -879,6 +877,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 	req->special_vec.bv_offset = offset_in_page(range);
 	req->special_vec.bv_len = alloc_size;
 	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+	nvme_verify_sector_value(ns, req);
 
 	return BLK_STS_OK;
 }
@@ -909,6 +908,7 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 			break;
 		}
 	}
+	nvme_verify_sector_value(ns, req);
 
 	return BLK_STS_OK;
 }
@@ -973,6 +973,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
 	cmnd->rw.control = cpu_to_le16(control);
 	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
+	nvme_verify_sector_value(ns, req);
 	return 0;
 }
 
@@ -2105,8 +2106,14 @@ static int nvme_report_zones(struct gendisk *disk, sector_t sector,
 	return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
 			data);
 }
+
+static void nvme_npo2_zone_setup(struct gendisk *disk)
+{
+	nvme_ns_po2_zone_emu_setup(disk->private_data);
+}
 #else
 #define nvme_report_zones	NULL
+#define nvme_npo2_zone_setup	NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static const struct block_device_operations nvme_bdev_ops = {
@@ -2116,6 +2123,7 @@ static const struct block_device_operations nvme_bdev_ops = {
 	.release	= nvme_release,
 	.getgeo		= nvme_getgeo,
 	.report_zones	= nvme_report_zones,
+	.npo2_zone_setup = nvme_npo2_zone_setup,
 	.pr_ops		= &nvme_pr_ops,
 };
 
@@ -3844,6 +3852,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
 	ns->disk = disk;
 	ns->queue = disk->queue;
 
+#ifdef CONFIG_BLK_DEV_ZONED
+	ns->sect_to_lba = __nvme_sect_to_lba;
+	ns->update_sector_append = nvme_update_sector_append_noop;
+#endif
 	if (ctrl->opts && ctrl->opts->data_digest)
 		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a162f6c6da6e..f584f760e8cc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -457,6 +457,10 @@ struct nvme_ns {
 	u8 pi_type;
 #ifdef CONFIG_BLK_DEV_ZONED
 	u64 zsze;
+	u64 zsze_po2;
+	u32 zsze_diff;
+	u64 (*sect_to_lba)(struct nvme_ns *ns, sector_t sector);
+	sector_t (*update_sector_append)(struct nvme_ns *ns, sector_t sector);
 #endif
 	unsigned long features;
 	unsigned long flags;
@@ -562,12 +566,21 @@ static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
 	return ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
 }
 
+static inline u64 __nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
+{
+	return sector >> (ns->lba_shift - SECTOR_SHIFT);
+}
+
 /*
  * Convert a 512B sector number to a device logical block number.
  */
 static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
 {
-	return sector >> (ns->lba_shift - SECTOR_SHIFT);
+#ifdef CONFIG_BLK_DEV_ZONED
+	return ns->sect_to_lba(ns, sector);
+#else
+	return __nvme_sect_to_lba(ns, sector);
+#endif
 }
 
 /*
@@ -578,6 +591,83 @@ static inline sector_t nvme_lba_to_sect(struct nvme_ns *ns, u64 lba)
 	return lba << (ns->lba_shift - SECTOR_SHIFT);
 }
 
+#ifdef CONFIG_BLK_DEV_ZONED
+static inline u64 __nvme_sect_to_lba_po2(struct nvme_ns *ns, sector_t sector)
+{
+	sector_t zone_idx = sector >> ilog2(ns->zsze_po2);
+
+	sector = sector - (zone_idx * ns->zsze_diff);
+
+	return sector >> (ns->lba_shift - SECTOR_SHIFT);
+}
+
+static inline sector_t
+nvme_update_sector_append_po2_zone_emu(struct nvme_ns *ns, sector_t sector)
+{
+	/* The sector value passed by the drive after an append operation is
+	 * based on the actual zone layout in the device, hence, use the actual
+	 * zone_size to calculate the zone number from the sector.
+	 */
+	u32 zone_no = sector / ns->zsze;
+
+	sector += ns->zsze_diff * zone_no;
+	return sector;
+}
+
+static inline sector_t nvme_update_sector_append_noop(struct nvme_ns *ns,
+						      sector_t sector)
+{
+	return sector;
+}
+
+static inline void nvme_end_req_zoned(struct request *req)
+{
+	if (req_op(req) == REQ_OP_ZONE_APPEND) {
+		struct nvme_ns *ns = req->q->queuedata;
+		sector_t sector;
+
+		sector = nvme_lba_to_sect(ns,
+					  le64_to_cpu(nvme_req(req)->result.u64));
+
+		sector = ns->update_sector_append(ns, sector);
+
+		req->__sector = sector;
+	}
+}
+
+static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
+{
+	if (unlikely(blk_queue_is_po2_zone_emu(ns->queue))) {
+		sector_t sector = blk_rq_pos(req);
+		sector_t zone_idx = sector >> ilog2(ns->zsze_po2);
+		sector_t device_sector = sector - (zone_idx * ns->zsze_diff);
+
+		/* Check if the IO is in the emulated area */
+		if (device_sector - (zone_idx * ns->zsze) > ns->zsze)
+			req->rq_flags |= RQF_ZONE_EMU_VIOLATION;
+	}
+}
+
+static inline bool nvme_po2_zone_emu_violation(struct request *req)
+{
+	return req->rq_flags & RQF_ZONE_EMU_VIOLATION;
+}
+#else
+static inline void nvme_end_req_zoned(struct request *req)
+{
+}
+
+static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
+{
+}
+
+static inline bool nvme_po2_zone_emu_violation(struct request *req)
+{
+	return false;
+}
+
+#endif
+
 /*
  * Convert byte length to nvme's 0-based num dwords
  */
@@ -752,6 +842,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg);
 int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
+blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req);
 
 extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct pr_ops nvme_pr_ops;
@@ -873,11 +964,13 @@ static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 int nvme_revalidate_zones(struct nvme_ns *ns);
 int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
+void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns);
 #ifdef CONFIG_BLK_DEV_ZONED
 int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf);
 blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
 				       struct nvme_command *cmnd,
 				       enum nvme_zone_mgmt_action action);
+blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req);
 #else
 static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
@@ -892,6 +985,11 @@ static inline int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 		 "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
 	return -EPROTONOSUPPORT;
 }
+
+static inline blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
+{
+	return BLK_STS_OK;
+}
 #endif
 
 static inline struct nvme_ns *nvme_get_ns_from_dev(struct device *dev)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed680915..fc022df3f98e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -960,6 +960,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		return nvme_fail_nonready_command(&dev->ctrl, req);
 
 	ret = nvme_prep_rq(dev, req);
+
+	if (unlikely(nvme_po2_zone_emu_violation(req)))
+		return nvme_fail_po2_zone_emu_violation(req);
+
 	if (unlikely(ret))
 		return ret;
 	spin_lock(&nvmeq->sq_lock);
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index ad02c61c0b52..25516a5ae7e2 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -3,7 +3,9 @@
  * Copyright (C) 2020 Western Digital Corporation or its affiliates.
  */
 
+#include <linux/log2.h>
 #include <linux/blkdev.h>
+#include <linux/math.h>
 #include <linux/vmalloc.h>
 #include "nvme.h"
 
@@ -46,6 +48,18 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
 	return 0;
 }
 
+static sector_t nvme_zone_size(struct nvme_ns *ns)
+{
+	sector_t zone_size;
+
+	if (blk_queue_is_po2_zone_emu(ns->queue))
+		zone_size = ns->zsze_po2;
+	else
+		zone_size = ns->zsze;
+
+	return zone_size;
+}
+
 int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 {
 	struct nvme_effects_log *log = ns->head->effects;
@@ -122,7 +136,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 				   sizeof(struct nvme_zone_descriptor);
 
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) / ns->zsze);
+			 get_capacity(ns->disk) / nvme_zone_size(ns));
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -147,6 +161,8 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
 				 void *data)
 {
 	struct blk_zone zone = { };
+	u64 zone_gap = 0;
+	u32 zone_idx;
 
 	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
 		dev_err(ns->ctrl->device, "invalid zone type %#x\n",
@@ -159,10 +175,19 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
 	zone.len = ns->zsze;
 	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
 	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
+
+	if (blk_queue_is_po2_zone_emu(ns->queue)) {
+		zone_idx = zone.start / zone.len;
+		zone_gap = zone_idx * ns->zsze_diff;
+		zone.start += zone_gap;
+		zone.len = ns->zsze_po2;
+	}
+
 	if (zone.cond == BLK_ZONE_COND_FULL)
 		zone.wp = zone.start + zone.len;
 	else
-		zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
+		zone.wp =
+			nvme_lba_to_sect(ns, le64_to_cpu(entry->wp)) + zone_gap;
 
 	return cb(&zone, idx, data);
 }
@@ -173,6 +198,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	struct nvme_zone_report *report;
 	struct nvme_command c = { };
 	int ret, zone_idx = 0;
+	u64 zone_size = nvme_zone_size(ns);
 	unsigned int nz, i;
 	size_t buflen;
 
@@ -190,7 +216,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	sector = rounddown(sector, zone_size);
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 
@@ -214,7 +240,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 			zone_idx++;
 		}
 
-		sector += ns->zsze * nz;
+		sector += zone_size * nz;
 	}
 
 	if (zone_idx > 0)
@@ -226,6 +252,32 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	return ret;
 }
 
+void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns)
+{
+	u32 nr_zones;
+	sector_t capacity;
+
+	if (is_power_of_2(ns->zsze))
+		return;
+
+	if (!get_capacity(ns->disk))
+		return;
+
+	blk_mq_freeze_queue(ns->disk->queue);
+
+	blk_queue_po2_zone_emu(ns->queue, 1);
+	ns->zsze_po2 = 1 << get_count_order_long(ns->zsze);
+	ns->zsze_diff = ns->zsze_po2 - ns->zsze;
+
+	nr_zones = get_capacity(ns->disk) / ns->zsze;
+	capacity = nr_zones * ns->zsze_po2;
+	set_capacity_and_notify(ns->disk, capacity);
+	ns->sect_to_lba = __nvme_sect_to_lba_po2;
+	ns->update_sector_append = nvme_update_sector_append_po2_zone_emu;
+
+	blk_mq_unfreeze_queue(ns->disk->queue);
+}
+
 blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *c, enum nvme_zone_mgmt_action action)
 {
@@ -239,5 +291,24 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
 	if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
 		c->zms.select_all = 1;
 
+	nvme_verify_sector_value(ns, req);
+	return BLK_STS_OK;
+}
+
+blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
+{
+	/*  The spec mentions that read from ZCAP until ZSZE shall behave
+	 *  like a deallocated block. Deallocated block reads are
+	 *  deterministic, hence we fill zero.
+	 * The spec does not clearly define the result for other operations.
+	 */
+	if (req_op(req) == REQ_OP_READ) {
+		zero_fill_bio(req->bio);
+		nvme_req(req)->status = NVME_SC_SUCCESS;
+	} else {
+		nvme_req(req)->status = NVME_SC_WRITE_FAULT;
+	}
+	blk_mq_set_request_complete(req);
+	nvme_complete_rq(req);
 	return BLK_STS_OK;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3a41d50b85d3..9ec59183efcd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -57,6 +57,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
 /* queue has elevator attached */
 #define RQF_ELV			((__force req_flags_t)(1 << 22))
+/* request to do IO in the emulated area with po2 zone emulation */
+#define RQF_ZONE_EMU_VIOLATION	((__force req_flags_t)(1 << 23))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
-- 
2.25.1



* [PATCH 5/6] null_blk: forward the sector value from null_handle_memory_backed
       [not found]   ` <CGME20220308165443eucas1p17e61670a5057f21a6c073711b284bfeb@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

This is a preparation patch to add support for power_of_2 emulation in
the null_blk driver.

Currently, the sector value from null_handle_memory_backed is not
forwarded to the lower layer functions null_handle_rq and
null_handle_bio; instead it is fetched again from the request or the
bio, respectively. This behaviour will not work when zone size emulation
is enabled.

Instead of fetching the sector value again from the request or bio, pass
the sector value down from null_handle_memory_backed to
null_handle_rq/bio.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 05b1120e6623..625a06bfa5ad 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1204,13 +1204,12 @@ static int null_transfer(struct nullb *nullb, struct page *page,
 	return err;
 }
 
-static int null_handle_rq(struct nullb_cmd *cmd)
+static int null_handle_rq(struct nullb_cmd *cmd, sector_t sector)
 {
 	struct request *rq = cmd->rq;
 	struct nullb *nullb = cmd->nq->dev->nullb;
 	int err;
 	unsigned int len;
-	sector_t sector = blk_rq_pos(rq);
 	struct req_iterator iter;
 	struct bio_vec bvec;
 
@@ -1231,13 +1230,12 @@ static int null_handle_rq(struct nullb_cmd *cmd)
 	return 0;
 }
 
-static int null_handle_bio(struct nullb_cmd *cmd)
+static int null_handle_bio(struct nullb_cmd *cmd, sector_t sector)
 {
 	struct bio *bio = cmd->bio;
 	struct nullb *nullb = cmd->nq->dev->nullb;
 	int err;
 	unsigned int len;
-	sector_t sector = bio->bi_iter.bi_sector;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 
@@ -1320,9 +1318,9 @@ static inline blk_status_t null_handle_memory_backed(struct nullb_cmd *cmd,
 		return null_handle_discard(dev, sector, nr_sectors);
 
 	if (dev->queue_mode == NULL_Q_BIO)
-		err = null_handle_bio(cmd);
+		err = null_handle_bio(cmd, sector);
 	else
-		err = null_handle_rq(cmd);
+		err = null_handle_rq(cmd, sector);
 
 	return errno_to_blk_status(err);
 }
-- 
2.25.1



* [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device
       [not found]   ` <CGME20220308165448eucas1p12c7c302a4b239db64b49d54cc3c1f0ac@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  2022-03-09  4:09       ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

power_of_2 (PO2) emulation support is added to the null_blk device to
measure performance.

Callbacks are added in the hot paths that require PO2-emulation-specific
behaviour, to reduce the impact on the existing path.

The power_of_2 emulation support is wired up for both the request and
the bio queue modes, and it is automatically enabled when the given zone
size is not a power_of_2.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c     |   8 +-
 drivers/block/null_blk/null_blk.h |  12 ++
 drivers/block/null_blk/zoned.c    | 203 ++++++++++++++++++++++++++----
 3 files changed, 196 insertions(+), 27 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 625a06bfa5ad..c926b59f2b17 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -210,7 +210,7 @@ MODULE_PARM_DESC(zoned, "Make device as a host-managed zoned block device. Defau
 
 static unsigned long g_zone_size = 256;
 module_param_named(zone_size, g_zone_size, ulong, S_IRUGO);
-MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Must be power-of-two: Default: 256");
+MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Default: 256");
 
 static unsigned long g_zone_capacity;
 module_param_named(zone_capacity, g_zone_capacity, ulong, 0444);
@@ -1772,11 +1772,13 @@ static const struct block_device_operations null_bio_ops = {
 	.owner		= THIS_MODULE,
 	.submit_bio	= null_submit_bio,
 	.report_zones	= null_report_zones,
+	.npo2_zone_setup = null_po2_zone_emu_setup,
 };
 
 static const struct block_device_operations null_rq_ops = {
 	.owner		= THIS_MODULE,
 	.report_zones	= null_report_zones,
+	.npo2_zone_setup = null_po2_zone_emu_setup,
 };
 
 static int setup_commands(struct nullb_queue *nq)
@@ -1929,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
 		dev->mbps = 0;
 
 	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	    (!dev->zone_size)) {
+		pr_err("zone_size must not be zero\n");
 		return -EINVAL;
 	}
 
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 78eb56b0ca55..34c1b7b2546b 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -74,6 +74,16 @@ struct nullb_device {
 	unsigned int imp_close_zone_no;
 	struct nullb_zone *zones;
 	sector_t zone_size_sects;
+	sector_t zone_size_po2_sects;
+	sector_t zone_size_diff_sects;
+	/* The callbacks below are used as hook to perform po2 emulation for a
+	 * zoned device.
+	 */
+	unsigned int (*zone_no)(struct nullb_device *dev,
+				sector_t sect);
+	sector_t (*zone_update_sector)(struct nullb_device *dev, sector_t sect);
+	sector_t (*zone_update_sector_append)(struct nullb_device *dev,
+					      sector_t sect);
 	bool need_zone_res_mgmt;
 	spinlock_t zone_res_lock;
 
@@ -137,6 +147,7 @@ int null_register_zoned_dev(struct nullb *nullb);
 void null_free_zoned_dev(struct nullb_device *dev);
 int null_report_zones(struct gendisk *disk, sector_t sector,
 		      unsigned int nr_zones, report_zones_cb cb, void *data);
+void null_po2_zone_emu_setup(struct gendisk *disk);
 blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd,
 				    enum req_opf op, sector_t sector,
 				    sector_t nr_sectors);
@@ -166,5 +177,6 @@ static inline size_t null_zone_valid_read_len(struct nullb *nullb,
 	return len;
 }
 #define null_report_zones	NULL
+#define null_po2_zone_emu_setup	NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
 #endif /* __NULL_BLK_H */
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index dae54dd1aeac..3bb63c170149 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -16,6 +16,44 @@ static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 	return sect >> ilog2(dev->zone_size_sects);
 }
 
+static inline unsigned int null_npo2_zone_no(struct nullb_device *dev,
+					     sector_t sect)
+{
+	return sect / dev->zone_size_sects;
+}
+
+static inline bool null_is_po2_zone_emu(struct nullb_device *dev)
+{
+	return !!dev->zone_size_diff_sects;
+}
+
+static inline sector_t null_zone_update_sector_noop(struct nullb_device *dev,
+						    sector_t sect)
+{
+	return sect;
+}
+
+static inline sector_t null_zone_update_sector_po2_emu(struct nullb_device *dev,
+						       sector_t sect)
+{
+	sector_t zsze_po2 = dev->zone_size_po2_sects;
+	sector_t zsze_diff = dev->zone_size_diff_sects;
+	u32 zone_idx = sect >> ilog2(zsze_po2);
+
+	sect = sect - (zone_idx * zsze_diff);
+	return sect;
+}
+
+static inline sector_t
+null_zone_update_sector_append_po2_emu(struct nullb_device *dev, sector_t sect)
+{
+	/* Need to readjust the sector if po2 emulation is used. */
+	u32 zone_no = dev->zone_no(dev, sect);
+
+	sect += dev->zone_size_diff_sects * zone_no;
+	return sect;
+}
+
 static inline void null_lock_zone_res(struct nullb_device *dev)
 {
 	if (dev->need_zone_res_mgmt)
@@ -62,15 +100,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	sector_t sector = 0;
 	unsigned int i;
 
-	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("zone_size must be power-of-two\n");
-		return -EINVAL;
-	}
 	if (dev->zone_size > dev->size) {
 		pr_err("Zone size larger than device capacity\n");
 		return -EINVAL;
 	}
 
+	if (!is_power_of_2(dev->zone_size))
+		pr_info("zone_size is not power-of-two. power-of-two emulation is enabled");
+
 	if (!dev->zone_capacity)
 		dev->zone_capacity = dev->zone_size;
 
@@ -83,8 +120,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
-	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
-		>> ilog2(dev->zone_size_sects);
+
+	dev->nr_zones = roundup(dev_capacity_sects, dev->zone_size_sects) /
+			dev->zone_size_sects;
+
+	dev->zone_no = null_zone_no;
+	dev->zone_update_sector = null_zone_update_sector_noop;
+	dev->zone_update_sector_append = null_zone_update_sector_noop;
+
 
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
@@ -166,7 +209,13 @@ int null_register_zoned_dev(struct nullb *nullb)
 		if (ret)
 			return ret;
 	} else {
-		blk_queue_chunk_sectors(q, dev->zone_size_sects);
+		nullb->disk->fops->npo2_zone_setup(nullb->disk);
+
+		if (null_is_po2_zone_emu(dev))
+			blk_queue_chunk_sectors(q, dev->zone_size_po2_sects);
+		else
+			blk_queue_chunk_sectors(q, dev->zone_size_sects);
+
 		q->nr_zones = blkdev_nr_zones(nullb->disk);
 	}
 
@@ -183,17 +232,49 @@ void null_free_zoned_dev(struct nullb_device *dev)
 	dev->zones = NULL;
 }
 
+static void null_update_zone_info(struct nullb *nullb, struct blk_zone *blkz,
+				  struct nullb_zone *zone)
+{
+	unsigned int zone_idx;
+	u64 zone_gap;
+	struct nullb_device *dev = nullb->dev;
+
+	if (null_is_po2_zone_emu(dev)) {
+		zone_idx = zone->start / zone->len;
+		zone_gap = zone_idx * dev->zone_size_diff_sects;
+
+		blkz->start = zone->start + zone_gap;
+		blkz->len = dev->zone_size_po2_sects;
+		blkz->wp = zone->wp + zone_gap;
+	} else {
+		blkz->start = zone->start;
+		blkz->len = zone->len;
+		blkz->wp = zone->wp;
+	}
+
+	blkz->type = zone->type;
+	blkz->cond = zone->cond;
+	blkz->capacity = zone->capacity;
+}
+
 int null_report_zones(struct gendisk *disk, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data)
 {
 	struct nullb *nullb = disk->private_data;
 	struct nullb_device *dev = nullb->dev;
 	unsigned int first_zone, i;
+	sector_t zone_size;
 	struct nullb_zone *zone;
 	struct blk_zone blkz;
 	int error;
 
-	first_zone = null_zone_no(dev, sector);
+	if (null_is_po2_zone_emu(dev))
+		zone_size = dev->zone_size_po2_sects;
+	else
+		zone_size = dev->zone_size_sects;
+
+	first_zone = sector / zone_size;
+
 	if (first_zone >= dev->nr_zones)
 		return 0;
 
@@ -210,12 +291,7 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
 		 * array.
 		 */
 		null_lock_zone(dev, zone);
-		blkz.start = zone->start;
-		blkz.len = zone->len;
-		blkz.wp = zone->wp;
-		blkz.type = zone->type;
-		blkz.cond = zone->cond;
-		blkz.capacity = zone->capacity;
+		null_update_zone_info(nullb, &blkz, zone);
 		null_unlock_zone(dev, zone);
 
 		error = cb(&blkz, i, data);
@@ -226,6 +302,35 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
 	return nr_zones;
 }
 
+void null_po2_zone_emu_setup(struct gendisk *disk)
+{
+	struct nullb *nullb = disk->private_data;
+	struct nullb_device *dev = nullb->dev;
+	sector_t capacity;
+
+	if (is_power_of_2(dev->zone_size_sects))
+		return;
+
+	if (!get_capacity(disk))
+		return;
+
+	blk_mq_freeze_queue(disk->queue);
+
+	blk_queue_po2_zone_emu(disk->queue, 1);
+	dev->zone_size_po2_sects =
+		1 << get_count_order_long(dev->zone_size_sects);
+	dev->zone_size_diff_sects =
+		dev->zone_size_po2_sects - dev->zone_size_sects;
+	dev->zone_no = null_npo2_zone_no;
+	dev->zone_update_sector = null_zone_update_sector_po2_emu;
+	dev->zone_update_sector_append = null_zone_update_sector_append_po2_emu;
+
+	capacity = dev->nr_zones * dev->zone_size_po2_sects;
+	set_capacity(disk, capacity);
+
+	blk_mq_unfreeze_queue(disk->queue);
+}
+
 /*
  * This is called in the case of memory backing from null_process_cmd()
  * with the target zone already locked.
@@ -234,7 +339,7 @@ size_t null_zone_valid_read_len(struct nullb *nullb,
 				sector_t sector, unsigned int len)
 {
 	struct nullb_device *dev = nullb->dev;
-	struct nullb_zone *zone = &dev->zones[null_zone_no(dev, sector)];
+	struct nullb_zone *zone = &dev->zones[dev->zone_no(dev, sector)];
 	unsigned int nr_sectors = len >> SECTOR_SHIFT;
 
 	/* Read must be below the write pointer position */
@@ -363,11 +468,24 @@ static blk_status_t null_check_zone_resources(struct nullb_device *dev,
 	}
 }
 
+static void null_update_sector_append(struct nullb_cmd *cmd, sector_t sector)
+{
+	struct nullb_device *dev = cmd->nq->dev;
+
+	sector = dev->zone_update_sector_append(dev, sector);
+
+	if (cmd->bio) {
+		cmd->bio->bi_iter.bi_sector = sector;
+	} else {
+		cmd->rq->__sector = sector;
+	}
+}
+
 static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 				    unsigned int nr_sectors, bool append)
 {
 	struct nullb_device *dev = cmd->nq->dev;
-	unsigned int zno = null_zone_no(dev, sector);
+	unsigned int zno = dev->zone_no(dev, sector);
 	struct nullb_zone *zone = &dev->zones[zno];
 	blk_status_t ret;
 
@@ -395,10 +513,7 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 	 */
 	if (append) {
 		sector = zone->wp;
-		if (cmd->bio)
-			cmd->bio->bi_iter.bi_sector = sector;
-		else
-			cmd->rq->__sector = sector;
+		null_update_sector_append(cmd, sector);
 	} else if (sector != zone->wp) {
 		ret = BLK_STS_IOERR;
 		goto unlock;
@@ -619,7 +734,7 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_opf op,
 		return BLK_STS_OK;
 	}
 
-	zone_no = null_zone_no(dev, sector);
+	zone_no = dev->zone_no(dev, sector);
 	zone = &dev->zones[zone_no];
 
 	null_lock_zone(dev, zone);
@@ -650,13 +765,54 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_opf op,
 	return ret;
 }
 
+static blk_status_t null_handle_po2_zone_emu_violation(struct nullb_cmd *cmd,
+						       enum req_opf op)
+{
+	if (op == REQ_OP_READ) {
+		if (cmd->bio)
+			zero_fill_bio(cmd->bio);
+		else
+			zero_fill_bio(cmd->rq->bio);
+
+		return BLK_STS_OK;
+	} else {
+		return BLK_STS_IOERR;
+	}
+}
+
+static bool null_verify_sector_violation(struct nullb_device *dev,
+					 sector_t sector)
+{
+	if (unlikely(null_is_po2_zone_emu(dev))) {
+		/* The zone idx should be calculated based on the emulated
+		 * layout
+		 */
+		u32 zone_idx = sector >> ilog2(dev->zone_size_po2_sects);
+		sector_t zsze = dev->zone_size_sects;
+		sector_t sect = null_zone_update_sector_po2_emu(dev, sector);
+
+		if (sect - (zone_idx * zsze) > zsze)
+			return true;
+	}
+	return false;
+}
+
 blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
 				    sector_t sector, sector_t nr_sectors)
 {
-	struct nullb_device *dev;
+	struct nullb_device *dev = cmd->nq->dev;
 	struct nullb_zone *zone;
 	blk_status_t sts;
 
+	/* Handle when the sector falls in the emulated area */
+	if (unlikely(null_verify_sector_violation(dev, sector)))
+		return null_handle_po2_zone_emu_violation(cmd, op);
+
+	/* The sector value is updated if po2 emulation is enabled, else it
+	 * will have no effect on the value
+	 */
+	sector = dev->zone_update_sector(dev, sector);
+
 	switch (op) {
 	case REQ_OP_WRITE:
 		return null_zone_write(cmd, sector, nr_sectors, false);
@@ -669,8 +825,7 @@ blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
 	case REQ_OP_ZONE_FINISH:
 		return null_zone_mgmt(cmd, op, sector);
 	default:
-		dev = cmd->nq->dev;
-		zone = &dev->zones[null_zone_no(dev, sector)];
+		zone = &dev->zones[dev->zone_no(dev, sector)];
 
 		null_lock_zone(dev, zone);
 		sts = null_process_cmd(cmd, op, sector, nr_sectors);
-- 
2.25.1



* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
@ 2022-03-08 17:14       ` Keith Busch
  2022-03-08 17:43         ` Pankaj Raghav
  2022-03-09  3:40       ` Damien Le Moal
  2022-03-09  3:44       ` Damien Le Moal
  2 siblings, 1 reply; 83+ messages in thread
From: Keith Busch @ 2022-03-08 17:14 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Damien Le Moal, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Tue, Mar 08, 2022 at 05:53:44PM +0100, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/zns.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..ad02c61c0b52 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  	}
>  
>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> -	if (!is_power_of_2(ns->zsze)) {
> -		dev_warn(ns->ctrl->device,
> -			"invalid zone size:%llu for namespace:%u\n",
> -			ns->zsze, ns->head->ns_id);
> -		status = -ENODEV;
> -		goto free_data;
> -	}
>  
>  	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +			 get_capacity(ns->disk) / ns->zsze);
>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);
> -- 

The ZNS report zones code realigns the starting sector using an expected
pow2 value, so I think you need to update that as well with something
like the following:

@@ -197,7 +189,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	sector = sector - sector % ns->zsze;
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 


* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-08 17:14       ` Keith Busch
@ 2022-03-08 17:43         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 17:43 UTC (permalink / raw)
  To: Keith Busch
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Damien Le Moal, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme



On 2022-03-08 18:14, Keith Busch wrote:
> The zns report zones realigns the starting sector using an expected pow2
> value, so I think you need to update that as well with something like
> the following:
> 
> @@ -197,7 +189,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>  	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>  
> -	sector &= ~(ns->zsze - 1);
> +	sector = sector - sector % ns->zsze;
>  	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>  		memset(report, 0, buflen);
>  

I actually have these changes in Patch 4/6:
-	sector &= ~(ns->zsze - 1);
+	sector = rounddown(sector, zone_size);

But you are right, I should move those changes to this patch, as this patch
removes the po2 assumptions from the NVMe ZNS driver.

I will fix it up in the next revision. Thanks.

-- 
Regards,
Pankaj


* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
  2022-03-08 17:14       ` Keith Busch
@ 2022-03-09  3:40       ` Damien Le Moal
  2022-03-09 13:19         ` Pankaj Raghav
  2022-03-09  3:44       ` Damien Le Moal
  2 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  3:40 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/zns.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..ad02c61c0b52 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  	}
>  
>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> -	if (!is_power_of_2(ns->zsze)) {
> -		dev_warn(ns->ctrl->device,
> -			"invalid zone size:%llu for namespace:%u\n",
> -			ns->zsze, ns->head->ns_id);
> -		status = -ENODEV;
> -		goto free_data;
> -	}
>  
>  	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +			 get_capacity(ns->disk) / ns->zsze);

This will not compile on 32-bit arches. This needs to use div64_u64().
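
(A minimal sketch of what that could look like, assuming div64_u64() from
<linux/math64.h>; the actual fix may of course differ:)

	nr_zones = min_t(unsigned int, nr_zones,
			 div64_u64(get_capacity(ns->disk), ns->zsze));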

>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
  2022-03-08 17:14       ` Keith Busch
  2022-03-09  3:40       ` Damien Le Moal
@ 2022-03-09  3:44       ` Damien Le Moal
  2022-03-09 13:35         ` Pankaj Raghav
  2 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  3:44 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/zns.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..ad02c61c0b52 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  	}
>  
>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> -	if (!is_power_of_2(ns->zsze)) {
> -		dev_warn(ns->ctrl->device,
> -			"invalid zone size:%llu for namespace:%u\n",
> -			ns->zsze, ns->head->ns_id);
> -		status = -ENODEV;
> -		goto free_data;
> -	}

Doing this will allow a non power of 2 zone sized device to be seen by
the block layer. This will break functions such as blkdev_nr_zones(), but
this patch does not change these functions, nor the others that use bit
shift calculations.
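
(For reference, a rough sketch of the shift-based math in blkdev_nr_zones()
that assumes a power-of-2 zone size; the exact kernel code may differ slightly:)

	/* only valid when zone_sectors is a power of 2 */
	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);

	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);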

>  
>  	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +			 get_capacity(ns->disk) / ns->zsze);
>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops
  2022-03-08 16:53     ` [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops Pankaj Raghav
@ 2022-03-09  3:46       ` Damien Le Moal
  2022-03-09 14:02         ` Pankaj Raghav
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  3:46 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> A new fops is added to the block device which will be used to set up the
> necessary hooks when a non-power_of_2 zone size is detected in a zoned
> device.
> 
> This fops will be called as a part of blk_revalidate_disk_zones.

And what does this new hook do ? You are actually not explaining it, nor
why it should be called from blk_revalidate_disk_zones().

Also, blk_revalidate_zone_cb() uses a bit shift, but neither this patch nor
the previous one fixes that.

> 
> The primary use case for this callback is to deal with zoned devices
> that do not have a power_of_2 zone size, such as ZNS drives. For example,
> the current NVMe ZNS specification does not require the zone size to be
> power_of_2, but the linux block layer still expects all zoned devices to
> have a power_of_2 zone size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  block/blk-zoned.c      | 3 +++
>  include/linux/blkdev.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 602bef54c813..d3d821797559 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -575,6 +575,9 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>  	if (!get_capacity(disk))
>  		return -EIO;
>  
> +	if (disk->fops->npo2_zone_setup)
> +		disk->fops->npo2_zone_setup(disk);
> +
>  	/*
>  	 * Ensure that all memory allocations in this context are done as if
>  	 * GFP_NOIO was specified.
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index a12c031af887..08cf039c1622 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1472,6 +1472,7 @@ struct block_device_operations {
>  	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
>  	int (*report_zones)(struct gendisk *, sector_t sector,
>  			unsigned int nr_zones, report_zones_cb cb, void *data);
> +	void (*npo2_zone_setup)(struct gendisk *disk);
>  	char *(*devnode)(struct gendisk *disk, umode_t *mode);
>  	/* returns the length of the identifier or a negative errno: */
>  	int (*get_unique_id)(struct gendisk *disk, u8 id[16],


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-08 16:53     ` [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices Pankaj Raghav
@ 2022-03-09  4:04       ` Damien Le Moal
  2022-03-09 14:33         ` Pankaj Raghav
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  4:04 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> power_of_2(PO2) is not a requirement as per the NVMe ZNS specification,
> however we still have in place a lot of assumptions
> for PO2 in the block layer, in filesystems such as F2FS and btrfs, and in
> userspace applications.
> 
> So in keeping with these requirements, provide an emulation layer to
> non-power_of_2 zone devices and which does not create a performance
> regression to existing zone storage devices which have PO2 zone sizes.
> Callbacks are provided where needed in the hot paths to reduce the
> impact of performance regression.

Contradiction: reducing the impact of performance regression is not the
same as "does not create a performance regression". So which one is it ?
Please add performance numbers to this commit message.

And also please actually explain what the patch is changing. This commit
message is about the why, but nothing on the how.

> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/core.c |  28 +++++++----
>  drivers/nvme/host/nvme.h | 100 ++++++++++++++++++++++++++++++++++++++-
>  drivers/nvme/host/pci.c  |   4 ++
>  drivers/nvme/host/zns.c  |  79 +++++++++++++++++++++++++++++--
>  include/linux/blk-mq.h   |   2 +
>  5 files changed, 200 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fd4720d37cc0..c7180d512b08 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -327,14 +327,6 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
>  	return RETRY;
>  }
>  
> -static inline void nvme_end_req_zoned(struct request *req)
> -{
> -	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> -	    req_op(req) == REQ_OP_ZONE_APPEND)
> -		req->__sector = nvme_lba_to_sect(req->q->queuedata,
> -			le64_to_cpu(nvme_req(req)->result.u64));
> -}
> -
>  static inline void nvme_end_req(struct request *req)
>  {
>  	blk_status_t status = nvme_error_status(nvme_req(req)->status);
> @@ -676,6 +668,12 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
>  }
>  EXPORT_SYMBOL_GPL(nvme_fail_nonready_command);
>  
> +blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req)
> +{
> +	return nvme_zone_handle_po2_emu_violation(req);
> +}
> +EXPORT_SYMBOL_GPL(nvme_fail_po2_zone_emu_violation);
> +

This should go in zns.c, not in the core.

>  bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>  		bool queue_live)
>  {
> @@ -879,6 +877,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>  	req->special_vec.bv_offset = offset_in_page(range);
>  	req->special_vec.bv_len = alloc_size;
>  	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +	nvme_verify_sector_value(ns, req);
>  
>  	return BLK_STS_OK;
>  }
> @@ -909,6 +908,7 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
>  			break;
>  		}
>  	}
> +	nvme_verify_sector_value(ns, req);
>  
>  	return BLK_STS_OK;
>  }
> @@ -973,6 +973,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>  
>  	cmnd->rw.control = cpu_to_le16(control);
>  	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
> +	nvme_verify_sector_value(ns, req);
>  	return 0;
>  }
>  
> @@ -2105,8 +2106,14 @@ static int nvme_report_zones(struct gendisk *disk, sector_t sector,
>  	return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
>  			data);
>  }
> +
> +static void nvme_npo2_zone_setup(struct gendisk *disk)
> +{
> +	nvme_ns_po2_zone_emu_setup(disk->private_data);
> +}

This helper seems useless.

>  #else
>  #define nvme_report_zones	NULL
> +#define nvme_npo2_zone_setup	NULL
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  
>  static const struct block_device_operations nvme_bdev_ops = {
> @@ -2116,6 +2123,7 @@ static const struct block_device_operations nvme_bdev_ops = {
>  	.release	= nvme_release,
>  	.getgeo		= nvme_getgeo,
>  	.report_zones	= nvme_report_zones,
> +	.npo2_zone_setup = nvme_npo2_zone_setup,
>  	.pr_ops		= &nvme_pr_ops,
>  };
>  
> @@ -3844,6 +3852,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
>  	ns->disk = disk;
>  	ns->queue = disk->queue;
>  
> +#ifdef CONFIG_BLK_DEV_ZONED
> +	ns->sect_to_lba = __nvme_sect_to_lba;
> +	ns->update_sector_append = nvme_update_sector_append_noop;
> +#endif
>  	if (ctrl->opts && ctrl->opts->data_digest)
>  		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
>  
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index a162f6c6da6e..f584f760e8cc 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -457,6 +457,10 @@ struct nvme_ns {
>  	u8 pi_type;
>  #ifdef CONFIG_BLK_DEV_ZONED
>  	u64 zsze;
> +	u64 zsze_po2;
> +	u32 zsze_diff;
> +	u64 (*sect_to_lba)(struct nvme_ns *ns, sector_t sector);
> +	sector_t (*update_sector_append)(struct nvme_ns *ns, sector_t sector);
>  #endif
>  	unsigned long features;
>  	unsigned long flags;
> @@ -562,12 +566,21 @@ static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
>  	return ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
>  }
>  
> +static inline u64 __nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
> +{
> +	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> +}
> +
>  /*
>   * Convert a 512B sector number to a device logical block number.
>   */
>  static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
>  {
> -	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> +#ifdef CONFIG_BLK_DEV_ZONED
> +	return ns->sect_to_lba(ns, sector);

So for a power of 2 zone sized device, you are forcing an indirect call,
always. Not acceptable. What is the point of that po2_zone_emu boolean
you added to the queue ?

> +#else
> +	return __nvme_sect_to_lba(ns, sector);

This helper is useless.

> +#endif
>  }
>  
>  /*
> @@ -578,6 +591,83 @@ static inline sector_t nvme_lba_to_sect(struct nvme_ns *ns, u64 lba)
>  	return lba << (ns->lba_shift - SECTOR_SHIFT);
>  }
>  
> +#ifdef CONFIG_BLK_DEV_ZONED
> +static inline u64 __nvme_sect_to_lba_po2(struct nvme_ns *ns, sector_t sector)
> +{
> +	sector_t zone_idx = sector >> ilog2(ns->zsze_po2);
> +
> +	sector = sector - (zone_idx * ns->zsze_diff);
> +
> +	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> +}
> +
> +static inline sector_t
> +nvme_update_sector_append_po2_zone_emu(struct nvme_ns *ns, sector_t sector)
> +{
> +	/* The sector value passed by the drive after a append operation is the
> +	 * based on the actual zone layout in the device, hence, use the actual
> +	 * zone_size to calculate the zone number from the sector.
> +	 */
> +	u32 zone_no = sector / ns->zsze;
> +
> +	sector += ns->zsze_diff * zone_no;
> +	return sector;
> +}
> +
> +static inline sector_t nvme_update_sector_append_noop(struct nvme_ns *ns,
> +						      sector_t sector)
> +{
> +	return sector;
> +}
> +
> +static inline void nvme_end_req_zoned(struct request *req)
> +{
> +	if (req_op(req) == REQ_OP_ZONE_APPEND) {
> +		struct nvme_ns *ns = req->q->queuedata;
> +		sector_t sector;
> +
> +		sector = nvme_lba_to_sect(ns,
> +					  le64_to_cpu(nvme_req(req)->result.u64));
> +
> +		sector = ns->update_sector_append(ns, sector);

Why not assign that to req->__sector directly ?
And again here, you are forcing the indirect function call for *all* zns
devices, even those that have a power of 2 zone size.

> +
> +		req->__sector = sector;
> +	}
> +}
> +
> +static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
> +{
> +	if (unlikely(blk_queue_is_po2_zone_emu(ns->queue))) {
> +		sector_t sector = blk_rq_pos(req);
> +		sector_t zone_idx = sector >> ilog2(ns->zsze_po2);
> +		sector_t device_sector = sector - (zone_idx * ns->zsze_diff);
> +
> +		/* Check if the IO is in the emulated area */
> +		if (device_sector - (zone_idx * ns->zsze) > ns->zsze)
> +			req->rq_flags |= RQF_ZONE_EMU_VIOLATION;
> +	}
> +}
> +
> +static inline bool nvme_po2_zone_emu_violation(struct request *req)
> +{
> +	return req->rq_flags & RQF_ZONE_EMU_VIOLATION;
> +}

This helper makes the code unreadable in my opinion.

> +#else
> +static inline void nvme_end_req_zoned(struct request *req)
> +{
> +}
> +
> +static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
> +{
> +}
> +
> +static inline bool nvme_po2_zone_emu_violation(struct request *req)
> +{
> +	return false;
> +}
> +
> +#endif
> +
>  /*
>   * Convert byte length to nvme's 0-based num dwords
>   */
> @@ -752,6 +842,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>  long nvme_dev_ioctl(struct file *file, unsigned int cmd,
>  		unsigned long arg);
>  int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
> +blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req);
>  
>  extern const struct attribute_group *nvme_ns_id_attr_groups[];
>  extern const struct pr_ops nvme_pr_ops;
> @@ -873,11 +964,13 @@ static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>  int nvme_revalidate_zones(struct nvme_ns *ns);
>  int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  		unsigned int nr_zones, report_zones_cb cb, void *data);
> +void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns);
>  #ifdef CONFIG_BLK_DEV_ZONED
>  int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf);
>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>  				       struct nvme_command *cmnd,
>  				       enum nvme_zone_mgmt_action action);
> +blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req);
>  #else
>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>  		struct request *req, struct nvme_command *cmnd,
> @@ -892,6 +985,11 @@ static inline int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  		 "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
>  	return -EPROTONOSUPPORT;
>  }
> +
> +static inline blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
> +{
> +	return BLK_STS_OK;
> +}
>  #endif
>  
>  static inline struct nvme_ns *nvme_get_ns_from_dev(struct device *dev)
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6a99ed680915..fc022df3f98e 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -960,6 +960,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		return nvme_fail_nonready_command(&dev->ctrl, req);
>  
>  	ret = nvme_prep_rq(dev, req);
> +
> +	if (unlikely(nvme_po2_zone_emu_violation(req)))
> +		return nvme_fail_po2_zone_emu_violation(req);
> +
>  	if (unlikely(ret))
>  		return ret;
>  	spin_lock(&nvmeq->sq_lock);
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index ad02c61c0b52..25516a5ae7e2 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -3,7 +3,9 @@
>   * Copyright (C) 2020 Western Digital Corporation or its affiliates.
>   */
>  
> +#include <linux/log2.h>
>  #include <linux/blkdev.h>
> +#include <linux/math.h>
>  #include <linux/vmalloc.h>
>  #include "nvme.h"
>  
> @@ -46,6 +48,18 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>  	return 0;
>  }
>  
> +static sector_t nvme_zone_size(struct nvme_ns *ns)
> +{
> +	sector_t zone_size;
> +
> +	if (blk_queue_is_po2_zone_emu(ns->queue))
> +		zone_size = ns->zsze_po2;
> +	else
> +		zone_size = ns->zsze;
> +
> +	return zone_size;
> +}
> +
>  int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  {
>  	struct nvme_effects_log *log = ns->head->effects;
> @@ -122,7 +136,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) / ns->zsze);
> +			 get_capacity(ns->disk) / nvme_zone_size(ns));
>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);
> @@ -147,6 +161,8 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>  				 void *data)
>  {
>  	struct blk_zone zone = { };
> +	u64 zone_gap = 0;
> +	u32 zone_idx;
>  
>  	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
>  		dev_err(ns->ctrl->device, "invalid zone type %#x\n",
> @@ -159,10 +175,19 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>  	zone.len = ns->zsze;
>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
> +
> +	if (blk_queue_is_po2_zone_emu(ns->queue)) {
> +		zone_idx = zone.start / zone.len;
> +		zone_gap = zone_idx * ns->zsze_diff;
> +		zone.start += zone_gap;
> +		zone.len = ns->zsze_po2;
> +	}
> +
>  	if (zone.cond == BLK_ZONE_COND_FULL)
>  		zone.wp = zone.start + zone.len;
>  	else
> -		zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
> +		zone.wp =
> +			nvme_lba_to_sect(ns, le64_to_cpu(entry->wp)) + zone_gap;
>  
>  	return cb(&zone, idx, data);
>  }
> @@ -173,6 +198,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	struct nvme_zone_report *report;
>  	struct nvme_command c = { };
>  	int ret, zone_idx = 0;
> +	u64 zone_size = nvme_zone_size(ns);
>  	unsigned int nz, i;
>  	size_t buflen;
>  
> @@ -190,7 +216,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>  	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>  
> -	sector &= ~(ns->zsze - 1);
> +	sector = rounddown(sector, zone_size);
>  	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>  		memset(report, 0, buflen);
>  
> @@ -214,7 +240,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  			zone_idx++;
>  		}
>  
> -		sector += ns->zsze * nz;
> +		sector += zone_size * nz;
>  	}
>  
>  	if (zone_idx > 0)
> @@ -226,6 +252,32 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	return ret;
>  }
>  
> +void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns)
> +{
> +	u32 nr_zones;
> +	sector_t capacity;
> +
> +	if (is_power_of_2(ns->zsze))
> +		return;
> +
> +	if (!get_capacity(ns->disk))
> +		return;
> +
> +	blk_mq_freeze_queue(ns->disk->queue);
> +
> +	blk_queue_po2_zone_emu(ns->queue, 1);
> +	ns->zsze_po2 = 1 << get_count_order_long(ns->zsze);
> +	ns->zsze_diff = ns->zsze_po2 - ns->zsze;
> +
> +	nr_zones = get_capacity(ns->disk) / ns->zsze;
> +	capacity = nr_zones * ns->zsze_po2;
> +	set_capacity_and_notify(ns->disk, capacity);
> +	ns->sect_to_lba = __nvme_sect_to_lba_po2;
> +	ns->update_sector_append = nvme_update_sector_append_po2_zone_emu;
> +
> +	blk_mq_unfreeze_queue(ns->disk->queue);
> +}
> +
>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>  {
> @@ -239,5 +291,24 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>  	if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
>  		c->zms.select_all = 1;
>  
> +	nvme_verify_sector_value(ns, req);
> +	return BLK_STS_OK;
> +}
> +
> +blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
> +{
> +	/*  The spec mentions that read from ZCAP until ZSZE shall behave
> +	 *  like a deallocated block. Deallocated block reads are
> +	 *  deterministic, hence we fill zero.
> +	 * The spec does not clearly define the result for other opreation.
> +	 */

Comment style and indentation is weird.
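
(For reference, the conventional kernel multi-line comment layout would be
along the lines of:)

	/*
	 * The spec says that reads from ZCAP up to ZSZE behave like reads of
	 * deallocated blocks, which are deterministic, hence the zero fill.
	 * The spec does not clearly define the result of other operations.
	 */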

> +	if (req_op(req) == REQ_OP_READ) {
> +		zero_fill_bio(req->bio);
> +		nvme_req(req)->status = NVME_SC_SUCCESS;
> +	} else {
> +		nvme_req(req)->status = NVME_SC_WRITE_FAULT;
> +	}

What about requests that straddle the zone capacity ? They need to be
partially zeroed too, otherwise data from the next zone may be exposed.
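
(A rough sketch of the boundary arithmetic that would be needed, using the
fields added by this patch; only the check is shown, the partial zeroing
itself is left out:)

	/* start of this emulated zone and of its emulated gap, in 512B sectors */
	sector_t zone_start = zone_idx * ns->zsze_po2;
	sector_t gap_start  = zone_start + ns->zsze;
	sector_t req_start  = blk_rq_pos(req);
	sector_t req_end    = req_start + blk_rq_sectors(req);

	if (req_start < gap_start && req_end > gap_start) {
		/* the request straddles the gap: the part of the read in
		 * [gap_start, req_end) must be zeroed instead of being
		 * passed through unmodified */
	}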

> +	blk_mq_set_request_complete(req);
> +	nvme_complete_rq(req);
>  	return BLK_STS_OK;
>  }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 3a41d50b85d3..9ec59183efcd 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -57,6 +57,8 @@ typedef __u32 __bitwise req_flags_t;
>  #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
>  /* queue has elevator attached */
>  #define RQF_ELV			((__force req_flags_t)(1 << 22))
> +/* request to do IO in the emulated area with po2 zone emulation */
> +#define RQF_ZONE_EMU_VIOLATION	((__force req_flags_t)(1 << 23))
>  
>  /* flags that prevent us from merging requests: */
>  #define RQF_NOMERGE_FLAGS \


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device
  2022-03-08 16:53     ` [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device Pankaj Raghav
@ 2022-03-09  4:09       ` Damien Le Moal
  2022-03-09 14:42         ` Pankaj Raghav
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  4:09 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> power_of_2(PO2) emulation support is added to the null blk device to
> measure performance.
> 
> Callbacks are added in the hotpaths that require PO2 emulation specific
> behaviour to reduce the impact on the existing path.
> 
> The power_of_2 emulation support is wired up for both the request and
> the bio queue mode and it is automatically enabled when the given zone
> size is non-power_of_2.

This does not make any sense. Why would you want to add power of 2 zone
size emulation to nullblk ? Just set the zone size to be a power of 2...

If this is for test purpose, then use QEMU. These changes make no sense
to me here.

A change that would make sense in the context of this series is to allow
for setting a zoned null_blk device zone size to a non power of 2 size.
But this series does not actually deal with that. So do not touch this
driver please.

> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/block/null_blk/main.c     |   8 +-
>  drivers/block/null_blk/null_blk.h |  12 ++
>  drivers/block/null_blk/zoned.c    | 203 ++++++++++++++++++++++++++----
>  3 files changed, 196 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index 625a06bfa5ad..c926b59f2b17 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -210,7 +210,7 @@ MODULE_PARM_DESC(zoned, "Make device as a host-managed zoned block device. Defau
>  
>  static unsigned long g_zone_size = 256;
>  module_param_named(zone_size, g_zone_size, ulong, S_IRUGO);
> -MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Must be power-of-two: Default: 256");
> +MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Default: 256");
>  
>  static unsigned long g_zone_capacity;
>  module_param_named(zone_capacity, g_zone_capacity, ulong, 0444);
> @@ -1772,11 +1772,13 @@ static const struct block_device_operations null_bio_ops = {
>  	.owner		= THIS_MODULE,
>  	.submit_bio	= null_submit_bio,
>  	.report_zones	= null_report_zones,
> +	.npo2_zone_setup = null_po2_zone_emu_setup,
>  };
>  
>  static const struct block_device_operations null_rq_ops = {
>  	.owner		= THIS_MODULE,
>  	.report_zones	= null_report_zones,
> +	.npo2_zone_setup = null_po2_zone_emu_setup,
>  };
>  
>  static int setup_commands(struct nullb_queue *nq)
> @@ -1929,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
>  		dev->mbps = 0;
>  
>  	if (dev->zoned &&
> -	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
> -		pr_err("zone_size must be power-of-two\n");
> +	    (!dev->zone_size)) {
> +		pr_err("zone_size must be zero\n");
>  		return -EINVAL;
>  	}
>  
> diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
> index 78eb56b0ca55..34c1b7b2546b 100644
> --- a/drivers/block/null_blk/null_blk.h
> +++ b/drivers/block/null_blk/null_blk.h
> @@ -74,6 +74,16 @@ struct nullb_device {
>  	unsigned int imp_close_zone_no;
>  	struct nullb_zone *zones;
>  	sector_t zone_size_sects;
> +	sector_t zone_size_po2_sects;
> +	sector_t zone_size_diff_sects;
> +	/* The callbacks below are used as hook to perform po2 emulation for a
> +	 * zoned device.
> +	 */
> +	unsigned int (*zone_no)(struct nullb_device *dev,
> +				sector_t sect);
> +	sector_t (*zone_update_sector)(struct nullb_device *dev, sector_t sect);
> +	sector_t (*zone_update_sector_append)(struct nullb_device *dev,
> +					      sector_t sect);
>  	bool need_zone_res_mgmt;
>  	spinlock_t zone_res_lock;
>  
> @@ -137,6 +147,7 @@ int null_register_zoned_dev(struct nullb *nullb);
>  void null_free_zoned_dev(struct nullb_device *dev);
>  int null_report_zones(struct gendisk *disk, sector_t sector,
>  		      unsigned int nr_zones, report_zones_cb cb, void *data);
> +void null_po2_zone_emu_setup(struct gendisk *disk);
>  blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd,
>  				    enum req_opf op, sector_t sector,
>  				    sector_t nr_sectors);
> @@ -166,5 +177,6 @@ static inline size_t null_zone_valid_read_len(struct nullb *nullb,
>  	return len;
>  }
>  #define null_report_zones	NULL
> +#define null_po2_zone_emu_setup	NULL
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  #endif /* __NULL_BLK_H */
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index dae54dd1aeac..3bb63c170149 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -16,6 +16,44 @@ static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
>  	return sect >> ilog2(dev->zone_size_sects);
>  }
>  
> +static inline unsigned int null_npo2_zone_no(struct nullb_device *dev,
> +					     sector_t sect)
> +{
> +	return sect / dev->zone_size_sects;
> +}
> +
> +static inline bool null_is_po2_zone_emu(struct nullb_device *dev)
> +{
> +	return !!dev->zone_size_diff_sects;
> +}
> +
> +static inline sector_t null_zone_update_sector_noop(struct nullb_device *dev,
> +						    sector_t sect)
> +{
> +	return sect;
> +}
> +
> +static inline sector_t null_zone_update_sector_po2_emu(struct nullb_device *dev,
> +						       sector_t sect)
> +{
> +	sector_t zsze_po2 = dev->zone_size_po2_sects;
> +	sector_t zsze_diff = dev->zone_size_diff_sects;
> +	u32 zone_idx = sect >> ilog2(zsze_po2);
> +
> +	sect = sect - (zone_idx * zsze_diff);
> +	return sect;
> +}
> +
> +static inline sector_t
> +null_zone_update_sector_append_po2_emu(struct nullb_device *dev, sector_t sect)
> +{
> +	/* Need to readjust the sector if po2 emulation is used. */
> +	u32 zone_no = dev->zone_no(dev, sect);
> +
> +	sect += dev->zone_size_diff_sects * zone_no;
> +	return sect;
> +}
> +
>  static inline void null_lock_zone_res(struct nullb_device *dev)
>  {
>  	if (dev->need_zone_res_mgmt)
> @@ -62,15 +100,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	sector_t sector = 0;
>  	unsigned int i;
>  
> -	if (!is_power_of_2(dev->zone_size)) {
> -		pr_err("zone_size must be power-of-two\n");
> -		return -EINVAL;
> -	}
>  	if (dev->zone_size > dev->size) {
>  		pr_err("Zone size larger than device capacity\n");
>  		return -EINVAL;
>  	}
>  
> +	if (!is_power_of_2(dev->zone_size))
> +		pr_info("zone_size is not power-of-two. power-of-two emulation is enabled");
> +
>  	if (!dev->zone_capacity)
>  		dev->zone_capacity = dev->zone_size;
>  
> @@ -83,8 +120,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
>  	dev_capacity_sects = mb_to_sects(dev->size);
>  	dev->zone_size_sects = mb_to_sects(dev->zone_size);
> -	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
> -		>> ilog2(dev->zone_size_sects);
> +
> +	dev->nr_zones = roundup(dev_capacity_sects, dev->zone_size_sects) /
> +			dev->zone_size_sects;
> +
> +	dev->zone_no = null_zone_no;
> +	dev->zone_update_sector = null_zone_update_sector_noop;
> +	dev->zone_update_sector_append = null_zone_update_sector_noop;
> +
>  
>  	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
>  				    GFP_KERNEL | __GFP_ZERO);
> @@ -166,7 +209,13 @@ int null_register_zoned_dev(struct nullb *nullb)
>  		if (ret)
>  			return ret;
>  	} else {
> -		blk_queue_chunk_sectors(q, dev->zone_size_sects);
> +		nullb->disk->fops->npo2_zone_setup(nullb->disk);
> +
> +		if (null_is_po2_zone_emu(dev))
> +			blk_queue_chunk_sectors(q, dev->zone_size_po2_sects);
> +		else
> +			blk_queue_chunk_sectors(q, dev->zone_size_sects);
> +
>  		q->nr_zones = blkdev_nr_zones(nullb->disk);
>  	}
>  
> @@ -183,17 +232,49 @@ void null_free_zoned_dev(struct nullb_device *dev)
>  	dev->zones = NULL;
>  }
>  
> +static void null_update_zone_info(struct nullb *nullb, struct blk_zone *blkz,
> +				  struct nullb_zone *zone)
> +{
> +	unsigned int zone_idx;
> +	u64 zone_gap;
> +	struct nullb_device *dev = nullb->dev;
> +
> +	if (null_is_po2_zone_emu(dev)) {
> +		zone_idx = zone->start / zone->len;
> +		zone_gap = zone_idx * dev->zone_size_diff_sects;
> +
> +		blkz->start = zone->start + zone_gap;
> +		blkz->len = dev->zone_size_po2_sects;
> +		blkz->wp = zone->wp + zone_gap;
> +	} else {
> +		blkz->start = zone->start;
> +		blkz->len = zone->len;
> +		blkz->wp = zone->wp;
> +	}
> +
> +	blkz->type = zone->type;
> +	blkz->cond = zone->cond;
> +	blkz->capacity = zone->capacity;
> +}
> +
>  int null_report_zones(struct gendisk *disk, sector_t sector,
>  		unsigned int nr_zones, report_zones_cb cb, void *data)
>  {
>  	struct nullb *nullb = disk->private_data;
>  	struct nullb_device *dev = nullb->dev;
>  	unsigned int first_zone, i;
> +	sector_t zone_size;
>  	struct nullb_zone *zone;
>  	struct blk_zone blkz;
>  	int error;
>  
> -	first_zone = null_zone_no(dev, sector);
> +	if (null_is_po2_zone_emu(dev))
> +		zone_size = dev->zone_size_po2_sects;
> +	else
> +		zone_size = dev->zone_size_sects;
> +
> +	first_zone = sector / zone_size;
> +
>  	if (first_zone >= dev->nr_zones)
>  		return 0;
>  
> @@ -210,12 +291,7 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
>  		 * array.
>  		 */
>  		null_lock_zone(dev, zone);
> -		blkz.start = zone->start;
> -		blkz.len = zone->len;
> -		blkz.wp = zone->wp;
> -		blkz.type = zone->type;
> -		blkz.cond = zone->cond;
> -		blkz.capacity = zone->capacity;
> +		null_update_zone_info(nullb, &blkz, zone);
>  		null_unlock_zone(dev, zone);
>  
>  		error = cb(&blkz, i, data);
> @@ -226,6 +302,35 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
>  	return nr_zones;
>  }
>  
> +void null_po2_zone_emu_setup(struct gendisk *disk)
> +{
> +	struct nullb *nullb = disk->private_data;
> +	struct nullb_device *dev = nullb->dev;
> +	sector_t capacity;
> +
> +	if (is_power_of_2(dev->zone_size_sects))
> +		return;
> +
> +	if (!get_capacity(disk))
> +		return;
> +
> +	blk_mq_freeze_queue(disk->queue);
> +
> +	blk_queue_po2_zone_emu(disk->queue, 1);
> +	dev->zone_size_po2_sects =
> +		1 << get_count_order_long(dev->zone_size_sects);
> +	dev->zone_size_diff_sects =
> +		dev->zone_size_po2_sects - dev->zone_size_sects;
> +	dev->zone_no = null_npo2_zone_no;
> +	dev->zone_update_sector = null_zone_update_sector_po2_emu;
> +	dev->zone_update_sector_append = null_zone_update_sector_append_po2_emu;
> +
> +	capacity = dev->nr_zones * dev->zone_size_po2_sects;
> +	set_capacity(disk, capacity);
> +
> +	blk_mq_unfreeze_queue(disk->queue);
> +}
> +
>  /*
>   * This is called in the case of memory backing from null_process_cmd()
>   * with the target zone already locked.
> @@ -234,7 +339,7 @@ size_t null_zone_valid_read_len(struct nullb *nullb,
>  				sector_t sector, unsigned int len)
>  {
>  	struct nullb_device *dev = nullb->dev;
> -	struct nullb_zone *zone = &dev->zones[null_zone_no(dev, sector)];
> +	struct nullb_zone *zone = &dev->zones[dev->zone_no(dev, sector)];
>  	unsigned int nr_sectors = len >> SECTOR_SHIFT;
>  
>  	/* Read must be below the write pointer position */
> @@ -363,11 +468,24 @@ static blk_status_t null_check_zone_resources(struct nullb_device *dev,
>  	}
>  }
>  
> +static void null_update_sector_append(struct nullb_cmd *cmd, sector_t sector)
> +{
> +	struct nullb_device *dev = cmd->nq->dev;
> +
> +	sector = dev->zone_update_sector_append(dev, sector);
> +
> +	if (cmd->bio) {
> +		cmd->bio->bi_iter.bi_sector = sector;
> +	} else {
> +		cmd->rq->__sector = sector;
> +	}
> +}
> +
>  static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
>  				    unsigned int nr_sectors, bool append)
>  {
>  	struct nullb_device *dev = cmd->nq->dev;
> -	unsigned int zno = null_zone_no(dev, sector);
> +	unsigned int zno = dev->zone_no(dev, sector);
>  	struct nullb_zone *zone = &dev->zones[zno];
>  	blk_status_t ret;
>  
> @@ -395,10 +513,7 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
>  	 */
>  	if (append) {
>  		sector = zone->wp;
> -		if (cmd->bio)
> -			cmd->bio->bi_iter.bi_sector = sector;
> -		else
> -			cmd->rq->__sector = sector;
> +		null_update_sector_append(cmd, sector);
>  	} else if (sector != zone->wp) {
>  		ret = BLK_STS_IOERR;
>  		goto unlock;
> @@ -619,7 +734,7 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_opf op,
>  		return BLK_STS_OK;
>  	}
>  
> -	zone_no = null_zone_no(dev, sector);
> +	zone_no = dev->zone_no(dev, sector);
>  	zone = &dev->zones[zone_no];
>  
>  	null_lock_zone(dev, zone);
> @@ -650,13 +765,54 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_opf op,
>  	return ret;
>  }
>  
> +static blk_status_t null_handle_po2_zone_emu_violation(struct nullb_cmd *cmd,
> +						       enum req_opf op)
> +{
> +	if (op == REQ_OP_READ) {
> +		if (cmd->bio)
> +			zero_fill_bio(cmd->bio);
> +		else
> +			zero_fill_bio(cmd->rq->bio);
> +
> +		return BLK_STS_OK;
> +	} else {
> +		return BLK_STS_IOERR;
> +	}
> +}
> +
> +static bool null_verify_sector_violation(struct nullb_device *dev,
> +					 sector_t sector)
> +{
> +	if (unlikely(null_is_po2_zone_emu(dev))) {
> +		/* The zone idx should be calculated based on the emulated
> +		 * layout
> +		 */
> +		u32 zone_idx = sector >> ilog2(dev->zone_size_po2_sects);
> +		sector_t zsze = dev->zone_size_sects;
> +		sector_t sect = null_zone_update_sector_po2_emu(dev, sector);
> +
> +		if (sect - (zone_idx * zsze) > zsze)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
>  				    sector_t sector, sector_t nr_sectors)
>  {
> -	struct nullb_device *dev;
> +	struct nullb_device *dev = cmd->nq->dev;
>  	struct nullb_zone *zone;
>  	blk_status_t sts;
>  
> +	/* Handle when the sector falls in the emulated area */
> +	if (unlikely(null_verify_sector_violation(dev, sector)))
> +		return null_handle_po2_zone_emu_violation(cmd, op);
> +
> +	/* The sector value is updated if po2 emulation is enabled, else it
> +	 * will have no effect on the value
> +	 */
> +	sector = dev->zone_update_sector(dev, sector);
> +
>  	switch (op) {
>  	case REQ_OP_WRITE:
>  		return null_zone_write(cmd, sector, nr_sectors, false);
> @@ -669,8 +825,7 @@ blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
>  	case REQ_OP_ZONE_FINISH:
>  		return null_zone_mgmt(cmd, op, sector);
>  	default:
> -		dev = cmd->nq->dev;
> -		zone = &dev->zones[null_zone_no(dev, sector)];
> +		zone = &dev->zones[dev->zone_no(dev, sector)];
>  
>  		null_lock_zone(dev, zone);
>  		sts = null_process_cmd(cmd, op, sector, nr_sectors);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-09  3:40       ` Damien Le Moal
@ 2022-03-09 13:19         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 13:19 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 04:40, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
>>  	nr_zones = min_t(unsigned int, nr_zones,
>> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
>> +			 get_capacity(ns->disk) / ns->zsze);
> 
> This will not compile on 32-bits arch. This needs to use div64_u64().
> 
Oops. I will fix that up in the next revision and also in other places
that do not use div64_u64. Thanks.
> 
> 

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-09  3:44       ` Damien Le Moal
@ 2022-03-09 13:35         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 13:35 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 04:44, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
>>  
>>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
>> -	if (!is_power_of_2(ns->zsze)) {
>> -		dev_warn(ns->ctrl->device,
>> -			"invalid zone size:%llu for namespace:%u\n",
>> -			ns->zsze, ns->head->ns_id);
>> -		status = -ENODEV;
>> -		goto free_data;
>> -	}
> 
> Doing this will allow a non power of 2 zone sized device to be seen by
> the block layer. This will break functions such as blkdev_nr_zones(), but
> this patch does not change these functions, nor the others that use bit
> shift calculations.

The goal of this patchset was to emulate a po2 zone size for a npo2 device to the
block layer. If you see the `npo2_zone_setup` callback in the NVMe driver (patch 4/6),
we do the following:
```
+   ns->zsze_po2 = 1 << get_count_order_long(ns->zsze);
+   capacity = nr_zones * ns->zsze_po2;
+   set_capacity_and_notify(ns->disk, capacity);
```
So we adapt the capacity of the disk based on the po2 zone size. The chunk sectors
are also set to this new po2 zone size. Therefore, all the block layer functions will
continue to work as the block layer sees the zone size of the device to be ns->zsze_po2 and
not the actual device zone size which is ns->zsze.
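
(A small worked example with hypothetical numbers, just to illustrate the
remapping:)

	/* device zone size 96M, emulated zone size 128M, in 512B sectors */
	ns->zsze      = 196608;   /*  96M */
	ns->zsze_po2  = 262144;   /* 128M */
	ns->zsze_diff =  65536;   /*  32M */

	/* a block layer sector in zone 3, e.g. 3 * 262144 + 100, maps to
	 * device sector 3 * 262144 + 100 - 3 * 65536 = 3 * 196608 + 100 */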

Changing functions such as blkdev_nr_zones that use po2 calculations will/should be dealt
with separately if we decide to relax the po2 constraint in the block layer.

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops
  2022-03-09  3:46       ` Damien Le Moal
@ 2022-03-09 14:02         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 14:02 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 04:46, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
>> A new fops is added to the block device which will be used to set up the
>> necessary hooks when a non-power_of_2 zone size is detected in a zoned
>> device.
>>
>> This fops will be called as a part of blk_revalidate_disk_zones.
> 
> And what does this new hook do ? You are actually not explaining it, nor
> why it should be called from blk_revalidate_disk_zones().
I should have elaborated the "why" and "how" a bit more in my commit log.
I will fix it in my next revision.

The main reason it was added and is called as part of blk_revalidate_disk_zones
is this: as the block layer expects zone sizes to be po2, this fops can be used
by the driver to configure a npo2 device to be presented as a po2 device before
parameters such as the zone bitmaps and chunk sectors are set.

> Also, blk_revalidate_zone_cb() uses a bit shift, but neither this patch nor
> the previous one fixes that.
> 
The answer I gave to the blkdev_nr_zones question for patch 1/6 applies here as well.
The zone size used by blk_revalidate_zone_cb will be the emulated po2 zone size
and not the actual device zone size which is npo2. So the math currently used in
blk_revalidate_zone_cb is still applicable.

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-09  4:04       ` Damien Le Moal
@ 2022-03-09 14:33         ` Pankaj Raghav
  2022-03-09 21:43           ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 14:33 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 05:04, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
 
> Contradiction: reducing the impact of performance regression is not the
> same as "does not create a performance regression". So which one is it ?
> Please add performance numbers to this commit message.

> And also please actually explain what the patch is changing. This commit
> message is about the why, but nothing on the how.
>
I will reword and add a bit more context to the commit log with perf numbers
in the next revision.
>> +EXPORT_SYMBOL_GPL(nvme_fail_po2_zone_emu_violation);
>> +
> 
> This should go in zns.c, not in the core.
> 
Ok.

>> +
>> +static void nvme_npo2_zone_setup(struct gendisk *disk)
>> +{
>> +	nvme_ns_po2_zone_emu_setup(disk->private_data);
>> +}
> 
> This helper seems useless.
>
I tried to retain the pattern with report_zones which is currently like this:
static int nvme_report_zones(struct gendisk *disk, sector_t sector,
		unsigned int nr_zones, report_zones_cb cb, void *data)
{
	return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
			data);
}

But I can just remove this helper and use nvme_ns_po2_zone_emu_setup cb directly
in nvme_bdev_fops.

>> +
>>  /*
>>   * Convert a 512B sector number to a device logical block number.
>>   */
>>  static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
>>  {
>> -	return sector >> (ns->lba_shift - SECTOR_SHIFT);
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> +	return ns->sect_to_lba(ns, sector);
> 
> So for a power of 2 zone sized device, you are forcing an indirect call,
> always. Not acceptable. What is the point of that po2_zone_emu boolean
> you added to the queue ?
This is a good point and we had a discussion about this internally.
Initially I had something like this:
if (!blk_queue_is_po2_zone_emu(disk))
	return sector >> (ns->lba_shift - SECTOR_SHIFT);
else
	return __nvme_sect_to_lba_po2(ns, sec);

But @Luis indicated that it was better to set an op, which comes at the cost of an indirection,
instead of having a runtime check with an if/else in the **hot path**. The code also looks
clearer with an op.

The performance analysis that we performed did not show any regression while using the indirection
for po2 zone sizes, at least on the x86_64 architecture.
So maybe we could use this opportunity to discuss which approach could be used here.
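
(For comparison, a minimal sketch of the branch-based variant with a hint,
assuming the emulation case is the rare one; this is not what the patch
currently does:)

	static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
	{
		if (unlikely(blk_queue_is_po2_zone_emu(ns->queue)))
			return __nvme_sect_to_lba_po2(ns, sector);

		return sector >> (ns->lba_shift - SECTOR_SHIFT);
	}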


>> +
>> +		sector = nvme_lba_to_sect(ns,
>> +					  le64_to_cpu(nvme_req(req)->result.u64));
>> +
>> +		sector = ns->update_sector_append(ns, sector);
> 
> Why not assign that to req->__sector directly ?
> And again here, you are forcing the indirect function call for *all* zns
> devices, even those that have a power of 2 zone size.
>
Same answer as above. I will adapt them based on the outcome of our
discussions.
>> +
>> +		req->__sector = sector;
>> +	}
>> +
>> +static inline bool nvme_po2_zone_emu_violation(struct request *req)
>> +{
>> +	return req->rq_flags & RQF_ZONE_EMU_VIOLATION;
>> +}
> 
> This helper makes the code unreadable in my opinion.
>
I will open code it then.
>> +#else
>> +static inline void nvme_end_req_zoned(struct request *req)
>> +{
>> +}
>> +
>> +static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
>> +{
>> +}
>> +
>> +static inline bool nvme_po2_zone_emu_violation(struct request *req)
>> +{
>> +	return false;
>> +}
>> +
>> +#endif
>> +
>>  /*
>>   * Convert byte length to nvme's 0-based num dwords
>>   */
>> @@ -752,6 +842,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>  long nvme_dev_ioctl(struct file *file, unsigned int cmd,
>>  		unsigned long arg);
>>  int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
>> +blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req);
>>  
>>  extern const struct attribute_group *nvme_ns_id_attr_groups[];
>>  extern const struct pr_ops nvme_pr_ops;
>> @@ -873,11 +964,13 @@ static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>  int nvme_revalidate_zones(struct nvme_ns *ns);
>>  int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  		unsigned int nr_zones, report_zones_cb cb, void *data);
>> +void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns);
>>  #ifdef CONFIG_BLK_DEV_ZONED
>>  int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf);
>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  				       struct nvme_command *cmnd,
>>  				       enum nvme_zone_mgmt_action action);
>> +blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req);
>>  #else
>>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>>  		struct request *req, struct nvme_command *cmnd,
>> @@ -892,6 +985,11 @@ static inline int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>>  		 "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
>>  	return -EPROTONOSUPPORT;
>>  }
>> +
>> +static inline blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
>> +{
>> +	return BLK_STS_OK;
>> +}
>>  #endif
>>  
>>  static inline struct nvme_ns *nvme_get_ns_from_dev(struct device *dev)
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 6a99ed680915..fc022df3f98e 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -960,6 +960,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  		return nvme_fail_nonready_command(&dev->ctrl, req);
>>  
>>  	ret = nvme_prep_rq(dev, req);
>> +
>> +	if (unlikely(nvme_po2_zone_emu_violation(req)))
>> +		return nvme_fail_po2_zone_emu_violation(req);
>> +
>>  	if (unlikely(ret))
>>  		return ret;
>>  	spin_lock(&nvmeq->sq_lock);
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index ad02c61c0b52..25516a5ae7e2 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -3,7 +3,9 @@
>>   * Copyright (C) 2020 Western Digital Corporation or its affiliates.
>>   */
>>  
>> +#include <linux/log2.h>
>>  #include <linux/blkdev.h>
>> +#include <linux/math.h>
>>  #include <linux/vmalloc.h>
>>  #include "nvme.h"
>>  
>> @@ -46,6 +48,18 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>>  	return 0;
>>  }
>>  
>> +static sector_t nvme_zone_size(struct nvme_ns *ns)
>> +{
>> +	sector_t zone_size;
>> +
>> +	if (blk_queue_is_po2_zone_emu(ns->queue))
>> +		zone_size = ns->zsze_po2;
>> +	else
>> +		zone_size = ns->zsze;
>> +
>> +	return zone_size;
>> +}
>> +
>>  int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>>  {
>>  	struct nvme_effects_log *log = ns->head->effects;
>> @@ -122,7 +136,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>>  				   sizeof(struct nvme_zone_descriptor);
>>  
>>  	nr_zones = min_t(unsigned int, nr_zones,
>> -			 get_capacity(ns->disk) / ns->zsze);
>> +			 get_capacity(ns->disk) / nvme_zone_size(ns));
>>  
>>  	bufsize = sizeof(struct nvme_zone_report) +
>>  		nr_zones * sizeof(struct nvme_zone_descriptor);
>> @@ -147,6 +161,8 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>  				 void *data)
>>  {
>>  	struct blk_zone zone = { };
>> +	u64 zone_gap = 0;
>> +	u32 zone_idx;
>>  
>>  	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
>>  		dev_err(ns->ctrl->device, "invalid zone type %#x\n",
>> @@ -159,10 +175,19 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>  	zone.len = ns->zsze;
>>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>> +
>> +	if (blk_queue_is_po2_zone_emu(ns->queue)) {
>> +		zone_idx = zone.start / zone.len;
>> +		zone_gap = zone_idx * ns->zsze_diff;
>> +		zone.start += zone_gap;
>> +		zone.len = ns->zsze_po2;
>> +	}
>> +
>>  	if (zone.cond == BLK_ZONE_COND_FULL)
>>  		zone.wp = zone.start + zone.len;
>>  	else
>> -		zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>> +		zone.wp =
>> +			nvme_lba_to_sect(ns, le64_to_cpu(entry->wp)) + zone_gap;
>>  
>>  	return cb(&zone, idx, data);
>>  }
>> @@ -173,6 +198,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  	struct nvme_zone_report *report;
>>  	struct nvme_command c = { };
>>  	int ret, zone_idx = 0;
>> +	u64 zone_size = nvme_zone_size(ns);
>>  	unsigned int nz, i;
>>  	size_t buflen;
>>  
>> @@ -190,7 +216,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>  	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>>  
>> -	sector &= ~(ns->zsze - 1);
>> +	sector = rounddown(sector, zone_size);
>>  	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>>  		memset(report, 0, buflen);
>>  
>> @@ -214,7 +240,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  			zone_idx++;
>>  		}
>>  
>> -		sector += ns->zsze * nz;
>> +		sector += zone_size * nz;
>>  	}
>>  
>>  	if (zone_idx > 0)
>> @@ -226,6 +252,32 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  	return ret;
>>  }
>>  
>> +void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns)
>> +{
>> +	u32 nr_zones;
>> +	sector_t capacity;
>> +
>> +	if (is_power_of_2(ns->zsze))
>> +		return;
>> +
>> +	if (!get_capacity(ns->disk))
>> +		return;
>> +
>> +	blk_mq_freeze_queue(ns->disk->queue);
>> +
>> +	blk_queue_po2_zone_emu(ns->queue, 1);
>> +	ns->zsze_po2 = 1 << get_count_order_long(ns->zsze);
>> +	ns->zsze_diff = ns->zsze_po2 - ns->zsze;
>> +
>> +	nr_zones = get_capacity(ns->disk) / ns->zsze;
>> +	capacity = nr_zones * ns->zsze_po2;
>> +	set_capacity_and_notify(ns->disk, capacity);
>> +	ns->sect_to_lba = __nvme_sect_to_lba_po2;
>> +	ns->update_sector_append = nvme_update_sector_append_po2_zone_emu;
>> +
>> +	blk_mq_unfreeze_queue(ns->disk->queue);
>> +}
>> +
>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>  {
>> @@ -239,5 +291,24 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  	if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
>>  		c->zms.select_all = 1;
>>  
>> +	nvme_verify_sector_value(ns, req);
>> +	return BLK_STS_OK;
>> +}
>> +
>> +blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
>> +{
>> +	/*  The spec mentions that read from ZCAP until ZSZE shall behave
>> +	 *  like a deallocated block. Deallocated block reads are
>> +	 *  deterministic, hence we fill zero.
>> +	 * The spec does not clearly define the result for other opreation.
>> +	 */
> 
> Comment style and indentation is weird.
>
Ack.
>> +	if (req_op(req) == REQ_OP_READ) {
>> +		zero_fill_bio(req->bio);
>> +		nvme_req(req)->status = NVME_SC_SUCCESS;
>> +	} else {
>> +		nvme_req(req)->status = NVME_SC_WRITE_FAULT;
>> +	}
> 
> What about requests that straddle the zone capacity ? They need to be
> partially zeroed too, otherwise data from the next zone may be exposed.
>
Good point. I will add this support in the next revision. Thanks.
>> +	blk_mq_set_request_complete(req);
>> +	nvme_complete_rq(req);
>>  	return BLK_STS_OK;
>>  }
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 3a41d50b85d3..9ec59183efcd 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -57,6 +57,8 @@ typedef __u32 __bitwise req_flags_t;
>>  #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
>>  /* queue has elevator attached */
>>  #define RQF_ELV			((__force req_flags_t)(1 << 22))
>> +/* request to do IO in the emulated area with po2 zone emulation */
>> +#define RQF_ZONE_EMU_VIOLATION	((__force req_flags_t)(1 << 23))
>>  
>>  /* flags that prevent us from merging requests: */
>>  #define RQF_NOMERGE_FLAGS \
> 
> 

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device
  2022-03-09  4:09       ` Damien Le Moal
@ 2022-03-09 14:42         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 14:42 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 05:09, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
>> power_of_2(PO2) emulation support is added to the null blk device to
>> measure performance.
>>
>> Callbacks are added in the hotpaths that require PO2 emulation specific
>> behaviour to reduce the impact on the existing path.
>>
>> The power_of_2 emulation support is wired up for both the request and
>> the bio queue mode and it is automatically enabled when the given zone
>> size is non-power_of_2.
> 
> This does not make any sense. Why would you want to add power of 2 zone
> size emulation to nullblk ? Just set the zone size to be a power of 2...
> 
> If this is for test purpose, then use QEMU. These changes make no sense
> to me here.
> 
I see your point but this was mainly added to measure the performance impact.

I ran the conformance test with different configurations in QEMU, but I don't
think QEMU would be the preferred option to measure performance, especially if we
want to account for the changes we made to the hot path with an indirection.

As ZNS drives are not available in retail stores, this patch also provides a way
for the community to reproduce the performance analysis that we did without needing
a real device.

> A change that would make sense in the context of this series is to allow
> for setting a zoned null_blk device zone size to a non power of 2 size.
This is not possible as long as the block layer expects zone sizes to be po2.
A null blk device with non po2 zone size will only work with the emulation that is
added as a part of this patch.

As I said before, once we relax the block layer requirement, then we could allow
non po2 zone sizes without a lot of changes to the null blk device.
> But this series does not actually deal with that. So do not touch this
> driver please.
> 
If you really think it doesn't belong here, then I can take it out in the next
revision.
-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-09 14:33         ` Pankaj Raghav
@ 2022-03-09 21:43           ` Damien Le Moal
  2022-03-10 20:35             ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09 21:43 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 23:33, Pankaj Raghav wrote:
> 
> 
> On 2022-03-09 05:04, Damien Le Moal wrote:
>> On 3/9/22 01:53, Pankaj Raghav wrote:
>  
>> Contradiction: reducing the impact of performance regression is not the
>> same as "does not create a performance regression". So which one is it ?
>> Please add performance numbers to this commit message.
> 
>> And also please actually explain what the patch is changing. This commit
>> message is about the why, but nothing on the how.
>>
> I will reword and add a bit more context to the commit log with perf numbers
> in the next revision
>>> +EXPORT_SYMBOL_GPL(nvme_fail_po2_zone_emu_violation);
>>> +
>>
>> This should go in zns.c, not in the core.
>>
> Ok.
> 
>>> +
>>> +static void nvme_npo2_zone_setup(struct gendisk *disk)
>>> +{
>>> +	nvme_ns_po2_zone_emu_setup(disk->private_data);
>>> +}
>>
>> This helper seems useless.
>>
> I tried to retain the pattern with report_zones which is currently like this:
> static int nvme_report_zones(struct gendisk *disk, sector_t sector,
> 		unsigned int nr_zones, report_zones_cb cb, void *data)
> {
> 	return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
> 			data);
> }
> 
> But I can just remove this helper and use nvme_ns_po2_zone_emu_setup cb directly
> in nvme_bdev_fops.
> 
>>> +
>>>  /*
>>>   * Convert a 512B sector number to a device logical block number.
>>>   */
>>>  static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
>>>  {
>>> -	return sector >> (ns->lba_shift - SECTOR_SHIFT);
>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>> +	return ns->sect_to_lba(ns, sector);
>>
>> So for a power of 2 zone sized device, you are forcing an indirect call,
>> always. Not acceptable. What is the point of that po2_zone_emu boolean
>> you added to the queue ?
> This is a good point and we had a discussion about this internally.
> Initially I had something like this:
> if (!blk_queue_is_po2_zone_emu(disk))
> 	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> else
> 	return __nvme_sect_to_lba_po2(ns, sec);

No need for the else.
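I.e. simply (keeping the names from your snippet):

	if (!blk_queue_is_po2_zone_emu(disk))
		return sector >> (ns->lba_shift - SECTOR_SHIFT);

	return __nvme_sect_to_lba_po2(ns, sector);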

> 
> But @Luis indicated that it was better to set an op which comes at a cost of indirection
> instead of having a runtime check with an if/else in the **hot path**. The code also looks
> more clear with having an op.

The indirect call using a function pointer makes the code obscure. And
the cost of that call is far greater and always present compared to the
CPU branch prediction, which will most of the time avoid taking
the wrong branch of an if.

> 
> The performance analysis that we performed did not show any regression while using the indirection
> for po2 zone sizes, at least on the x86_64 architecture.
> So maybe we could use this opportunity to discuss which approach could be used here.

The easiest one that makes the code clear and easy to understand.



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-08 16:53 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Pankaj Raghav
                     ` (5 preceding siblings ...)
       [not found]   ` <CGME20220308165448eucas1p12c7c302a4b239db64b49d54cc3c1f0ac@eucas1p1.samsung.com>
@ 2022-03-10  9:47   ` Christoph Hellwig
  2022-03-10 12:57     ` Pankaj Raghav
  2022-03-10 17:38     ` Adam Manzanares
  6 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-10  9:47 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

This is complete bonkers.  IFF we have a good reason to support non
power of two zones size (and I'd like to see evidence for that) we'll
need to go through all the layers to support it.  But doing this emulation
is just idiotic and will add tons of code just to completely confuse users.

On Tue, Mar 08, 2022 at 05:53:43PM +0100, Pankaj Raghav wrote:
> 
> #Motivation:
> There are currently ZNS drives that are produced and deployed that do
> not have power_of_2(PO2) zone size. The NVMe spec for ZNS does not
> specify the PO2 requirement but the linux block layer currently checks
> for zoned devices to have power_of_2 zone sizes.

Well, apparently whoever produces these drives never cared about supporting
Linux as the power of two requirement goes back to SMR HDDs, which also
don't have that requirement in the spec (and even allow non-uniform zone
size), but Linux decided that we want this for sanity.

Do these drives even support Zone Append?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10  9:47   ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Christoph Hellwig
@ 2022-03-10 12:57     ` Pankaj Raghav
  2022-03-10 13:07       ` Matias Bjørling
  2022-03-10 14:44       ` Christoph Hellwig
  2022-03-10 17:38     ` Adam Manzanares
  1 sibling, 2 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-10 12:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	jiangbo.365, kanchan Joshi, Jens Axboe, Keith Busch,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal



On 2022-03-10 10:47, Christoph Hellwig wrote:
> This is complete bonkers.  IFF we have a good reason to support non
> power of two zones size (and I'd like to see evidence for that) we'll

non power of 2 support is important to the users, and that is why we
started this effort. I have also CCed Bo from Bytedance based
on their request.

> need to go through all the layers to support it.  But doing this emulation
> is just idiotic and will at tons of code just to completely confuse users.
> 

I agree with your point to create the non power of 2 support through all
the layers but this is the first step.

One of the early feedback that we got from Damien is to not break the
existing kernel and userspace applications that are written with the po2
assumption.

The following are the steps we have in the pipeline:
- Remove the constraint in the block layer
- Start migrating the kernel applications such as btrfs so that they also
work on non power of 2 devices.

Of course, we wanted to post RFCs to the steps mentioned above so that
there could be a public discussion about the issues.

> Well, apparently whoever produces these drives never cared about supporting
> Linux as the power of two requirement goes back to SMR HDDs, which also
> don't have that requirement in the spec (and even allow non-uniform zone
> size), but Linux decided that we want this for sanity.
> 
> Do these drives even support Zone Append?

Yes, these drives are intended for Linux users that would use the zoned
block device. Append is supported but holes in the LBA space (due to
diff in zone cap and zone size) are still a problem for these users.
-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 12:57     ` Pankaj Raghav
@ 2022-03-10 13:07       ` Matias Bjørling
  2022-03-10 13:14         ` Javier González
  2022-03-10 14:44       ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Matias Bjørling @ 2022-03-10 13:07 UTC (permalink / raw)
  To: Pankaj Raghav, Christoph Hellwig
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	jiangbo.365, kanchan Joshi, Jens Axboe, Keith Busch,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme, Damien Le Moal

> Yes, these drives are intended for Linux users that would use the zoned
> block device. Append is supported but holes in the LBA space (due to diff in
> zone cap and zone size) is still a problem for these users.

With respect to the specific users, what does it break specifically? What key features are they missing when there are holes?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 13:07       ` Matias Bjørling
@ 2022-03-10 13:14         ` Javier González
  2022-03-10 14:58           ` Matias Bjørling
  0 siblings, 1 reply; 83+ messages in thread
From: Javier González @ 2022-03-10 13:14 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Pankaj Raghav, Christoph Hellwig, Luis Chamberlain,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Keith Busch, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	Damien Le Moal



> On 10 Mar 2022, at 14.07, Matias Bjørling <matias.bjorling@wdc.com> wrote:
> 
> 
>> 
>> Yes, these drives are intended for Linux users that would use the zoned
>> block device. Append is supported but holes in the LBA space (due to diff in
>> zone cap and zone size) is still a problem for these users.
> 
> With respect to the specific users, what does it break specifically? What are key features are they missing when there's holes? 

What we hear is that it breaks existing mappings in applications, where the address space is seen as contiguous; with holes they need to account for the unmapped space. This affects performance and CPU due to unnecessary splits. This is for both reads and writes.

For more details, I guess they will have to jump in and share the parts that they consider is proper to share in the mailing list. 

I guess we will have more conversations around this as we push the block layer changes after this series. 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 12:57     ` Pankaj Raghav
  2022-03-10 13:07       ` Matias Bjørling
@ 2022-03-10 14:44       ` Christoph Hellwig
  2022-03-11 20:19         ` Luis Chamberlain
  1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-10 14:44 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Christoph Hellwig, Luis Chamberlain, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Keith Busch, Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Thu, Mar 10, 2022 at 01:57:58PM +0100, Pankaj Raghav wrote:
> Yes, these drives are intended for Linux users that would use the zoned
> block device. Append is supported but holes in the LBA space (due to
> diff in zone cap and zone size) is still a problem for these users.

I'd really like to hear from the users.  Because really, either they
should use a proper file system abstraction (including zonefs if that is
all they need), or raw nvme passthrough which will alredy work for this
case.  But adding a whole bunch of crap because people want to use the
block device special file for something it is not designed for just
does not make any sense.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 13:14         ` Javier González
@ 2022-03-10 14:58           ` Matias Bjørling
  2022-03-10 15:07             ` Keith Busch
  2022-03-10 15:13             ` Javier González
  0 siblings, 2 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-10 14:58 UTC (permalink / raw)
  To: Javier González
  Cc: Pankaj Raghav, Christoph Hellwig, Luis Chamberlain,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Keith Busch, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	Damien Le Moal

 >> Yes, these drives are intended for Linux users that would use the
> >> zoned block device. Append is supported but holes in the LBA space
> >> (due to diff in zone cap and zone size) is still a problem for these users.
> >
> > With respect to the specific users, what does it break specifically? What are
> key features are they missing when there's holes?
> 
> What we hear is that it breaks existing mapping in applications, where the
> address space is seen as contiguous; with holes it needs to account for the
> unmapped space. This affects performance and and CPU due to unnecessary
> splits. This is for both reads and writes.
> 
> For more details, I guess they will have to jump in and share the parts that
> they consider is proper to share in the mailing list.
> 
> I guess we will have more conversations around this as we push the block
> layer changes after this series.

Ok, so I hear that one issue is I/O splits - if I assume that reads are sequential and zone cap/size is between 100MiB and 1GiB, then my gut feeling would tell me it's less CPU intensive to split every 100MiB to 1GiB of reads than it would be to not have power of 2 zones due to the extra per-IO calculations.

Do I have a faulty assumption about the above, or is there more to it?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 14:58           ` Matias Bjørling
@ 2022-03-10 15:07             ` Keith Busch
  2022-03-10 15:16               ` Javier González
  2022-03-10 15:13             ` Javier González
  1 sibling, 1 reply; 83+ messages in thread
From: Keith Busch @ 2022-03-10 15:07 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Javier González, Pankaj Raghav, Christoph Hellwig,
	Luis Chamberlain, Adam Manzanares, Javier González,
	jiangbo.365, kanchan Joshi, Jens Axboe, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	Damien Le Moal

On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote:
>  >> Yes, these drives are intended for Linux users that would use the
> > >> zoned block device. Append is supported but holes in the LBA space
> > >> (due to diff in zone cap and zone size) is still a problem for these users.
> > >
> > > With respect to the specific users, what does it break specifically? What are
> > key features are they missing when there's holes?
> > 
> > What we hear is that it breaks existing mapping in applications, where the
> > address space is seen as contiguous; with holes it needs to account for the
> > unmapped space. This affects performance and and CPU due to unnecessary
> > splits. This is for both reads and writes.
> > 
> > For more details, I guess they will have to jump in and share the parts that
> > they consider is proper to share in the mailing list.
> > 
> > I guess we will have more conversations around this as we push the block
> > layer changes after this series.
> 
> Ok, so I hear that one issue is I/O splits - If I assume that reads
> are sequential, zone cap/size between 100MiB and 1GiB, then my gut
> feeling would tell me its less CPU intensive to split every 100MiB to
> 1GiB of reads, than it would be to not have power of 2 zones due to
> the extra per io calculations. 

Don't you need to split anyway when spanning two zones to avoid the zone
boundary error?

Maybe this is a silly idea, but it would be a trivial device-mapper
to remap the gaps out of the lba range.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 14:58           ` Matias Bjørling
  2022-03-10 15:07             ` Keith Busch
@ 2022-03-10 15:13             ` Javier González
  1 sibling, 0 replies; 83+ messages in thread
From: Javier González @ 2022-03-10 15:13 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Pankaj Raghav, Christoph Hellwig, Luis Chamberlain,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Keith Busch, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme, Damien Le Moal

On 10.03.2022 14:58, Matias Bjørling wrote:
> >> Yes, these drives are intended for Linux users that would use the
>> >> zoned block device. Append is supported but holes in the LBA space
>> >> (due to diff in zone cap and zone size) is still a problem for these users.
>> >
>> > With respect to the specific users, what does it break specifically? What are
>> key features are they missing when there's holes?
>>
>> What we hear is that it breaks existing mapping in applications, where the
>> address space is seen as contiguous; with holes it needs to account for the
>> unmapped space. This affects performance and and CPU due to unnecessary
>> splits. This is for both reads and writes.
>>
>> For more details, I guess they will have to jump in and share the parts that
>> they consider is proper to share in the mailing list.
>>
>> I guess we will have more conversations around this as we push the block
>> layer changes after this series.
>
>Ok, so I hear that one issue is I/O splits - If I assume that reads are sequential, zone cap/size between 100MiB and 1GiB, then my gut feeling would tell me its less CPU intensive to split every 100MiB to 1GiB of reads, than it would be to not have power of 2 zones due to the extra per io calculations.
>
>Do I have a faulty assumption about the above, or is there more to it?

I do not have numbers on the number of splits. I can only say that it is
an issue. Then there is the whole management, which apparently also costs
some DRAM for the extra mapping, instead of simply doing +1.

The goal for these customers is not having the emulation, so the cost of
the !PO2 path would be 0.

For the existing applications that require a PO2, we have the emulation.
In this case, the cost will only be paid on the devices that implement
!PO2 zones.

Hope this answer the question.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 15:07             ` Keith Busch
@ 2022-03-10 15:16               ` Javier González
  2022-03-10 23:44                 ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Javier González @ 2022-03-10 15:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matias Bjørling, Pankaj Raghav, Christoph Hellwig,
	Luis Chamberlain, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme, Damien Le Moal

On 10.03.2022 07:07, Keith Busch wrote:
>On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote:
>>  >> Yes, these drives are intended for Linux users that would use the
>> > >> zoned block device. Append is supported but holes in the LBA space
>> > >> (due to diff in zone cap and zone size) is still a problem for these users.
>> > >
>> > > With respect to the specific users, what does it break specifically? What are
>> > key features are they missing when there's holes?
>> >
>> > What we hear is that it breaks existing mapping in applications, where the
>> > address space is seen as contiguous; with holes it needs to account for the
>> > unmapped space. This affects performance and and CPU due to unnecessary
>> > splits. This is for both reads and writes.
>> >
>> > For more details, I guess they will have to jump in and share the parts that
>> > they consider is proper to share in the mailing list.
>> >
>> > I guess we will have more conversations around this as we push the block
>> > layer changes after this series.
>>
>> Ok, so I hear that one issue is I/O splits - If I assume that reads
>> are sequential, zone cap/size between 100MiB and 1GiB, then my gut
>> feeling would tell me its less CPU intensive to split every 100MiB to
>> 1GiB of reads, than it would be to not have power of 2 zones due to
>> the extra per io calculations.
>
>Don't you need to split anyway when spanning two zones to avoid the zone
>boundary error?

If you have size = capacity then you can do a cross-zone read. This is
only a problem when we have gaps.

>Maybe this is a silly idea, but it would be a trivial device-mapper
>to remap the gaps out of the lba range.

One thing we have considered is that, as we remove the PO2 constraint
from the block layer, devices exposing PO2 zone sizes are able to
do the emulation the other way around to support things like this.

A device mapper is also a fine place to put this, but it seems like a
very simple task. Is it worth all the boilerplate code for the device
mapper only for this?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10  9:47   ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Christoph Hellwig
  2022-03-10 12:57     ` Pankaj Raghav
@ 2022-03-10 17:38     ` Adam Manzanares
  2022-03-14  7:36       ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Adam Manzanares @ 2022-03-10 17:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, Luis Chamberlain, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Sagi Grimberg,
	Damien Le Moal, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Thu, Mar 10, 2022 at 10:47:25AM +0100, Christoph Hellwig wrote:
> This is complete bonkers.  IFF we have a good reason to support non
> power of two zones size (and I'd like to see evidence for that) we'll
> need to go through all the layers to support it.  But doing this emulation
> is just idiotic and will at tons of code just to completely confuse users.
> 
> On Tue, Mar 08, 2022 at 05:53:43PM +0100, Pankaj Raghav wrote:
> > 
> > #Motivation:
> > There are currently ZNS drives that are produced and deployed that do
> > not have power_of_2(PO2) zone size. The NVMe spec for ZNS does not
> > specify the PO2 requirement but the linux block layer currently checks
> > for zoned devices to have power_of_2 zone sizes.
> 
> Well, apparently whoever produces these drives never cared about supporting
> Linux as the power of two requirement goes back to SMR HDDs, which also
> don't have that requirement in the spec (and even allow non-uniform zone
> size), but Linux decided that we want this for sanity.

Non uniform zone size definitely seems like a mess. Fixed zone sizes that
are non po2 don't seem insane to me given that chunk sectors is no longer
assumed to be po2. We have looked at removing po2 and the only hot path 
optimization for po2 is for appends.

> 
> Do these drives even support Zone Append?

Should it matter if the drives support append? SMR drives do not support append
and they are considered zoned block devices. Append seems to be an optimization
for users that want higher concurrency per zone. One can also build concurrency
by leveraging multiple zones simultaneously as well.



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-09 21:43           ` Damien Le Moal
@ 2022-03-10 20:35             ` Luis Chamberlain
  2022-03-10 23:50               ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-10 20:35 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Pankaj Raghav, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Thu, Mar 10, 2022 at 06:43:47AM +0900, Damien Le Moal wrote:
> On 3/9/22 23:33, Pankaj Raghav wrote:
> > On 2022-03-09 05:04, Damien Le Moal wrote:
> >> So for a power of 2 zone sized device, you are forcing an indirect call,
> >> always. Not acceptable. What is the point of that po2_zone_emu boolean
> >> you added to the queue ?
> > This is a good point and we had a discussion about this internally.
> > Initially I had something like this:
> > if (!blk_queue_is_po2_zone_emu(disk))
> > 	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> > else
> > 	return __nvme_sect_to_lba_po2(ns, sec);
> 
> No need for the else.

If true then great.

> > But @Luis indicated that it was better to set an op which comes at a cost of indirection
> > instead of having a runtime check with a if/else in the **hot path**. The code also looks
> > more clear with having an op.
> 
> The indirect call using a function pointer makes the code obscure. And
> the cost of that call is far greater and always present compared to the
> CPU branch prediction which will luckily avoid most of the time taking
> the wrong branch of an if.

The goal was to ensure no performance impact, and given a hot path
was involved and we simply cannot microbench append as there is no
way / API to do that, we can't be sure. But if you are certain that
there is no perf impact, it would be wonderful to live without it.

Thanks for the suggestion and push!

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 15:16               ` Javier González
@ 2022-03-10 23:44                 ` Damien Le Moal
  0 siblings, 0 replies; 83+ messages in thread
From: Damien Le Moal @ 2022-03-10 23:44 UTC (permalink / raw)
  To: Javier González, Keith Busch
  Cc: Matias Bjørling, Pankaj Raghav, Christoph Hellwig,
	Luis Chamberlain, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 3/11/22 00:16, Javier González wrote:
> On 10.03.2022 07:07, Keith Busch wrote:
>> On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote:
>>>  >> Yes, these drives are intended for Linux users that would use the
>>>>>> zoned block device. Append is supported but holes in the LBA space
>>>>>> (due to diff in zone cap and zone size) is still a problem for these users.
>>>>>
>>>>> With respect to the specific users, what does it break specifically? What are
>>>> key features are they missing when there's holes?
>>>>
>>>> What we hear is that it breaks existing mapping in applications, where the
>>>> address space is seen as contiguous; with holes it needs to account for the
>>>> unmapped space. This affects performance and and CPU due to unnecessary
>>>> splits. This is for both reads and writes.
>>>>
>>>> For more details, I guess they will have to jump in and share the parts that
>>>> they consider is proper to share in the mailing list.
>>>>
>>>> I guess we will have more conversations around this as we push the block
>>>> layer changes after this series.
>>>
>>> Ok, so I hear that one issue is I/O splits - If I assume that reads
>>> are sequential, zone cap/size between 100MiB and 1GiB, then my gut
>>> feeling would tell me its less CPU intensive to split every 100MiB to
>>> 1GiB of reads, than it would be to not have power of 2 zones due to
>>> the extra per io calculations.
>>
>> Don't you need to split anyway when spanning two zones to avoid the zone
>> boundary error?
> 
> If you have size = capacity then you can do a cross-zone read. This is
> only a problem when we have gaps.
> 
>> Maybe this is a silly idea, but it would be a trivial device-mapper
>> to remap the gaps out of the lba range.
> 
> One thing we have considered is that as we remove the PO2 constraint
> from the block layer is that devices exposing PO2 zone sizes are able to
> do the emulation the other way around to support things like this.
> 
> A device mapper is also a fine place to put this, but it seems like a
> very simple task. Is it worth all the boilerplate code for the device
> mapper only for this?

Boiler plate ? DM already supports zoned devices. Writing a "dm-unhole"
target would be extremely simple as it would essentially be a variation
of dm-linear. There should be no DM core changes needed.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-10 20:35             ` Luis Chamberlain
@ 2022-03-10 23:50               ` Damien Le Moal
  2022-03-11  0:56                 ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-10 23:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Pankaj Raghav, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/11/22 05:35, Luis Chamberlain wrote:
> On Thu, Mar 10, 2022 at 06:43:47AM +0900, Damien Le Moal wrote:
>> On 3/9/22 23:33, Pankaj Raghav wrote:
>>> On 2022-03-09 05:04, Damien Le Moal wrote:
>>>> So for a power of 2 zone sized device, you are forcing an indirect call,
>>>> always. Not acceptable. What is the point of that po2_zone_emu boolean
>>>> you added to the queue ?
>>> This is a good point and we had a discussion about this internally.
>>> Initially I had something like this:
>>> if (!blk_queue_is_po2_zone_emu(disk))
>>> 	return sector >> (ns->lba_shift - SECTOR_SHIFT);
>>> else
>>> 	return __nvme_sect_to_lba_po2(ns, sec);
>>
>> No need for the else.
> 
> If true then great.

If true ? The else is clearly not needed here. One less line of code.

> 
>>> But @Luis indicated that it was better to set an op which comes at a cost of indirection
>>> instead of having a runtime check with a if/else in the **hot path**. The code also looks
>>> more clear with having an op.
>>
>> The indirect call using a function pointer makes the code obscure. And
>> the cost of that call is far greater and always present compared to the
>> CPU branch prediction which will luckily avoid most of the time taking
>> the wrong branch of an if.
> 
> The goal was to ensure no performance impact, and given a hot path
> was involved and we simply cannot microbench append as there is no
> way / API to do that, we can't be sure. But if you are certain that
> there is no perf impact, it would be wonderful to live without it.

Use zonefs. It uses zone append.
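A minimal recipe (assuming zonefs-tools is installed and the ZNS namespace
shows up as /dev/nvme0n2):

  # mkzonefs /dev/nvme0n2
  # mount -t zonefs /dev/nvme0n2 /mnt
  # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct

Synchronous direct writes to the sequential zone files under /mnt/seq then
exercise the zone append path.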

> 
> Thanks for the suggestion and push!
> 
>   Luis


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-10 23:50               ` Damien Le Moal
@ 2022-03-11  0:56                 ` Luis Chamberlain
  0 siblings, 0 replies; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-11  0:56 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Pankaj Raghav, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Fri, Mar 11, 2022 at 08:50:47AM +0900, Damien Le Moal wrote:
> On 3/11/22 05:35, Luis Chamberlain wrote:
> > On Thu, Mar 10, 2022 at 06:43:47AM +0900, Damien Le Moal wrote:
> >> The indirect call using a function pointer makes the code obscure. And
> >> the cost of that call is far greater and always present compared to the
> >> CPU branch prediction which will luckily avoid most of the time taking
> >> the wrong branch of an if.
> > 
> > The goal was to ensure no performance impact, and given a hot path
> > was involved and we simply cannot microbench append as there is no
> > way / API to do that, we can't be sure. But if you are certain that
> > there is no perf impact, it would be wonderful to live without it.
> 
> Use zonefs. It uses zone append.

That'd be a microbench on zonefs append, not raw append, and so I
suspect a simple branch cannot make a difference and would get lost
as noise. So I don't think a perf test is required here. Please
let me know.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 14:44       ` Christoph Hellwig
@ 2022-03-11 20:19         ` Luis Chamberlain
  2022-03-11 20:51           ` Keith Busch
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-11 20:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, Adam Manzanares, Javier González,
	jiangbo.365, kanchan Joshi, Jens Axboe, Keith Busch,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Thu, Mar 10, 2022 at 03:44:49PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 10, 2022 at 01:57:58PM +0100, Pankaj Raghav wrote:
> > Yes, these drives are intended for Linux users that would use the zoned
> > block device. Append is supported but holes in the LBA space (due to
> > diff in zone cap and zone size) is still a problem for these users.
> 
> I'd really like to hear from the users.  Because really, either they
> should use a proper file system abstraction (including zonefs if that is
> all they need),

That requires access to at least the block device and without PO2
emulation that is not possible. Using zonefs is not possible today
for !PO2 devices.

> or raw nvme passthrough which will alredy work for this
> case. 

This effort is not upstream yet; however, if and when it does land
upstream, it means something other than zonefs must be used, since
!PO2 devices are not supported by zonefs. So although the goal with
zonefs was to provide a unified interface for raw access for
applications, the PO2 requirement will essentially create
fragmentation.

> But adding a whole bunch of crap because people want to use the
> block device special file for something it is not designed for just
> does not make any sense.

Using Linux requires PO2. And so, per Damien's request, the
logical thing to do was to keep that requirement and to avoid any
performance regressions. That "crap" was done to slowly pave the way
forward to later removing the PO2 requirement.

I think we'll all acknowledge that doing emulation just means adding more
software for something that is not a NAND requirement, but a requirement
imposed by the inheritance of zoned software designed for SMR HDDs. I
think we may also all acknowledge now that keeping this emulation code
*forever* seems like complete insanity.

Since the PO2 requirement imposed on Linux today seems to be now
sending us down a dubious effort we'd need to support, let me
then try to get folks who have been saying that we must keep this
requirement to answer the following question:

Are you 100% sure your ZNS hardware team and firmware team will always
be happy you have baked in a PO2 requirement for ZNS drives on Linux
and are you ready to deal with those consequences on Linux forever? Really?

NAND has no PO2 requirement. The emulation effort was only done to help
add support for !PO2 devices because there is no alternative. If we
however are ready instead to go down the avenue of removing those
restrictions, well, let's go there instead. If that's not even
something we are willing to consider, I'd really like folks who stand
behind the PO2 requirement to stick their necks out and clearly say that
their hw/fw teams are happy to deal with this requirement forever on ZNS.

From what I am seeing this is a legacy requirement which we should be
able to remove. Keeping the requirement will only do harm to ZNS
adoption on Linux and it will also create *more* fragmentation.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 20:19         ` Luis Chamberlain
@ 2022-03-11 20:51           ` Keith Busch
  2022-03-11 21:04             ` Luis Chamberlain
                               ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Keith Busch @ 2022-03-11 20:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
> NAND has no PO2 requirement. The emulation effort was only done to help
> add support for !PO2 devices because there is no alternative. If we
> however are ready instead to go down the avenue of removing those
> restrictions well let's go there then instead. If that's not even
> something we are willing to consider I'd really like folks who stand
> behind the PO2 requirement to stick their necks out and clearly say that
> their hw/fw teams are happy to deal with this requirement forever on ZNS.

Regardless of the merits of the current OS requirement, it's a trivial
matter for firmware to round up their reported zone size to the next
power of 2. This does not create a significant burden on their part, as
far as I know.

And po2 does not even seem to be the real problem here. The holes seem
to be what's causing a concern, which you have even without po2 zones.
I'm starting to like the previous idea of creating an unholey
device-mapper for such users...

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 20:51           ` Keith Busch
@ 2022-03-11 21:04             ` Luis Chamberlain
  2022-03-11 21:31               ` Keith Busch
  2022-03-11 22:23             ` Adam Manzanares
  2022-03-21 16:21             ` Jonathan Derrick
  2 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-11 21:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
> > NAND has no PO2 requirement. The emulation effort was only done to help
> > add support for !PO2 devices because there is no alternative. If we
> > however are ready instead to go down the avenue of removing those
> > restrictions well let's go there then instead. If that's not even
> > something we are willing to consider I'd really like folks who stand
> > behind the PO2 requirement to stick their necks out and clearly say that
> > their hw/fw teams are happy to deal with this requirement forever on ZNS.
> 
> Regardless of the merits of the current OS requirement, it's a trivial
> matter for firmware to round up their reported zone size to the next
> power of 2. This does not create a significant burden on their part, as
> far as I know.

Sure sure.. fw can do crap like that too...

> And po2 does not even seem to be the real problem here. The holes seem
> to be what's causing a concern, which you have even without po2 zones.

Exactly.

> I'm starting to like the previous idea of creating an unholey
> device-mapper for such users...

Won't that restrict nvme with chunk size crap? For instance, later if we
want much larger block sizes.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 21:04             ` Luis Chamberlain
@ 2022-03-11 21:31               ` Keith Busch
  2022-03-11 22:24                 ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Keith Busch @ 2022-03-11 21:31 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> 
> > I'm starting to like the previous idea of creating an unholey
> > device-mapper for such users...
> 
> Won't that restrict nvme with chunk size crap. For instance later if we
> want much larger block sizes.

I'm not sure I understand. The chunk_size has nothing to do with the
block size. And while nvme is a user of this in some circumstances, it
can't be used concurrently with ZNS because the block layer appropriates
the field for the zone size.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 20:51           ` Keith Busch
  2022-03-11 21:04             ` Luis Chamberlain
@ 2022-03-11 22:23             ` Adam Manzanares
  2022-03-11 22:30               ` Keith Busch
  2022-03-21 16:21             ` Jonathan Derrick
  2 siblings, 1 reply; 83+ messages in thread
From: Adam Manzanares @ 2022-03-11 22:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: Luis Chamberlain, Christoph Hellwig, Pankaj Raghav,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
> > NAND has no PO2 requirement. The emulation effort was only done to help
> > add support for !PO2 devices because there is no alternative. If we
> > however are ready instead to go down the avenue of removing those
> > restrictions well let's go there then instead. If that's not even
> > something we are willing to consider I'd really like folks who stand
> > behind the PO2 requirement to stick their necks out and clearly say that
> > their hw/fw teams are happy to deal with this requirement forever on ZNS.
> 
> Regardless of the merits of the current OS requirement, it's a trivial
> matter for firmware to round up their reported zone size to the next
> power of 2. This does not create a significant burden on their part, as
> far as I know.

I can't comment on FW burdens but adding po2 zone size creates holes for the 
FW to deal with as well.

> 
> And po2 does not even seem to be the real problem here. The holes seem
> to be what's causing a concern, which you have even without po2 zones.
> I'm starting to like the previous idea of creating an unholey
> device-mapper for such users...

I see holes as being caused by having to make zone size po2 when capacity is 
not po2. The holes should be tied to po2, unless I am missing something. BTW, if
we go down the dm route, can we start calling it dm-unholy?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 21:31               ` Keith Busch
@ 2022-03-11 22:24                 ` Luis Chamberlain
  2022-03-12  7:58                   ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-11 22:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote:
> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
> > On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> > 
> > > I'm starting to like the previous idea of creating an unholey
> > > device-mapper for such users...
> > 
> > Won't that restrict nvme with chunk size crap. For instance later if we
> > want much larger block sizes.
> 
> I'm not sure I understand. The chunk_size has nothing to do with the
> block size. And while nvme is a user of this in some circumstances, it
> can't be used concurrently with ZNS because the block layer appropriates
> the field for the zone size.

Many device mapper targets split I/O into chunks (see max_io_len());
wouldn't this create an overhead?

Using a device mapper target also creates a divergence in strategy
for ZNS. Some will use the block device, others the dm target. The
goal should be to create a unified path.

And all this, just because SMR. Is that worth it? Are we sure?

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 22:23             ` Adam Manzanares
@ 2022-03-11 22:30               ` Keith Busch
  0 siblings, 0 replies; 83+ messages in thread
From: Keith Busch @ 2022-03-11 22:30 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Luis Chamberlain, Christoph Hellwig, Pankaj Raghav,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 10:23:33PM +0000, Adam Manzanares wrote:
> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> > And po2 does not even seem to be the real problem here. The holes seem
> > to be what's causing a concern, which you have even without po2 zones.
> > I'm starting to like the previous idea of creating an unholey
> > device-mapper for such users...
> 
> I see holes as being caused by having to make zone size po2 when capacity is 
> not po2. po2 should be tied to the holes, unless I am missing something. 

Practically speaking, you're probably not missing anything. The spec,
however, doesn't constrain the existence of holes to any particular zone
size.

> BTW if we go down the dm route can we start calling it dm-unholy.

I was thinking "dm-evil" but unholy works too. :)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 22:24                 ` Luis Chamberlain
@ 2022-03-12  7:58                   ` Damien Le Moal
  2022-03-14  7:35                     ` Christoph Hellwig
  2022-03-14  8:36                     ` Matias Bjørling
  0 siblings, 2 replies; 83+ messages in thread
From: Damien Le Moal @ 2022-03-12  7:58 UTC (permalink / raw)
  To: Luis Chamberlain, Keith Busch
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/12/22 07:24, Luis Chamberlain wrote:
> On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote:
>> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
>>> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
>>>
>>>> I'm starting to like the previous idea of creating an unholey
>>>> device-mapper for such users...
>>>
>>> Won't that restrict nvme with chunk size crap. For instance later if we
>>> want much larger block sizes.
>>
>> I'm not sure I understand. The chunk_size has nothing to do with the
>> block size. And while nvme is a user of this in some circumstances, it
>> can't be used concurrently with ZNS because the block layer appropriates
>> the field for the zone size.
> 
> Many device mapper targets split I/O into chunks, see max_io_len(),
> wouldn't this create an overhead?

Apart from the bio clone, the overhead should not be higher than what
the block layer already has. IOs that are too large or that are
straddling zones are split by the block layer, and DM splitting leads
generally to no split in the block layer for the underlying device IO.
DM essentially follows the same pattern: max_io_len() depends on the
target design limits, which in turn depend on the underlying device. For
a dm-unhole target, the IO size limit would typically be the same as
that of the underlying device.

> Using a device mapper target also creates a divergence in strategy
> for ZNS. Some will use the block device, others the dm target. The
> goal should be to create a unified path.

If we allow non power of 2 zone sized devices, the path will *never* be
unified because we will get fragmentation on what can run on these
devices as opposed to power of 2 sized ones. E.g. f2fs will not work for
the former but will for the latter. That is really not an ideal situation.

> 
> And all this, just because SMR. Is that worth it? Are we sure?

No. This is *not* because of SMR. Never has been. The first prototype
SMR drives I received in my lab 10 years ago did not have a power of 2
sized zone size because zones where naturally aligned to tracks, which
like NAND erase blocks, are not necessarily power of 2 sized. And all
zones were not even the same size. That was not usable.

The reason for the power of 2 requirement is 2 fold:
1) At the time we added zone support for SMR, chunk_sectors had to be a
power of 2 number of sectors.
2) SMR users did request power of 2 zone sizes and that all zones have
the same size as that simplified software design. There was even a
de-facto agreement that 256MB zone size is a good compromise between
usability and overhead of zone reclaim/GC. But that particular number is
for HDD due to their performance characteristics.

Hence the current Linux requirements which have been serving us well so
far. DM needed chunk_sectors to be changed to allow non power of 2
values. So the chunk_sectors requirement was lifted recently (can't
remember which version added this). Allowing non power of 2 zone size
would thus be more easily feasible now. Allowing devices with a non
power of 2 zone size is not technically difficult.

But...

The problem being raised is all about the fact that the power of 2 zone
size requirement creates a hole of unusable sectors in every zone when
the device implementation has a zone capacity lower than the zone size.

I have been arguing all along that I think this problem is a
non-problem, simply because a well designed application should *always*
use zones as storage containers without ever hoping that the next zone
in sequence can be used as well. The application should *never* consider
the entire LBA space of the device capacity without this zone split. The
zone based management of capacity is necessary for any good design to
deal correctly with write error recovery and active/open zone resources
management. And as Keith said, there is always a "hole" anyway for any
non-full zone, between the zone write pointer and the last usable sector
in the zone. Reads there are nonsensical and writes can only go to one
place.
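
(To put a number on it: with, say, a 128M zone size and a 96M zone
capacity, every zone has a 32M hole, so a quarter of the namespace LBA
space is never readable or writable no matter how the zones are used.)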

Now, in the spirit of trying to facilitate software development for
zoned devices, we can try finding solutions to remove that hole. zonefs
is an obvious solution. But back to the previous point: with one zone ==
one file, there is no continuity in the storage address space that the
application can use. The application has to be designed to use
individual files representing a zone. And with such design, an
equivalent design directly using the block device file would have no
difficulties due to the the sector hole between zone capacity and zone
size. I have a prototype LevelDB implementation that can use both zonefs
and block device file on ZNS with only a few different lines of code to
prove this point.

The other solution would be adding a dm-unhole target to remap sectors
to remove the holes from the device address space. Such a target would be
easy to write, but in my opinion, this would still not change the fact
that applications still have to deal with error recovery and active/open
zone resources. So they still have to be zone aware and operate per zone.

Furthermore, adding such a DM target would create a non power of 2 zone
size zoned device which will need support from the block layer. So some
block layer functions will need to change. In the end, this may not be
different than enabling non power of 2 zone sized devices for ZNS.

And for this decision, I maintain some of my requirements:
1) The added overhead from multiplication & divisions should be
acceptable and not degrade performance. Otherwise, this would be a
disservice to the zone ecosystem.
2) Nothing that works today on available devices should break
3) Zone size requirements will still exist. E.g. btrfs 64K alignment
requirement

But even with all these properly addressed, f2fs will not work anymore,
some in-kernel users will still need some zone size requirements (btrfs)
and *all* applications using a zoned block device file will now have to
be designed based on non power of 2 zone size so that they can work on
all devices. Meaning that this is also potentially forcing changes on
existing applications to use newer zoned devices that may not have a
power of 2 zone size.

This entire discussion is about the problem that power of 2 zone size
creates (which again I think is a non-problem). However, based on the
arguments above, allowing non power of 2 zone sized devices is not
exactly problem free either.

My answer to your last question ("Are we sure?") is thus: No. I am not
sure this is a good idea. But as always, I would be happy to be proven
wrong. So far, I have not seen any argument doing that.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-12  7:58                   ` Damien Le Moal
@ 2022-03-14  7:35                     ` Christoph Hellwig
  2022-03-14  7:45                       ` Damien Le Moal
  2022-03-14  8:36                     ` Matias Bjørling
  1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-14  7:35 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Luis Chamberlain, Keith Busch, Christoph Hellwig, Pankaj Raghav,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Matias Bjørling,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
> The reason for the power of 2 requirement is 2 fold:
> 1) At the time we added zone support for SMR, chunk_sectors had to be a
> power of 2 number of sectors.
> 2) SMR users did request power of 2 zone sizes and that all zones have
> the same size as that simplified software design. There was even a
> de-facto agreement that 256MB zone size is a good compromise between
> usability and overhead of zone reclaim/GC. But that particular number is
> for HDD due to their performance characteristics.

Also for NVMe we initially went down the road to try to support
non power of two sizes.  But there was another major early host that
really wanted the power of two zone sizes to support hardware based
hosts that can cheaply do shifts but not divisions.  The variable
zone capacity feature (something that Linux does not currently support)
is a feature requested by NVMe members on the host and device side and
also can only be supported with the zone size / zone capacity split.
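To illustrate the shift-vs-division point (this sketch is not from the
patchset or the spec; the zone sizes are just assumed example values,
e.g. a 128MiB PO2 zone vs. a 96MiB NPO2 zone with 512-byte sectors):

  #include <stdint.h>

  /* PO2 zone size: the zone index reduces to a shift (and the in-zone
   * offset to a mask), which is cheap even in hardware. */
  static inline uint64_t zone_no_po2(uint64_t sector, unsigned int zone_shift)
  {
          return sector >> zone_shift;      /* zone_shift = 18 for 128MiB zones */
  }

  /* NPO2 zone size: the same lookup needs a 64-bit division. */
  static inline uint64_t zone_no_npo2(uint64_t sector, uint64_t zone_sectors)
  {
          return sector / zone_sectors;     /* zone_sectors = 196608 for 96MiB zones */
  }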

> The other solution would be adding a dm-unhole target to remap sectors
> to remove the holes from the device address space. Such target would be
> easy to write, but in my opinion, this would still not change the fact
> that applications still have to deal with error recovery and active/open
> zone resources. So they still have to be zone aware and operate per zone.

I don't think we even need a new target for it.  I think you can do
this with a table using multiple dm-linear sections already if you
want.
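As a rough sketch of what such a table could look like (the device name
and the 128MiB zone size / 96MiB zone capacity are assumptions here; as
Damien points out below, device-mapper currently rejects this for zoned
devices because the target zone size must match the device zone size):

  # 128MiB zone = 262144 sectors, 96MiB capacity = 196608 sectors.
  # Map only the writable part of each zone, skipping the 32MiB hole.
  # <logical start> <length> linear <device> <device offset>
  0      196608 linear /dev/nvme0n2 0
  196608 196608 linear /dev/nvme0n2 262144
  393216 196608 linear /dev/nvme0n2 524288
  ...
  # dmsetup create zns-nohole < table.txt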

> My answer to your last question ("Are we sure?") is thus: No. I am not
> sure this is a good idea. But as always, I would be happy to be proven
> wrong. So far, I have not seen any argument doing that.

Agreed. Supporting non-power of two sizes in the block layer is fairly
easy as shown by some of the patches seen in this series.  Supporting
them properly in the whole ecosystem is not trivial and will create a
long-term burden.  We could do that, but we'd rather have a really good
reason for it, and right now I don't see that.


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 17:38     ` Adam Manzanares
@ 2022-03-14  7:36       ` Christoph Hellwig
  0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-14  7:36 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Christoph Hellwig, Pankaj Raghav, Luis Chamberlain,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On Thu, Mar 10, 2022 at 05:38:35PM +0000, Adam Manzanares wrote:
> > Do these drives even support Zone Append?
> 
> Should it matter if the drives support append? SMR drives do not support append
> and they are considered zone block devices. Append seems to be an optimization
> for users that want higher concurrency per zone. One can also build concurrency
> by leveraging multiple zones simultaneously as well.

Not supporting it natively for SMR is a major pain.  Due to hard drives
being relatively slow the emulation is somewhat workable, but on SSDs
the serialization would completely kill performance.


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14  7:35                     ` Christoph Hellwig
@ 2022-03-14  7:45                       ` Damien Le Moal
  2022-03-14  7:58                         ` Christoph Hellwig
  2022-03-14 10:49                         ` Javier González
  0 siblings, 2 replies; 83+ messages in thread
From: Damien Le Moal @ 2022-03-14  7:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, Keith Busch, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/14/22 16:35, Christoph Hellwig wrote:
> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
>> The reason for the power of 2 requirement is 2 fold:
>> 1) At the time we added zone support for SMR, chunk_sectors had to be a
>> power of 2 number of sectors.
>> 2) SMR users did request power of 2 zone sizes and that all zones have
>> the same size as that simplified software design. There was even a
>> de-facto agreement that 256MB zone size is a good compromise between
>> usability and overhead of zone reclaim/GC. But that particular number is
>> for HDD due to their performance characteristics.
> 
> Also for NVMe we initially went down the road to try to support
> non power of two sizes.  But there was another major early host that
> really wanted the power of two zone sizes to support hardware based
> hosts that can cheaply do shifts but not divisions.  The variable
> zone capacity feature (something that Linux does not currently support)
> is a feature requested by NVMe members on the host and device side
> also can only be supported with the zone size / zone capacity split.
> 
>> The other solution would be adding a dm-unhole target to remap sectors
>> to remove the holes from the device address space. Such target would be
>> easy to write, but in my opinion, this would still not change the fact
>> that applications still have to deal with error recovery and active/open
>> zone resources. So they still have to be zone aware and operate per zone.
> 
> I don't think we even need a new target for it.  I think you can do
> this with a table using multiple dm-linear sections already if you
> want.

Nope, this is currently not possible: DM requires the target zone size
to be the same as the underlying device zone size. So that would not work.

> 
>> My answer to your last question ("Are we sure?") is thus: No. I am not
>> sure this is a good idea. But as always, I would be happy to be proven
>> wrong. So far, I have not seen any argument doing that.
> 
> Agreed. Supporting non-power of two sizes in the block layer is fairly
> easy as shown by some of the patches seen in this series.  Supporting
> them properly in the whole ecosystem is not trivial and will create a
> long-term burden.  We could do that, but we'd rather have a really good
> reason for it, and right now I don't see that.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14  7:45                       ` Damien Le Moal
@ 2022-03-14  7:58                         ` Christoph Hellwig
  2022-03-14 10:49                         ` Javier González
  1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-14  7:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Matias Bjørling,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On Mon, Mar 14, 2022 at 04:45:12PM +0900, Damien Le Moal wrote:
> Nope, this is currently not possible: DM requires the target zone size
> to be the same as the underlying device zone size. So that would not work.

Indeed.


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-12  7:58                   ` Damien Le Moal
  2022-03-14  7:35                     ` Christoph Hellwig
@ 2022-03-14  8:36                     ` Matias Bjørling
  1 sibling, 0 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-14  8:36 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Keith Busch
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

> 
> Furthermore, adding such DM target would create a non power of 2 zone size
> zoned device which will need support from the block layer. So some block layer
> functions will need to change. In the end, this may not be different than
> enabling non power of 2 zone sized devices for ZNS.
> 
> And for this decision, I maintain some of my requirements:
> 1) The added overhead from multiplication & divisions should be acceptable
> and not degrade performance. Otherwise, this would be a disservice to the
> zone ecosystem.
> 2) Nothing that works today on available devices should break
> 3) Zone size requirements will still exist. E.g. btrfs 64K alignment requirement
> 

Adding to the existing points that have been made.

I believe it hasn't been mentioned that for non-power of 2 zone sizes, holes are still allowed due to zones being/becoming offline. The offline zone state supports neither writes nor reads, and applications must be aware and work around such holes in the address space. 

Furthermore, the specification doesn't allow writes to cross zones - so while a read may cross a zone boundary, writes must always be split at zone boundaries.

As a result, applications must work with zones independently and can't assume that they can write into the adjacent zone or across two zones.

Best, Matias




* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14  7:45                       ` Damien Le Moal
  2022-03-14  7:58                         ` Christoph Hellwig
@ 2022-03-14 10:49                         ` Javier González
  2022-03-14 14:16                           ` Matias Bjørling
  1 sibling, 1 reply; 83+ messages in thread
From: Javier González @ 2022-03-14 10:49 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 14.03.2022 16:45, Damien Le Moal wrote:
>On 3/14/22 16:35, Christoph Hellwig wrote:
>> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
>>> The reason for the power of 2 requirement is 2 fold:
>>> 1) At the time we added zone support for SMR, chunk_sectors had to be a
>>> power of 2 number of sectors.
>>> 2) SMR users did request power of 2 zone sizes and that all zones have
>>> the same size as that simplified software design. There was even a
>>> de-facto agreement that 256MB zone size is a good compromise between
>>> usability and overhead of zone reclaim/GC. But that particular number is
>>> for HDD due to their performance characteristics.
>>
>> Also for NVMe we initially went down the road to try to support
>> non power of two sizes.  But there was another major early host that
>> really wanted the power of two zone sizes to support hardware based
>> hosts that can cheaply do shifts but not divisions.  The variable
>> zone capacity feature (something that Linux does not currently support)
>> is a feature requested by NVMe members on the host and device side
>> also can only be supported with the zone size / zone capacity split.
>>
>>> The other solution would be adding a dm-unhole target to remap sectors
>>> to remove the holes from the device address space. Such target would be
>>> easy to write, but in my opinion, this would still not change the fact
>>> that applications still have to deal with error recovery and active/open
>>> zone resources. So they still have to be zone aware and operate per zone.
>>
>> I don't think we even need a new target for it.  I think you can do
>> this with a table using multiple dm-linear sections already if you
>> want.
>
>Nope, this is currently not possible: DM requires the target zone size
>to be the same as the underlying device zone size. So that would not work.
>
>>
>>> My answer to your last question ("Are we sure?") is thus: No. I am not
>>> sure this is a good idea. But as always, I would be happy to be proven
>>> wrong. So far, I have not seen any argument doing that.
>>
>> Agreed. Supporting non-power of two sizes in the block layer is fairly
>> easy as shown by some of the patches seen in this series.  Supporting
>> them properly in the whole ecosystem is not trivial and will create a
>> long-term burden.  We could do that, but we'd rather have a really good
>> reason for it, and right now I don't see that.

I think that Bo's use-case is an example of a major upstream Linux host
that is struggling with unmapped LBAs. Can we focus on this use-case
and the parts that we are missing to support Bytedance?

If you agree to this, I believe we can add support for ZoneFS pretty
easily. We also have a POC in btrfs that we will follow up on. For the time
being, F2FS would fail at mkfs time if zone size is not a PO2.

What do you think?


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 10:49                         ` Javier González
@ 2022-03-14 14:16                           ` Matias Bjørling
  2022-03-14 16:23                             ` Luis Chamberlain
  2022-03-14 19:55                             ` Javier González
  0 siblings, 2 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-14 14:16 UTC (permalink / raw)
  To: Javier González, Damien Le Moal
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

> >> Agreed. Supporting non-power of two sizes in the block layer is
> >> fairly easy as shown by some of the patches seen in this series.
> >> Supporting them properly in the whole ecosystem is not trivial and
> >> will create a long-term burden.  We could do that, but we'd rather
> >> have a really good reason for it, and right now I don't see that.
> 
> I think that Bo's use-case is an example of a major upstream Linux host that is
> struggling with unmapped LBAs. Can we focus on this use-case and the parts
> that we are missing to support Bytedance?

Any application that uses zoned storage devices would have to manage unmapped LBAs due to the potential of zones being/becoming offline (no reads/writes allowed). Eliminating the difference between zone cap and zone size will not remove this requirement, and holes will continue to exist. Furthermore, writing to LBAs across zones is not allowed by the specification and must also be managed.

Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without having per-zone knowledge. An application would have to know about zones and their writeable capacity. To decide where and how data is written, an application must manage writing across zones, specific offline zones, and (currently) its writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement.
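For what it's worth, a minimal user-space sketch of pulling those per-zone
attributes and deriving each zone's writable range (this is an illustration,
not code from this thread; it assumes a kernel that reports zone capacity
via BLK_ZONE_REP_CAPACITY):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/blkzoned.h>

  int main(int argc, char **argv)
  {
          unsigned int i, nr = 16;
          struct blk_zone_report *rep;
          int fd;

          if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                  return 1;

          /* Report the first 16 zones starting at sector 0. */
          rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
          rep->sector = 0;
          rep->nr_zones = nr;
          if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
                  perror("BLKREPORTZONE");
                  return 1;
          }

          for (i = 0; i < rep->nr_zones; i++) {
                  struct blk_zone *z = &rep->zones[i];
                  unsigned long long cap = (rep->flags & BLK_ZONE_REP_CAPACITY) ?
                                           z->capacity : z->len;

                  /* Read-only/offline zones are holes for writes, as is
                   * the len - capacity gap at the tail of a zone. */
                  if (z->cond == BLK_ZONE_COND_READONLY ||
                      z->cond == BLK_ZONE_COND_OFFLINE)
                          cap = 0;

                  printf("zone %u: sectors [%llu, %llu), writable up to %llu\n",
                         i, (unsigned long long)z->start,
                         (unsigned long long)(z->start + z->len),
                         (unsigned long long)(z->start + cap));
          }
          return 0;
  }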

For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But those implementors did that knowing that the Linux kernel didn't support it.

I want to turn the argument around to see it from the kernel developers' point of view. They have communicated the PO2 requirement clearly, there's good precedent for working with PO2 zone sizes, and lastly, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developers take on the long-term maintenance burden of NPO2 zone sizes?


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 14:16                           ` Matias Bjørling
@ 2022-03-14 16:23                             ` Luis Chamberlain
  2022-03-14 19:30                               ` Matias Bjørling
  2022-03-14 19:55                             ` Javier González
  1 sibling, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-14 16:23 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Javier González, Damien Le Moal, Christoph Hellwig,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> I want to turn the argument around to see it from the kernel
> developer's point of view. They have communicated the PO2 requirement
> clearly,

Such a requirement is based on history and on the effort put in place
assuming PO2 zone sizes for zoned storage, and clearly it is not an
inherent one. And clearly even vendors who have embraced PO2 don't know
for sure they'll always be able to stick to PO2...

> there's good precedence working with PO2 zone sizes, and at
> last, holes can't be avoided and are part of the overall design of
> zoned storage devices. So why should the kernel developer's take on
> the long-term maintenance burden of NPO2 zone sizes?

I think the better question to address here is:

Do we *not* want to support NPO2 zone sizes in Linux out of principle?

If we *are* open to support NPO2 zone sizes, what path should we take to
incur the least pain and fragmentation?

Emulation was a path being considered, and I think at this point the
answer to evaluating that path is: this is cumbersome, probably not.

The next question then is: are we open to evaluate what it looks like
to slowly shave off the PO2 requirement in different layers, with a
goal to avoid further fragmentation? There is effort on evaluating that
path and it doesn't seem to be that bad.

So I'd advise to evaluate that, there is nothing to lose other than
awareness of what that path might look like.

Unless of course we already have a clear path forward for NPO2 we can
all agree on.

  Luis


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 16:23                             ` Luis Chamberlain
@ 2022-03-14 19:30                               ` Matias Bjørling
  2022-03-14 19:51                                 ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Matias Bjørling @ 2022-03-14 19:30 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Javier González, Damien Le Moal, Christoph Hellwig,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

> -----Original Message-----
> From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis
> Chamberlain
> Sent: Monday, 14 March 2022 17.24
> To: Matias Bjørling <Matias.Bjorling@wdc.com>
> Cc: Javier González <javier@javigon.com>; Damien Le Moal
> <damien.lemoal@opensource.wdc.com>; Christoph Hellwig <hch@lst.de>;
> Keith Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>;
> Adam Manzanares <a.manzanares@samsung.com>;
> jiangbo.365@bytedance.com; kanchan Joshi <joshi.k@samsung.com>; Jens
> Axboe <axboe@kernel.dk>; Sagi Grimberg <sagi@grimberg.me>; Pankaj
> Raghav <pankydev8@gmail.com>; Kanchan Joshi <joshiiitr@gmail.com>; linux-
> block@vger.kernel.org; linux-nvme@lists.infradead.org
> Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
> 
> On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> > I want to turn the argument around to see it from the kernel
> > developer's point of view. They have communicated the PO2 requirement
> > clearly,
> 
> Such requirement is based on history and effort put in place to assume a PO2
> requirement for zone storage, and clearly it is not. And clearly even vendors
> who have embraced PO2 don't know for sure they'll always be able to stick to
> PO2...

Sure - It'll be naïve to give a carte blanche promise.

However, you're skipping the next two elements, which state that there is both good precedent for working with PO2 zone sizes and that holes/unmapped LBAs can't be avoided. That makes an argument for why NPO2 zone sizes may not bring what one is looking for. It's a lot of work for little practical change, if any.

> 
> > there's good precedence working with PO2 zone sizes, and at last,
> > holes can't be avoided and are part of the overall design of zoned
> > storage devices. So why should the kernel developer's take on the
> > long-term maintenance burden of NPO2 zone sizes?
> 
> I think the better question to address here is:
> 
> Do we *not* want to support NPO2 zone sizes in Linux out of principle?
> 
> If we *are* open to support NPO2 zone sizes, what path should we take to
> incur the least pain and fragmentation?
> 
> Emulation was a path being considered, and I think at this point the answer to
> evaluating that path is: this is cumbersome, probably not.
> 
> The next question then is: are we open to evaluate what it looks like to slowly
> shave off the PO2 requirement in different layers, with a goal to avoid further
> fragmentation? There is effort on evaluating that path and it doesn't seem to
> be that bad.
> 
> So I'd advise to evaluate that, there is nothing to lose other than awareness of
> what that path might look like.
> 
> Unless of course we already have a clear path forward for NPO2 we can all
> agree on.

It looks like there isn't currently one that can be agreed upon.

If evaluating different approaches, it would be helpful to the reviewers if the interfaces and all of their kernel users were converted in a single patchset. This would also help to avoid users getting hit by what is and isn't supported by a particular device implementation, and it would allow the full set of changes required to add the support to be reviewed as a whole.



* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 19:30                               ` Matias Bjørling
@ 2022-03-14 19:51                                 ` Luis Chamberlain
  2022-03-15 10:45                                   ` Matias Bjørling
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-14 19:51 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Javier González, Damien Le Moal, Christoph Hellwig,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Mon, Mar 14, 2022 at 07:30:25PM +0000, Matias Bjørling wrote:
> > -----Original Message-----
> > From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis
> > Chamberlain
> > Sent: Monday, 14 March 2022 17.24
> > To: Matias Bjørling <Matias.Bjorling@wdc.com>
> > Cc: Javier González <javier@javigon.com>; Damien Le Moal
> > <damien.lemoal@opensource.wdc.com>; Christoph Hellwig <hch@lst.de>;
> > Keith Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>;
> > Adam Manzanares <a.manzanares@samsung.com>;
> > jiangbo.365@bytedance.com; kanchan Joshi <joshi.k@samsung.com>; Jens
> > Axboe <axboe@kernel.dk>; Sagi Grimberg <sagi@grimberg.me>; Pankaj
> > Raghav <pankydev8@gmail.com>; Kanchan Joshi <joshiiitr@gmail.com>; linux-
> > block@vger.kernel.org; linux-nvme@lists.infradead.org
> > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
> > 
> > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> > > I want to turn the argument around to see it from the kernel
> > > developer's point of view. They have communicated the PO2 requirement
> > > clearly,
> > 
> > Such requirement is based on history and effort put in place to assume a PO2
> > requirement for zone storage, and clearly it is not. And clearly even vendors
> > who have embraced PO2 don't know for sure they'll always be able to stick to
> > PO2...
> 
> Sure - It'll be naïve to give a carte blanche promise.

Exactly. So taking a position to not support NPO2 I think seems counter
productive to the future of ZNS; the question should be *how* to best
do this in light of what we need to support / avoid performance
regressions / strive towards avoiding fragmentation.

> However, you're skipping the next two elements, which state that there
> are both good precedence working with PO2 zone sizes and that
> holes/unmapped LBAs can't be avoided.

I'm not, but I admit that it's a good point that the possibility of
zones being taken offline also implies holes. I also think it was a
good exercise to discuss and evaluate emulation given I don't think
this point you made would have been made clear otherwise. This is why
I treat ZNS as an evolving effort, and I can't seriously take any position
stating all answers are known.

> Making an argument for why NPO2
> zone sizes may not bring what one is looking for. It's a lot of work
> for little practical change, if any. 

NAND does not incur a PO2 requirement; that should be enough to
imply that NPO2 zones *can* be expected. If no vendor is willing
to take the position that they know for a fact they'll never adopt
NPO2 zones, that should be enough to keep an open mind and consider *how*
to support them.

> > > there's good precedence working with PO2 zone sizes, and at last,
> > > holes can't be avoided and are part of the overall design of zoned
> > > storage devices. So why should the kernel developer's take on the
> > > long-term maintenance burden of NPO2 zone sizes?
> > 
> > I think the better question to address here is:
> > 
> > Do we *not* want to support NPO2 zone sizes in Linux out of principle?
> > 
> > If we *are* open to support NPO2 zone sizes, what path should we take to
> > incur the least pain and fragmentation?
> > 
> > Emulation was a path being considered, and I think at this point the answer to
> > evaluating that path is: this is cumbersome, probably not.
> > 
> > The next question then is: are we open to evaluate what it looks like to slowly
> > shave off the PO2 requirement in different layers, with a goal to avoid further
> > fragmentation? There is effort on evaluating that path and it doesn't seem to
> > be that bad.
> > 
> > So I'd advise to evaluate that, there is nothing to lose other than awareness of
> > what that path might look like.
> > 
> > Unless of course we already have a clear path forward for NPO2 we can all
> > agree on.
> 
> It looks like there isn't currently one that can be agreed upon.

I'm not quite sure that is the case. To reach consensus one has
to accept that the right answer may not be known yet and to evaluate
all prospects. It is not clear to me that we've done that yet, which
is why I think a venue such as LSFMM may be a good place to
review these things.

> If evaluating different approaches, it would be helpful to the
> reviewers if interfaces and all of its kernel users are converted in a
> single patchset. This would also help to avoid users getting hit by
> what is supported, and what isn't supported by a particular device
> implementation and allow better to review the full set of changes
> required to add the support.

Sorry I didn't understand the suggestion here, can you clarify what it
is you are suggesting?

Thanks!

  Luis


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 14:16                           ` Matias Bjørling
  2022-03-14 16:23                             ` Luis Chamberlain
@ 2022-03-14 19:55                             ` Javier González
  2022-03-15 12:32                               ` Matias Bjørling
  1 sibling, 1 reply; 83+ messages in thread
From: Javier González @ 2022-03-14 19:55 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 14.03.2022 14:16, Matias Bjørling wrote:
>> >> Agreed. Supporting non-power of two sizes in the block layer is
>> >> fairly easy as shown by some of the patches seen in this series.
>> >> Supporting them properly in the whole ecosystem is not trivial and
>> >> will create a long-term burden.  We could do that, but we'd rather
>> >> have a really good reason for it, and right now I don't see that.
>>
>> I think that Bo's use-case is an example of a major upstream Linux host that is
>> struggling with unmapped LBAs. Can we focus on this use-case and the parts
>> that we are missing to support Bytedance?
>
>Any application that uses zoned storage devices would have to manage
>unmapped LBAs due to the potential of zones being/becoming offline (no
>reads/writes allowed). Eliminating the difference between zone cap and
>zone size will not remove this requirement, and holes will continue to
>exist. Furthermore, writing to LBAs across zones is not allowed by the
>specification and must also be managed.
>
>Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without having per-zone knowledge. An application would have to know about zones and their writeable capacity. To decide where and how data is written, an application must manage writing across zones, specific offline zones, and (currently) its writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement.

Supporting offline zones is optional in the ZNS spec? We are not
considering supporting this in the host. This will be handled by the
device, precisely to keep the SW stack simpler.
>
>For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But its implementors knowingly did that while knowing that the Linux kernel didn't support it.
>
>I want to turn the argument around to see it from the kernel developer's point of view. They have communicated the PO2 requirement clearly, there's good precedence working with PO2 zone sizes, and at last, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developer's take on the long-term maintenance burden of NPO2 zone sizes?

You have a good point, and that is the question we need to help answer.
As I see it, requirements evolve and the kernel changes with it as long
as there are active upstream users for it.

The main constraint for PO2 is removed in the block layer, we have Linux
hosts stating that unmapped LBAs are a problem, and we have HW
supporting size=capacity.

I would be happy to hear what else you would like to see for this to be
of use to the kernel community.



* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 19:51                                 ` Luis Chamberlain
@ 2022-03-15 10:45                                   ` Matias Bjørling
  0 siblings, 0 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 10:45 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Javier González, Damien Le Moal, Christoph Hellwig,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

> > > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> > > > I want to turn the argument around to see it from the kernel
> > > > developer's point of view. They have communicated the PO2
> > > > requirement clearly,
> > >
> > > Such requirement is based on history and effort put in place to
> > > assume a PO2 requirement for zone storage, and clearly it is not.
> > > And clearly even vendors who have embraced PO2 don't know for sure
> > > they'll always be able to stick to PO2...
> >
> > Sure - It'll be naïve to give a carte blanche promise.
> 
> Exactly. So taking a position to not support NPO2 I think seems counter
> productive to the future of ZNS, the question should be, *how* to best do this
> in light of what we need to support / avoid performance regressions / strive
> towards avoiding fragmentation.

Having non-power of two zone sizes is a deviation from the existing devices being used in full production today. That there is a wish to introduce support for such drives is interesting, but consider the background and development of zoned devices: Damien mentioned that SMR HDDs didn't start off with PO2 zone sizes - that is what became the norm due to its overall benefits. I.e., drives with NPO2 zone sizes are the odd ones and, in some views, the ones creating fragmentation.

That there is a wish to revisit that design decision is fair, and it sounds like there is willingness to explore such options. But please be advised that the Linux community has communicated the specific requirement for a long time to avoid this particular issue. Thus, the community has been trying to help the vendors make the appropriate design decisions, such that they could take advantage of the Linux kernel stack from day one.

> > However, you're skipping the next two elements, which state that there
> > are both good precedence working with PO2 zone sizes and that
> > holes/unmapped LBAs can't be avoided.
> 
> I'm not, but I admit that it's a good point of having the possibility of zones being
> taken offline also implicates holes. I also think it was a good excercise to
> discuss and evaluate emulation given I don't think this point you made would
> have been made clear otherwise. This is why I treat ZNS as evolving effort, and
> I can't seriously take any position stating all answers are known.

That's good to hear. I would note that some members in this thread have been doing zoned storage for close to a decade, and have a very thorough understanding of the zoned storage model - so it might be a stretch for them to hear that you consider everything to be up in the air and at an early stage. This stack is already being used by a large percentage of the bits being shipped in the world. Thus, there is an interest in maintaining these things and making sure that things don't regress, and so on.

> 
> > Making an argument for why NPO2
> > zone sizes may not bring what one is looking for. It's a lot of work
> > for little practical change, if any.
> 
> NAND does not incur a PO2 requirement; that should be enough to imply
> that NPO2 zones *can* be expected. If no vendor is willing to take the position that
> they know for a fact they'll never adopt
> NPO2 zones, that should be enough to keep an open mind and consider *how* to
> support them.

As long as it doesn't also imply that support *has* to be added to the kernel, then that's okay.

<snip>
> 
> > If evaluating different approaches, it would be helpful to the
> > reviewers if interfaces and all of its kernel users are converted in a
> > single patchset. This would also help to avoid users getting hit by
> > what is supported, and what isn't supported by a particular device
> > implementation and allow better to review the full set of changes
> > required to add the support.
> 
> Sorry I didn't understand the suggestion here, can you clarify what it is you are
> suggesting?

It would help reviewers if a potential patchset converted all users (e.g., f2fs, btrfs, device mappers, io schedulers, etc.), such that the full effect can be evaluated, with the added benefit that end-users would not have to think about what is and what isn't supported.


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 19:55                             ` Javier González
@ 2022-03-15 12:32                               ` Matias Bjørling
  2022-03-15 13:05                                 ` Javier González
  0 siblings, 1 reply; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 12:32 UTC (permalink / raw)
  To: Javier González
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

> >Given the above, applications have to be conscious of zones in general and
> work within their boundaries. I don't understand how applications can work
> without having per-zone knowledge. An application would have to know about
> zones and their writeable capacity. To decide where and how data is written,
> an application must manage writing across zones, specific offline zones, and
> (currently) its writeable capacity. I.e., knowledge about zones and holes is
> required for writing to zoned devices and isn't eliminated by removing the PO2
> zone size requirement.
> 
> Supporting offlines zones is optional in the ZNS spec? We are not considering
> supporting this in the host. This will be handled by the device for exactly
> maintaining the SW stack simpler.

It isn't optional. The spec allows any zone to go to the Read Only or Offline state at any point in time. A specific implementation might give some guarantees as to when such transitions happen, but it must nevertheless be managed by the host software.

Given that, and the need to not issue writes that span zones, an application would have to be aware of such behaviors. The information to make those decisions is in a zone's attributes, and since an application would pull those anyway, it would also know the writeable capacity of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design.
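As a rough illustration of the "writes must not span zones" point (an
assumed host-side helper, not something from this thread), the clamp is
the same for PO2 and NPO2 zone sizes; only the cost of locating the
boundary differs (mask vs. division):

  #include <stdint.h>

  /* Trim an I/O of nr_sectors starting at 'sector' so that it does not
   * cross the next zone boundary. */
  static uint64_t sectors_to_boundary(uint64_t sector, uint64_t nr_sectors,
                                      uint64_t zone_sectors)
  {
          uint64_t zone_end = (sector / zone_sectors + 1) * zone_sectors;

          if (sector + nr_sectors > zone_end)
                  nr_sectors = zone_end - sector;
          return nr_sectors;
  }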

> >
> >For years, the PO2 requirement has been known in the Linux community and
> by the ZNS SSD vendors. Some SSD implementors have chosen not to support
> PO2 zone sizes, which is a perfectly valid decision. But its implementors
> knowingly did that while knowing that the Linux kernel didn't support it.
> >
> >I want to turn the argument around to see it from the kernel developer's point
> of view. They have communicated the PO2 requirement clearly, there's good
> precedence working with PO2 zone sizes, and at last, holes can't be avoided
> and are part of the overall design of zoned storage devices. So why should the
> kernel developer's take on the long-term maintenance burden of NPO2 zone
> sizes?
> 
> You have a good point, and that is the question we need to help answer.
> As I see it, requirements evolve and the kernel changes with it as long as there
> are active upstream users for it.

True. There are also active users of custom SSDs (e.g., requiring writes larger than 4KiB) - but those aren't supported by the Linux kernel and aren't actively being worked on to my knowledge. Which is fine, as those customers use them in their own way anyway, and don't need the Linux kernel support.
 
> 
> The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts
> stating that unmapped LBAs are a problem, and we have (3) HW supporting
> size=capacity.
> 
> I would be happy to hear what else you would like to see for this to be of use to
> the kernel community.

(Added numbers to your paragraph above)

1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers now can use a NPO2 chunk size. This wasn't meant to naturally extend to zones, which as shown in the current posted patchset, is a lot more work.
2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. Thus, the software in question is already capable of working with holes, and fixing this would present itself as a minor optimization overall. I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications.
3. I'm happy to hear that. However, I'll like to reiterate the point that the PO2 requirement have been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation. 

All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc. then there isn't anything saying support can't be added - but it does seem like it’s a lot of work, for little overall benefits to applications and the host users.




* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 12:32                               ` Matias Bjørling
@ 2022-03-15 13:05                                 ` Javier González
  2022-03-15 13:14                                   ` Matias Bjørling
  2022-03-16  0:00                                   ` Damien Le Moal
  0 siblings, 2 replies; 83+ messages in thread
From: Javier González @ 2022-03-15 13:05 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 15.03.2022 12:32, Matias Bjørling wrote:
>> >Given the above, applications have to be conscious of zones in general and
>> work within their boundaries. I don't understand how applications can work
>> without having per-zone knowledge. An application would have to know about
>> zones and their writeable capacity. To decide where and how data is written,
>> an application must manage writing across zones, specific offline zones, and
>> (currently) its writeable capacity. I.e., knowledge about zones and holes is
>> required for writing to zoned devices and isn't eliminated by removing the PO2
>> zone size requirement.
>>
>> Supporting offlines zones is optional in the ZNS spec? We are not considering
>> supporting this in the host. This will be handled by the device for exactly
>> maintaining the SW stack simpler.
>
>It isn't optional. The spec allows any zones to go to Read Only or Offline state at any point in time. A specific implementation might give some guarantees to when such transitions happens, but it must nevertheless must be managed by the host software.
>
>Given that, and the need to not issue writes that spans zones, an application would have to aware of such behaviors. The information to make those decisions are in a zone's attributes, and thus applications would pull those, it would also know the writeable capability of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design.

Thanks for the clarification. I can attest that we are giving that
guarantee in order to simplify the host stack. I believe we are making many
assumptions in Linux too to simplify ZNS support.

This said, I understand your point. I am not developing application
support. I will refer again to Bo's response on the use case where
holes are problematic.


>
>> >
>> >For years, the PO2 requirement has been known in the Linux community and
>> by the ZNS SSD vendors. Some SSD implementors have chosen not to support
>> PO2 zone sizes, which is a perfectly valid decision. But its implementors
>> knowingly did that while knowing that the Linux kernel didn't support it.
>> >
>> >I want to turn the argument around to see it from the kernel developer's point
>> of view. They have communicated the PO2 requirement clearly, there's good
>> precedence working with PO2 zone sizes, and at last, holes can't be avoided
>> and are part of the overall design of zoned storage devices. So why should the
>> kernel developer's take on the long-term maintenance burden of NPO2 zone
>> sizes?
>>
>> You have a good point, and that is the question we need to help answer.
>> As I see it, requirements evolve and the kernel changes with it as long as there
>> are active upstream users for it.
>
>True. There's also active users for SSDs which are custom (e.g., larger than 4KiB writes required) - but they aren't supported by the Linux kernel and isn't actively being worked on to my knowledge. Which is fine, as the customers anyway uses this in their own way, and don't need the Linux kernel support.

As things become stable some might choose to push support for certain
features into the kernel. In this case, the changes are not big in the
block layer. I believe it is a process and the features should be chosen
to maximize benefit and minimize maintenance cost.

>
>>
>> The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts
>> stating that unmapped LBAs are a problem, and we have (3) HW supporting
>> size=capacity.
>>
>> I would be happy to hear what else you would like to see for this to be of use to
>> the kernel community.
>
>(Added numbers to your paragraph above)
>
>1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers now can use a NPO2 chunk size. This wasn't meant to naturally extend to zones, which as shown in the current posted patchset, is a lot more work.

True. But this was the main constraint for PO2.

>2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. Thus, the software in question is already capable of working with holes. Thus, fixing this, would present itself as a minor optimization overall. I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications.

I will let Bo respond to this himself.

>3. I'm happy to hear that. However, I'll like to reiterate the point that the PO2 requirement have been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation.

Zoned devices have been supported for years with SMR, and this is a strong
argument. However, ZNS is still very new and customers have several
requirements. I do not believe that an HDD stack should have such an
impact on NVMe.

Also, we will see new interfaces adding support for zoned devices in the
future.

We should think about the future and not the past.


>
>All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc. then there isn't anything saying support can't be added - but it does seem like it’s a lot of work, for little overall benefits to applications and the host users.

Exactly.

Patches in the block layer are trivial. This is running in production
loads without issues. I have tried to highlight the benefits in previous
replies and I believe you understand them.

Support for ZoneFS seems easy too. We have an early POC for btrfs and it
seems it can be done. We sign up for these 2.

As for F2FS and dm-zoned, I do not think these are targets at the
moment. If this is the path we follow, these will bail out at mkfs time.

If we can agree on the above, I believe we can start with the code that
enables the existing customers and build support for btrfs and ZoneFS
in the next few months.

What do you think?



* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:05                                 ` Javier González
@ 2022-03-15 13:14                                   ` Matias Bjørling
  2022-03-15 13:26                                     ` Javier González
  2022-03-16  0:00                                   ` Damien Le Moal
  1 sibling, 1 reply; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 13:14 UTC (permalink / raw)
  To: Javier González
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

> >
> >All that said - if there are people willing to do the work and it doesn't have a
> negative impact on performance, code quality, maintenance complexity, etc.
> then there isn't anything saying support can't be added - but it does seem like
> it’s a lot of work, for little overall benefits to applications and the host users.
> 
> Exactly.
> 
> Patches in the block layer are trivial. This is running in production loads without
> issues. I have tried to highlight the benefits in previous benefits and I believe
> you understand them.
> 
> Support for ZoneFS seems easy too. We have an early POC for btrfs and it
> seems it can be done. We sign up for these 2.
> 
> As for F2FS and dm-zoned, I do not think these are targets at the moment. If
> this is the path we follow, these will bail out at mkfs time.
> 
> If we can agree on the above, I believe we can start with the code that enables
> the existing customers and build support for butrfs and ZoneFS in the next few
> months.
> 
> What do you think?

I would suggest doing it in a single shot, i.e., a single patchset which enables all the internal users in the kernel (including f2fs and others). That way end-users do not have to worry about the difference between PO2/NPO2 zones and it'll help reduce the burden of long-term maintenance.


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:14                                   ` Matias Bjørling
@ 2022-03-15 13:26                                     ` Javier González
  2022-03-15 13:30                                       ` Christoph Hellwig
  2022-03-15 13:39                                       ` Matias Bjørling
  0 siblings, 2 replies; 83+ messages in thread
From: Javier González @ 2022-03-15 13:26 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 15.03.2022 13:14, Matias Bjørling wrote:
>> >
>> >All that said - if there are people willing to do the work and it doesn't have a
>> negative impact on performance, code quality, maintenance complexity, etc.
>> then there isn't anything saying support can't be added - but it does seem like
>> it’s a lot of work, for little overall benefits to applications and the host users.
>>
>> Exactly.
>>
>> Patches in the block layer are trivial. This is running in production loads without
>> issues. I have tried to highlight the benefits in previous benefits and I believe
>> you understand them.
>>
>> Support for ZoneFS seems easy too. We have an early POC for btrfs and it
>> seems it can be done. We sign up for these 2.
>>
>> As for F2FS and dm-zoned, I do not think these are targets at the moment. If
>> this is the path we follow, these will bail out at mkfs time.
>>
>> If we can agree on the above, I believe we can start with the code that enables
>> the existing customers and build support for butrfs and ZoneFS in the next few
>> months.
>>
>> What do you think?
>
>I would suggest to do it in a single shot, i.e., a single patchset, which enables all the internal users in the kernel (including f2fs and others). That way end-users do not have to worry about the difference of PO2/NPO2 zones and it'll help reduce the burden on long-term maintenance.

Thanks for the suggestion Matias. Happy to see that you are open to
supporting this. I understand why a patch series fixing everything is attractive,
but we do not see a use case for ZNS in F2FS, as it is a mobile
file-system. As other interfaces arrive, this work will become natural.

ZoneFS and btrfs are good targets for ZNS and these we can do. I would
still do the work in phases to make sure we have enough early feedback
from the community.

Since this thread has been very active, I will wait some time for
Christoph and others to catch up before we start sending code.



* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:26                                     ` Javier González
@ 2022-03-15 13:30                                       ` Christoph Hellwig
  2022-03-15 13:52                                         ` Javier González
  2022-03-15 17:00                                         ` Luis Chamberlain
  2022-03-15 13:39                                       ` Matias Bjørling
  1 sibling, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-15 13:30 UTC (permalink / raw)
  To: Javier González
  Cc: Matias Bjørling, Damien Le Moal, Christoph Hellwig,
	Luis Chamberlain, Keith Busch, Pankaj Raghav, Adam Manzanares,
	jiangbo.365, kanchan Joshi, Jens Axboe, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> but we do not see a usage for ZNS in F2FS, as it is a mobile
> file-system. As other interfaces arrive, this work will become natural.
>
> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
> still do the work in phases to make sure we have enough early feedback
> from the community.
>
> Since this thread has been very active, I will wait some time for
> Christoph and others to catch up before we start sending code.

Can someone summarize where we stand?  Between the lack of quoting
from hell and overly long lines from corporate mail clients I've
mostly stopped reading this thread because it takes too much effort
to actually extract the information.


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:26                                     ` Javier González
  2022-03-15 13:30                                       ` Christoph Hellwig
@ 2022-03-15 13:39                                       ` Matias Bjørling
  1 sibling, 0 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 13:39 UTC (permalink / raw)
  To: Javier González
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

> -----Original Message-----
> From: Javier González <javier@javigon.com>
> Sent: Tuesday, 15 March 2022 14.26
> To: Matias Bjørling <Matias.Bjorling@wdc.com>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>; Christoph
> Hellwig <hch@lst.de>; Luis Chamberlain <mcgrof@kernel.org>; Keith Busch
> <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>; Adam
> Manzanares <a.manzanares@samsung.com>; jiangbo.365@bytedance.com;
> kanchan Joshi <joshi.k@samsung.com>; Jens Axboe <axboe@kernel.dk>; Sagi
> Grimberg <sagi@grimberg.me>; Pankaj Raghav <pankydev8@gmail.com>;
> Kanchan Joshi <joshiiitr@gmail.com>; linux-block@vger.kernel.org; linux-
> nvme@lists.infradead.org
> Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
> 
> On 15.03.2022 13:14, Matias Bjørling wrote:
> >> >
> >> >All that said - if there are people willing to do the work and it
> >> >doesn't have a
> >> negative impact on performance, code quality, maintenance complexity,
> etc.
> >> then there isn't anything saying support can't be added - but it does
> >> seem like it’s a lot of work, for little overall benefits to applications and the
> host users.
> >>
> >> Exactly.
> >>
> >> Patches in the block layer are trivial. This is running in production
> >> loads without issues. I have tried to highlight the benefits in
> >> previous benefits and I believe you understand them.
> >>
> >> Support for ZoneFS seems easy too. We have an early POC for btrfs and
> >> it seems it can be done. We sign up for these 2.
> >>
> >> As for F2FS and dm-zoned, I do not think these are targets at the
> >> moment. If this is the path we follow, these will bail out at mkfs time.
> >>
> >> If we can agree on the above, I believe we can start with the code
> >> that enables the existing customers and build support for btrfs and
> >> ZoneFS in the next few months.
> >>
> >> What do you think?
> >
> >I would suggest to do it in a single shot, i.e., a single patchset, which enables
> all the internal users in the kernel (including f2fs and others). That way end-
> users do not have to worry about the difference of PO2/NPO2 zones and it'll
> help reduce the burden on long-term maintenance.
> 
> Thanks for the suggestion Matias. Happy to see that you are open to support
> this. I understand why a patchseries fixing all is attractive, but we do not see a
> usage for ZNS in F2FS, as it is a mobile file-system. As other interfaces arrive,
> this work will become natural.

We've seen uptake of ZNS with f2fs, so I would argue that it's important to have support there as well.

> 
> ZoneFS and btrfs are good targets for ZNS and these we can do. I would still do
> the work in phases to make sure we have enough early feedback from the
> community.

Sure, continuous review is good. But not having support for all the kernel users creates fragmentation. Doing a full switch is greatly preferred, as it avoids this fragmentation and will also lower the overall maintenance burden, which was also raised as a concern.





* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:30                                       ` Christoph Hellwig
@ 2022-03-15 13:52                                         ` Javier González
  2022-03-15 14:03                                           ` Matias Bjørling
  2022-03-15 14:14                                           ` Johannes Thumshirn
  2022-03-15 17:00                                         ` Luis Chamberlain
  1 sibling, 2 replies; 83+ messages in thread
From: Javier González @ 2022-03-15 13:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 15.03.2022 14:30, Christoph Hellwig wrote:
>On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>> file-system. As other interfaces arrive, this work will become natural.
>>
>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>> still do the work in phases to make sure we have enough early feedback
>> from the community.
>>
>> Since this thread has been very active, I will wait some time for
>> Christoph and others to catch up before we start sending code.
>
>Can someone summarize where we stand?  Between the lack of quoting
>from hell and overly long lines from corporate mail clients I've
>mostly stopped reading this thread because it takes too much effort
>actually extract the information.

Let me give it a try:

  - PO2 emulation in NVMe is a no-go. Drop this.

  - The arguments against supporting NPO2 zone sizes are:
      - It makes ZNS depart from the SMR assumption of PO2 zone sizes.
        This can create confusion for users of both SMR and ZNS

      - Existing applications assume PO2 zone sizes, and probably do
        optimizations for these. These applications, if they want to use
        ZNS, will have to change their calculations

      - There is a fear of performance regressions.

      - It adds more work for you and other maintainers

  - The arguments in favour of supporting NPO2 zone sizes are:
      - Unmapped LBAs create holes that applications need to deal with.
        This affects mapping and performance due to splits. Bo explained
        this in a thread from Bytedance's perspective.  I explained in an
        answer to Matias how we are not letting zones transition to
        offline in order to simplify the host stack. Not sure if this is
        something we want to bring to NVMe.

      - As ZNS adds more features and other protocols add support for
        zoned devices, we will have more use-cases for the zoned block
        device. We will have to deal with this fragmentation at some
        point.

      - This is used in production workloads on Linux hosts. I would
        advocate for this not being off-tree, as it will be a headache
        for all in the future.

  - If you agree that removing PO2 is an option, we can do the following:
      - Remove the constraint in the block layer and add ZoneFS support
        in a first patch.

      - Add btrfs support in a later patch

      - Make changes to tools once merged

Hope I have collected all points of view in such a short format.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:52                                         ` Javier González
@ 2022-03-15 14:03                                           ` Matias Bjørling
  2022-03-15 14:14                                           ` Johannes Thumshirn
  1 sibling, 0 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 14:03 UTC (permalink / raw)
  To: Javier González, Christoph Hellwig
  Cc: Damien Le Moal, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

> -----Original Message-----
> From: Javier González <javier@javigon.com>
> Sent: Tuesday, 15 March 2022 14.53
> To: Christoph Hellwig <hch@lst.de>
> Cc: Matias Bjørling <Matias.Bjorling@wdc.com>; Damien Le Moal
> <damien.lemoal@opensource.wdc.com>; Luis Chamberlain
> <mcgrof@kernel.org>; Keith Busch <kbusch@kernel.org>; Pankaj Raghav
> <p.raghav@samsung.com>; Adam Manzanares
> <a.manzanares@samsung.com>; jiangbo.365@bytedance.com; kanchan Joshi
> <joshi.k@samsung.com>; Jens Axboe <axboe@kernel.dk>; Sagi Grimberg
> <sagi@grimberg.me>; Pankaj Raghav <pankydev8@gmail.com>; Kanchan Joshi
> <joshiiitr@gmail.com>; linux-block@vger.kernel.org; linux-
> nvme@lists.infradead.org
> Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
> 
> On 15.03.2022 14:30, Christoph Hellwig wrote:
> >On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> >> but we do not see a usage for ZNS in F2FS, as it is a mobile
> >> file-system. As other interfaces arrive, this work will become natural.
> >>
> >> ZoneFS and butrfs are good targets for ZNS and these we can do. I
> >> would still do the work in phases to make sure we have enough early
> >> feedback from the community.
> >>
> >> Since this thread has been very active, I will wait some time for
> >> Christoph and others to catch up before we start sending code.
> >
> >Can someone summarize where we stand?  Between the lack of quoting from
> >hell and overly long lines from corporate mail clients I've mostly
> >stopped reading this thread because it takes too much effort actually
> >extract the information.
> 
> Let me give it a try:
> 
>   - PO2 emulation in NVMe is a no-go. Drop this.
> 
>   - The arguments against supporting PO2 are:
>       - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
>         can create confusion for users of both SMR and ZNS
> 
>       - Existing applications assume PO2 zone sizes, and probably do
>         optimizations for these. These applications, if wanting to use
>         ZNS will have to change the calculations
> 
>       - There is a fear for performance regressions.
> 
>       - It adds more work to you and other maintainers
> 
>   - The arguments in favour of PO2 are:
>       - Unmapped LBAs create holes that applications need to deal with.
>         This affects mapping and performance due to splits. Bo explained
>         this in a thread from Bytedance's perspective.  I explained in an
>         answer to Matias how we are not letting zones transition to
>         offline in order to simplify the host stack. Not sure if this is
>         something we want to bring to NVMe.
> 
>       - As ZNS adds more features and other protocols add support for
>         zoned devices we will have more use-cases for the zoned block
>         device. We will have to deal with these fragmentation at some
>         point.
> 
>       - This is used in production workloads in Linux hosts. I would
>         advocate for this not being off-tree as it will be a headache for
>         all in the future.
> 
>   - If you agree that removing PO2 is an option, we can do the following:
>       - Remove the constraint in the block layer and add ZoneFS support
>         in a first patch.
> 
>       - Add btrfs support in a later patch
> 
>       - Make changes to tools once merged
> 
> Hope I have collected all points of view in such a short format.

+ Suggestion to enable all users in the kernel in order to limit fragmentation and maintainer burden.
+ Possibly not a big issue, as users have already added the necessary support and must already manage offline zones and avoid writing across zones.
+ Re: Bo's email, it sounds like this only affects a single vendor, which knowingly made the decision to use NPO2 zone sizes. From Bo: "(What we discussed here has a precondition that is, we cannot determine if the SSD provider could change the FW to make it PO2 or not)").

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:52                                         ` Javier González
  2022-03-15 14:03                                           ` Matias Bjørling
@ 2022-03-15 14:14                                           ` Johannes Thumshirn
  2022-03-15 14:27                                             ` David Sterba
                                                               ` (2 more replies)
  1 sibling, 3 replies; 83+ messages in thread
From: Johannes Thumshirn @ 2022-03-15 14:14 UTC (permalink / raw)
  To: Javier González, Christoph Hellwig
  Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme,
	linux-btrfs @ vger . kernel . org

On 15/03/2022 14:52, Javier González wrote:
> On 15.03.2022 14:30, Christoph Hellwig wrote:
>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>> file-system. As other interfaces arrive, this work will become natural.
>>>
>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>>> still do the work in phases to make sure we have enough early feedback
>>> from the community.
>>>
>>> Since this thread has been very active, I will wait some time for
>>> Christoph and others to catch up before we start sending code.
>>
>> Can someone summarize where we stand?  Between the lack of quoting
>>from hell and overly long lines from corporate mail clients I've
>> mostly stopped reading this thread because it takes too much effort
>> actually extract the information.
> 
> Let me give it a try:
> 
>   - PO2 emulation in NVMe is a no-go. Drop this.
> 
>   - The arguments against supporting PO2 are:
>       - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
>         can create confusion for users of both SMR and ZNS
> 
>       - Existing applications assume PO2 zone sizes, and probably do
>         optimizations for these. These applications, if wanting to use
>         ZNS will have to change the calculations
> 
>       - There is a fear for performance regressions.
> 
>       - It adds more work to you and other maintainers
> 
>   - The arguments in favour of PO2 are:
>       - Unmapped LBAs create holes that applications need to deal with.
>         This affects mapping and performance due to splits. Bo explained
>         this in a thread from Bytedance's perspective.  I explained in an
>         answer to Matias how we are not letting zones transition to
>         offline in order to simplify the host stack. Not sure if this is
>         something we want to bring to NVMe.
> 
>       - As ZNS adds more features and other protocols add support for
>         zoned devices we will have more use-cases for the zoned block
>         device. We will have to deal with these fragmentation at some
>         point.
> 
>       - This is used in production workloads in Linux hosts. I would
>         advocate for this not being off-tree as it will be a headache for
>         all in the future.
> 
>   - If you agree that removing PO2 is an option, we can do the following:
>       - Remove the constraint in the block layer and add ZoneFS support
>         in a first patch.
> 
>       - Add btrfs support in a later patch

(+ linux-btrfs )

Please also make sure to support btrfs and not only throw some patches
over the fence. Zoned device support in btrfs is complex enough and has
quite some special casing vs regular btrfs, which we're working on getting
rid of. So having non-power-of-2 zone sizes would also mean having NPO2
block-groups (and thus block-groups not aligned to the stripe size).

Just thinking of this and knowing I need to support it gives me a 
headache.

Also please consult the rest of the btrfs developers for thoughts on this.
After all btrfs has full zoned support (including ZNS, not saying it's 
perfect) and is also the default FS for at least two Linux distributions.

Thanks a lot,
	Johannes

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 14:14                                           ` Johannes Thumshirn
@ 2022-03-15 14:27                                             ` David Sterba
  2022-03-15 19:56                                               ` Pankaj Raghav
  2022-03-15 15:11                                             ` Javier González
  2022-03-15 18:51                                             ` Pankaj Raghav
  2 siblings, 1 reply; 83+ messages in thread
From: David Sterba @ 2022-03-15 14:27 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Javier González, Christoph Hellwig, Matias Bjørling,
	Damien Le Moal, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme, linux-btrfs @ vger . kernel . org

On Tue, Mar 15, 2022 at 02:14:23PM +0000, Johannes Thumshirn wrote:
> On 15/03/2022 14:52, Javier González wrote:
> > On 15.03.2022 14:30, Christoph Hellwig wrote:
> >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> >>> but we do not see a usage for ZNS in F2FS, as it is a mobile
> >>> file-system. As other interfaces arrive, this work will become natural.
> >>>
> >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
> >>> still do the work in phases to make sure we have enough early feedback
> >>> from the community.
> >>>
> >>> Since this thread has been very active, I will wait some time for
> >>> Christoph and others to catch up before we start sending code.
> >>
> >> Can someone summarize where we stand?  Between the lack of quoting
> >> from hell and overly long lines from corporate mail clients I've
> >> mostly stopped reading this thread because it takes too much effort
> >> actually extract the information.
> > 
> > Let me give it a try:
> > 
> >   - PO2 emulation in NVMe is a no-go. Drop this.
> > 
> >   - The arguments against supporting PO2 are:
> >       - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
> >         can create confusion for users of both SMR and ZNS
> > 
> >       - Existing applications assume PO2 zone sizes, and probably do
> >         optimizations for these. These applications, if wanting to use
> >         ZNS will have to change the calculations
> > 
> >       - There is a fear for performance regressions.
> > 
> >       - It adds more work to you and other maintainers
> > 
> >   - The arguments in favour of PO2 are:
> >       - Unmapped LBAs create holes that applications need to deal with.
> >         This affects mapping and performance due to splits. Bo explained
> >         this in a thread from Bytedance's perspective.  I explained in an
> >         answer to Matias how we are not letting zones transition to
> >         offline in order to simplify the host stack. Not sure if this is
> >         something we want to bring to NVMe.
> > 
> >       - As ZNS adds more features and other protocols add support for
> >         zoned devices we will have more use-cases for the zoned block
> >         device. We will have to deal with these fragmentation at some
> >         point.
> > 
> >       - This is used in production workloads in Linux hosts. I would
> >         advocate for this not being off-tree as it will be a headache for
> >         all in the future.
> > 
> >   - If you agree that removing PO2 is an option, we can do the following:
> >       - Remove the constraint in the block layer and add ZoneFS support
> >         in a first patch.
> > 
> >       - Add btrfs support in a later patch
> 
> (+ linux-btrfs )
> 
> Please also make sure to support btrfs and not only throw some patches 
> over the fence. Zoned device support in btrfs is complex enough and has 
> quite some special casing vs regular btrfs, which we're working on getting
> rid of. So having non-power-of-2 zone size, would also mean having NPO2
> block-groups (and thus block-groups not aligned to the stripe size).
> 
> Just thinking of this and knowing I need to support it gives me a 
> headache.

PO2 is really easy to work with, and I guess allocation on the physical
device could also benefit from that; I'm still puzzled why NPO2 is
even proposed.

We can possibly hide the calculations behind some API, so I hope in the
end it should be bearable. The size of block groups is flexible; we only
want some reasonable alignment.

> Also please consult the rest of the btrfs developers for thoughts on this.
> After all btrfs has full zoned support (including ZNS, not saying it's 
> perfect) and is also the default FS for at least two Linux distributions.

I haven't read the whole thread yet, but my impression is that some
hardware is deliberately breaking existing assumptions about zoned devices
and in turn breaking btrfs support. I hope I'm wrong on that, or at least
that it's possible to work around it.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 14:14                                           ` Johannes Thumshirn
  2022-03-15 14:27                                             ` David Sterba
@ 2022-03-15 15:11                                             ` Javier González
  2022-03-15 18:51                                             ` Pankaj Raghav
  2 siblings, 0 replies; 83+ messages in thread
From: Javier González @ 2022-03-15 15:11 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Matias Bjørling, Damien Le Moal,
	Luis Chamberlain, Keith Busch, Pankaj Raghav, Adam Manzanares,
	jiangbo.365, kanchan Joshi, Jens Axboe, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	linux-btrfs @ vger . kernel . org

On 15.03.2022 14:14, Johannes Thumshirn wrote:
>On 15/03/2022 14:52, Javier González wrote:
>> On 15.03.2022 14:30, Christoph Hellwig wrote:
>>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>>> file-system. As other interfaces arrive, this work will become natural.
>>>>
>>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>>>> still do the work in phases to make sure we have enough early feedback
>>>> from the community.
>>>>
>>>> Since this thread has been very active, I will wait some time for
>>>> Christoph and others to catch up before we start sending code.
>>>
>>> Can someone summarize where we stand?  Between the lack of quoting
>>>from hell and overly long lines from corporate mail clients I've
>>> mostly stopped reading this thread because it takes too much effort
>>> actually extract the information.
>>
>> Let me give it a try:
>>
>>   - PO2 emulation in NVMe is a no-go. Drop this.
>>
>>   - The arguments against supporting PO2 are:
>>       - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
>>         can create confusion for users of both SMR and ZNS
>>
>>       - Existing applications assume PO2 zone sizes, and probably do
>>         optimizations for these. These applications, if wanting to use
>>         ZNS will have to change the calculations
>>
>>       - There is a fear for performance regressions.
>>
>>       - It adds more work to you and other maintainers
>>
>>   - The arguments in favour of PO2 are:
>>       - Unmapped LBAs create holes that applications need to deal with.
>>         This affects mapping and performance due to splits. Bo explained
>>         this in a thread from Bytedance's perspective.  I explained in an
>>         answer to Matias how we are not letting zones transition to
>>         offline in order to simplify the host stack. Not sure if this is
>>         something we want to bring to NVMe.
>>
>>       - As ZNS adds more features and other protocols add support for
>>         zoned devices we will have more use-cases for the zoned block
>>         device. We will have to deal with these fragmentation at some
>>         point.
>>
>>       - This is used in production workloads in Linux hosts. I would
>>         advocate for this not being off-tree as it will be a headache for
>>         all in the future.
>>
>>   - If you agree that removing PO2 is an option, we can do the following:
>>       - Remove the constraint in the block layer and add ZoneFS support
>>         in a first patch.
>>
>>       - Add btrfs support in a later patch
>
>(+ linux-btrfs )
>
>Please also make sure to support btrfs and not only throw some patches
>over the fence. Zoned device support in btrfs is complex enough and has
>quite some special casing vs regular btrfs, which we're working on getting
>rid of. So having non-power-of-2 zone size, would also mean having NPO2
>block-groups (and thus block-groups not aligned to the stripe size).

Thanks for mentioning this, Johannes. If we say we will work with you on
supporting btrfs properly, we will.

I believe you have already seen a couple of patches fixing things for
zoned support in btrfs in the last few weeks.

>
>Just thinking of this and knowing I need to support it gives me a
>headache.

I hope we can help you with that. btrfs has no native alignment to PO2,
so I am confident we can find a good solution.

>
>Also please consult the rest of the btrfs developers for thoughts on this.
>After all btrfs has full zoned support (including ZNS, not saying it's
>perfect) and is also the default FS for at least two Linux distributions.

Of course. We will work with you and the other btrfs developers. Luis is
helping to make sure that we have good tests for linux-next. This is in
part how we found the problems with Append, which should be fixed
now.

>
>Thanks a lot,
>	Johannes


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:30                                       ` Christoph Hellwig
  2022-03-15 13:52                                         ` Javier González
@ 2022-03-15 17:00                                         ` Luis Chamberlain
  2022-03-16  0:07                                           ` Damien Le Moal
  1 sibling, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-15 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Javier González, Matias Bjørling, Damien Le Moal,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> > but we do not see a usage for ZNS in F2FS, as it is a mobile
> > file-system. As other interfaces arrive, this work will become natural.
> >
> > ZoneFS and butrfs are good targets for ZNS and these we can do. I would
> > still do the work in phases to make sure we have enough early feedback
> > from the community.
> >
> > Since this thread has been very active, I will wait some time for
> > Christoph and others to catch up before we start sending code.
> 
> Can someone summarize where we stand?

RFCs should be posted to help review and evaluate direct NPO2 support
(not emulation), given that we have no vendor willing to take a position
that NPO2 will *never* be supported on ZNS, and it's not clear yet how
many vendors other than Samsung actually require NPO2 support. The other
reason is that existing NPO2 customers currently bake hacks into Linux to
support NPO2, so fragmentation already exists. To help address this, it's
best to evaluate what the world of NPO2 support would look like, put in
the effort to do that work, and review it.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 14:14                                           ` Johannes Thumshirn
  2022-03-15 14:27                                             ` David Sterba
  2022-03-15 15:11                                             ` Javier González
@ 2022-03-15 18:51                                             ` Pankaj Raghav
  2022-03-16  8:37                                               ` Johannes Thumshirn
  2 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-15 18:51 UTC (permalink / raw)
  To: Johannes Thumshirn, Javier González, Christoph Hellwig
  Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
	Keith Busch, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme, linux-btrfs @ vger . kernel . org

Hi Johannes,

On 2022-03-15 15:14, Johannes Thumshirn wrote:
> Please also make sure to support btrfs and not only throw some patches 
> over the fence. Zoned device support in btrfs is complex enough and has 
> quite some special casing vs regular btrfs, which we're working on getting
> rid of. So having non-power-of-2 zone size, would also mean having NPO2

I already made a simple btrfs NPO2 POC, and it mostly involved changing
the PO2 calculations to generic ones. I understand that changing the
calculations from log & shifts to divisions will incur some performance
penalty, but I think we can wrap them in helpers to minimize that impact.
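
For illustration only, here is a minimal userspace sketch of what such
wrappers could look like (the struct and helper names are hypothetical,
not an existing btrfs or block layer API): PO2 zone sizes keep using
shifts and masks, while NPO2 zone sizes fall back to division.

  /* zone_calc.c - hypothetical helpers hiding PO2 vs NPO2 arithmetic */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct zone_geom {
          uint64_t zone_sectors;  /* zone size in 512B sectors */
          unsigned int shift;     /* log2(zone_sectors), valid only if po2 */
          bool po2;
  };

  static struct zone_geom zone_geom_init(uint64_t zone_sectors)
  {
          struct zone_geom g = { .zone_sectors = zone_sectors };

          g.po2 = (zone_sectors & (zone_sectors - 1)) == 0;
          for (uint64_t s = zone_sectors; g.po2 && s > 1; s >>= 1)
                  g.shift++;
          return g;
  }

  /* index of the zone containing a given sector */
  static uint64_t zone_number(const struct zone_geom *g, uint64_t sector)
  {
          return g->po2 ? sector >> g->shift : sector / g->zone_sectors;
  }

  /* offset of a sector from the start of its zone */
  static uint64_t zone_offset(const struct zone_geom *g, uint64_t sector)
  {
          return g->po2 ? sector & (g->zone_sectors - 1)
                        : sector % g->zone_sectors;
  }

  int main(void)
  {
          /* 96 MiB zone = 196608 sectors of 512B, not a power of 2 */
          struct zone_geom g = zone_geom_init(196608);

          printf("zone %llu, offset %llu\n",
                 (unsigned long long)zone_number(&g, 400000),
                 (unsigned long long)zone_offset(&g, 400000));
          return 0;
  }

Callers never see which variant is taken, so the shift-based fast path for
existing PO2 devices is preserved while NPO2 devices pay for the division.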

> So having non-power-of-2 zone size, would also mean having NPO2
> block-groups (and thus block-groups not aligned to the stripe size).
>

I agree with your point that we risk not aligning to the stripe size when
we move to NPO2 zone sizes; I believe the minimum stripe size is 64K
(please correct me if I am wrong). As David Sterba mentioned in his email,
we could agree on some reasonable alignment, which I believe would be the
minimum stripe size of 64K, to avoid adding complexity to the existing
btrfs zoned support. And it is a much milder constraint that most devices
can naturally adhere to, compared to the PO2 zone size requirement.

> Just thinking of this and knowing I need to support it gives me a 
> headache.
> 
This is definitely not some one-off patch that we want to get upstream
and then disappear. As Javier already pointed out, we would be more than
happy to help you out here.
> Also please consult the rest of the btrfs developers for thoughts on this.
> After all btrfs has full zoned support (including ZNS, not saying it's 
> perfect) and is also the default FS for at least two Linux distributions.
> 
> Thanks a lot,
> 	Johannes

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 14:27                                             ` David Sterba
@ 2022-03-15 19:56                                               ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-15 19:56 UTC (permalink / raw)
  To: dsterba, Johannes Thumshirn, Javier González,
	Christoph Hellwig, Matias Bjørling, Damien Le Moal,
	Luis Chamberlain, Keith Busch, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme,
	linux-btrfs @ vger . kernel . org

Hi David,

On 2022-03-15 15:27, David Sterba wrote:
> 
> PO2 is really easy to work with and I guess allocation on the physical
> device could also benefit from that, I'm still puzzled why the NPO2 is
> even proposed.
> 
Quick recap:
Hardware NAND cannot naturally align to PO2 zone sizes, which led to
having both a zone capacity and a zone size, where the zone capacity is
the actual storage available in a zone. The main proposal is to remove
the PO2 constraint to get rid of these LBA holes (generally speaking).
That is why this whole effort was started.
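
To make the hole concrete with a small worked example (the exact numbers
are only an illustration; any zone capacity smaller than the zone size
behaves the same way):

  zone size     = 128 MiB (power of 2)
  zone capacity =  96 MiB
  hole per zone = 128 MiB - 96 MiB = 32 MiB, i.e. 25% of the LBA space

Zone n starts at offset n * 128 MiB, but only the first 96 MiB of it are
writable; the remaining 32 MiB are unmapped LBAs that every application
has to skip and that I/Os must not straddle. With a 96 MiB zone size
(NPO2, size == capacity) the hole disappears entirely.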

> We can possibly hide the calculations behind some API so I hope in the
> end it should be bearable. The size of block groups is flexible we only
> want some reasonable alignment.
>
I agree. I already replied to Johannes on what it might look like.
To reiterate, the reasonable alignment I had in mind while doing a POC
for btrfs with NPO2 zone sizes is the minimum stripe size required by
btrfs (64K), to reduce the impact of this change on the existing zoned
support in btrfs.

> I haven't read the whole thread yet, my impression is that some hardware
> is deliberately breaking existing assumptions about zoned devices and in
> turn breaking btrfs support. I hope I'm wrong on that or at least that
> it's possible to work around it.
Based on the POC we did internally, it is definitely possible to support
it in btrfs, and making this change will not break the existing btrfs
support for zoned devices. A naive approach will have some performance
impact, as we would be changing the PO2 calculations from log & shifts to
divisions and multiplications. I definitely think we can optimize it to
minimize the impact on existing deployments.

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:05                                 ` Javier González
  2022-03-15 13:14                                   ` Matias Bjørling
@ 2022-03-16  0:00                                   ` Damien Le Moal
  2022-03-16  8:57                                     ` Javier González
  2022-03-16 16:18                                     ` Pankaj Raghav
  1 sibling, 2 replies; 83+ messages in thread
From: Damien Le Moal @ 2022-03-16  0:00 UTC (permalink / raw)
  To: Javier González, Matias Bjørling
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

On 3/15/22 22:05, Javier González wrote:
>>> The main constraint for (1) PO2 is removed in the block layer, we
>>> have (2) Linux hosts stating that unmapped LBAs are a problem,
>>> and we have (3) HW supporting size=capacity.
>>> 
>>> I would be happy to hear what else you would like to see for this
>>> to be of use to the kernel community.
>> 
>> (Added numbers to your paragraph above)
>> 
>> 1. The sysfs chunksize attribute was "misused" to also represent
>> zone size. What has changed is that RAID controllers now can use a
>> NPO2 chunk size. This wasn't meant to naturally extend to zones,
>> which as shown in the current posted patchset, is a lot more work.
> 
> True. But this was the main constraint for PO2.

And as I said, users asked for it.

>> 2. Bo mentioned that the software already manages holes. It took a
>> bit of time to get right, but now it works. Thus, the software in
>> question is already capable of working with holes. Thus, fixing
>> this, would present itself as a minor optimization overall. I'm not
>> convinced the work to do this in the kernel is proportional to the
>> change it'll make to the applications.
> 
> I will let Bo response himself to this.
> 
>> 3. I'm happy to hear that. However, I'll like to reiterate the
>> point that the PO2 requirement have been known for years. That
>> there's a drive doing NPO2 zones is great, but a decision was made
>> by the SSD implementors to not support the Linux kernel given its
>> current implementation.
> 
> Zone devices has been supported for years in SMR, and I this is a
> strong argument. However, ZNS is still very new and customers have
> several requirements. I do not believe that a HDD stack should have
> such an impact in NVMe.
> 
> Also, we will see new interfaces adding support for zoned devices in
> the future.
> 
> We should think about the future and not the past.

Backward compatibility ? We must not break userspace...

>> 
>> All that said - if there are people willing to do the work and it
>> doesn't have a negative impact on performance, code quality,
>> maintenance complexity, etc. then there isn't anything saying
>> support can't be added - but it does seem like it’s a lot of work,
>> for little overall benefits to applications and the host users.
> 
> Exactly.
> 
> Patches in the block layer are trivial. This is running in
> production loads without issues. I have tried to highlight the
> benefits in previous benefits and I believe you understand them.

The block layer is not the issue here. We all understand that one is easy.

> Support for ZoneFS seems easy too. We have an early POC for btrfs and
> it seems it can be done. We sign up for these 2.

zonefs can trivially support non-power-of-2 zone sizes, but as zonefs
creates a discrete view of the device capacity with its one-file-per-zone
interface, an application's accesses to a zone are forcibly limited to
that zone, as they should be. With zonefs, PO2 and NPO2 devices will
show the *same* interface to the application. Non-power-of-2 zone sizes
then have absolutely no benefit at all.
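
To illustrate the point, here is a minimal sketch of a zonefs user (the
mount point and zone index are made up for the example, and error
handling is trimmed): the application appends to the per-zone file and
never computes a zone start LBA itself, so the device's zone size, PO2
or not, never enters its arithmetic.

  /* append 4 KiB to the first sequential zone exposed by zonefs */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
          const char *zfile = "/mnt/zonefs/seq/0"; /* one file per zone */
          struct stat st;
          void *buf;
          int fd;

          fd = open(zfile, O_WRONLY | O_DIRECT); /* seq files want direct I/O */
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          fstat(fd, &st);        /* st_size tracks the zone write pointer */

          posix_memalign(&buf, 4096, 4096);
          memset(buf, 0, 4096);

          /* sequential zone files are append-only: write at the current size */
          if (pwrite(fd, buf, 4096, st.st_size) < 0)
                  perror("pwrite");

          free(buf);
          close(fd);
          return 0;
  }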

> As for F2FS and dm-zoned, I do not think these are targets at the 
> moment. If this is the path we follow, these will bail out at mkfs
> time.

And what makes you think that this is acceptable ? What guarantees do
you have that this will not be a problem for users out there ?



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 17:00                                         ` Luis Chamberlain
@ 2022-03-16  0:07                                           ` Damien Le Moal
  2022-03-16  0:23                                             ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-16  0:07 UTC (permalink / raw)
  To: Luis Chamberlain, Christoph Hellwig
  Cc: Javier González, Matias Bjørling, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 3/16/22 02:00, Luis Chamberlain wrote:
> On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote:
>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>> file-system. As other interfaces arrive, this work will become natural.
>>>
>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>>> still do the work in phases to make sure we have enough early feedback
>>> from the community.
>>>
>>> Since this thread has been very active, I will wait some time for
>>> Christoph and others to catch up before we start sending code.
>>
>> Can someone summarize where we stand?
> 
> RFCs should be posted to help review and evaluate direct NPO2 support
> (not emulation) given we have no vendor willing to take a position that
> NPO2 will *never* be supported on ZNS, and its not clear yet how many
> vendors other than Samsung actually require NPO2 support. The other
> reason is existing NPO2 customers currently cake in hacks to Linux to
> supoport NPO2 support, and so a fragmentation already exists. To help
> address this it's best to evaluate what the world of NPO2 support would
> look like and put the effort to do the work for that and review that.

And again, no mention of all the applications supporting zones that
assume a power-of-2 zone size and that will break. Seriously. Please stop
considering only the kernel. If this were only about the kernel, we
would all be working on patches already.

Allowing non-power-of-2 zone sizes may prevent applications running today
from running properly on these non-power-of-2 zone size devices. *Not*
nice. I have yet to see any convincing argument proving that this is not
an issue.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:07                                           ` Damien Le Moal
@ 2022-03-16  0:23                                             ` Luis Chamberlain
  2022-03-16  0:46                                               ` Damien Le Moal
  2022-03-16  2:27                                               ` Martin K. Petersen
  0 siblings, 2 replies; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-16  0:23 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Wed, Mar 16, 2022 at 09:07:18AM +0900, Damien Le Moal wrote:
> On 3/16/22 02:00, Luis Chamberlain wrote:
> > On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote:
> >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> >>> but we do not see a usage for ZNS in F2FS, as it is a mobile
> >>> file-system. As other interfaces arrive, this work will become natural.
> >>>
> >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
> >>> still do the work in phases to make sure we have enough early feedback
> >>> from the community.
> >>>
> >>> Since this thread has been very active, I will wait some time for
> >>> Christoph and others to catch up before we start sending code.
> >>
> >> Can someone summarize where we stand?
> > 
> > RFCs should be posted to help review and evaluate direct NPO2 support
> > (not emulation) given we have no vendor willing to take a position that
> > NPO2 will *never* be supported on ZNS, and its not clear yet how many
> > vendors other than Samsung actually require NPO2 support. The other
> > reason is existing NPO2 customers currently cake in hacks to Linux to
> > supoport NPO2 support, and so a fragmentation already exists. To help
> > address this it's best to evaluate what the world of NPO2 support would
> > look like and put the effort to do the work for that and review that.
> 
> And again no mentions of all the applications supporting zones assuming
> a power of 2 zone size that will break.

What applications? ZNS does not impose a PO2 requirement. So I really
want to know which applications make this assumption and would break if,
all of a sudden, NPO2 were supported.

Why would that break those ZNS applications?

> Allowing non power of 2 zone size may prevent applications running today
> to run properly on these non power of 2 zone size devices. *not* nice.

Applications which want to support ZNS have to take into consideration
that NPO2 is possible, and that there are existing users of it in the
world today.

You cannot negate their existence.

> I have yet to see any convincing argument proving that this is not an issue.

You are just saying things can break, but not clarifying exactly what.
And you have not taken a position to say that WD will never support NPO2
on ZNS. And so, you cannot rule out that path for support as a
possibility, even if it means work for the ecosystem today.

 Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:23                                             ` Luis Chamberlain
@ 2022-03-16  0:46                                               ` Damien Le Moal
  2022-03-16  1:24                                                 ` Luis Chamberlain
  2022-03-16  2:27                                               ` Martin K. Petersen
  1 sibling, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-16  0:46 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/16/22 09:23, Luis Chamberlain wrote:
> On Wed, Mar 16, 2022 at 09:07:18AM +0900, Damien Le Moal wrote:
>> On 3/16/22 02:00, Luis Chamberlain wrote:
>>> On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote:
>>>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>>>> file-system. As other interfaces arrive, this work will become natural.
>>>>>
>>>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>>>>> still do the work in phases to make sure we have enough early feedback
>>>>> from the community.
>>>>>
>>>>> Since this thread has been very active, I will wait some time for
>>>>> Christoph and others to catch up before we start sending code.
>>>>
>>>> Can someone summarize where we stand?
>>>
>>> RFCs should be posted to help review and evaluate direct NPO2 support
>>> (not emulation) given we have no vendor willing to take a position that
>>> NPO2 will *never* be supported on ZNS, and its not clear yet how many
>>> vendors other than Samsung actually require NPO2 support. The other
>>> reason is existing NPO2 customers currently cake in hacks to Linux to
>>> supoport NPO2 support, and so a fragmentation already exists. To help
>>> address this it's best to evaluate what the world of NPO2 support would
>>> look like and put the effort to do the work for that and review that.
>>
>> And again no mentions of all the applications supporting zones assuming
>> a power of 2 zone size that will break.
> 
> What applications? ZNS does not incur a PO2 requirement. So I really
> want to know what applications make this assumption and would break
> because all of a sudden say NPO2 is supported.

Exactly. What applications ? For ZNS, I cannot say as devices have not
been available for long. But neither can you.

> Why would that break those ZNS applications?

Please keep in mind that there are power-of-2 zone sized ZNS devices out
there. Applications designed for these devices and optimized to do bit
shift arithmetic using the power-of-2 size property will break. What is
the plan for that case ? How will you address these users' complaints ?

>> Allowing non power of 2 zone size may prevent applications running today
>> to run properly on these non power of 2 zone size devices. *not* nice.
> 
> Applications which want to support ZNS have to take into consideration
> that NPO2 is posisble and there existing users of that world today.

Which is really an ugly approach. The kernel zone user interface is
common to all zoned devices: SMR, ZNS, null_blk, DM (dm-crypt,
dm-linear). They all have one point in common: zone size is a power of
2. Zone capacity may differ, but hey, we also unified that by reporting
a zone capacity for *ALL* of them.

Applications correctly designed for SMR can thus run on ZNS too.
With this in mind, the spectrum of applications that would break on non
power of 2 ZNS devices is suddenly much larger.

This has always been my concern from the start: allowing non power of 2
zone size fragments userspace support and has the potential to
complicate things for application developers.

> 
> You cannot negate their existance.
> 
>> I have yet to see any convincing argument proving that this is not an issue.
> 
> You are just saying things can break but not clarifying exactly what.
> And you have not taken a position to say WD will not ever support NPO2
> on ZNS. And so, you can't negate the prospect of that implied path for
> support as a possibility, even if it means work towards the ecosystem
> today.

Please do not bring corporate strategy aspects into this discussion.
This is a technical discussion; I am not talking as a representative of
my employer, nor should we ever discuss business plans on a public
mailing list. I am a kernel developer and maintainer. Keep it technical,
please.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:46                                               ` Damien Le Moal
@ 2022-03-16  1:24                                                 ` Luis Chamberlain
  2022-03-16  1:44                                                   ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-16  1:24 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Wed, Mar 16, 2022 at 09:46:44AM +0900, Damien Le Moal wrote:
> On 3/16/22 09:23, Luis Chamberlain wrote:
> > What applications? ZNS does not incur a PO2 requirement. So I really
> > want to know what applications make this assumption and would break
> > because all of a sudden say NPO2 is supported.
> 
> Exactly. What applications ? For ZNS, I cannot say as devices have not
> been available for long. But neither can you.

I can tell you that there is an existing NPO2 ZNS customer who chimed in
on the discussion and described having to carry a delta to support
NPO2 ZNS. So if you cannot tell me of a ZNS application which is going to
break when NPO2 support is added, then your original point suggesting
that there would be breakage is not valid.

> > Why would that break those ZNS applications?
> 
> Please keep in mind that there are power of 2 zone sized ZNS devices out
> there.

No one is saying otherwise.

> Applications designed for these devices and optimized to do bit
> shift arithmetic using the power of 2 size property will break.

They must not be ZNS. So they can continue to chug on.

> What the
> plan for that case ? How will you address these users complaints ?

They are not ZNS, so they don't have to worry about ZNS.

ZNS applications must be aware of the fact that NPO2 can exist.
ZNS applications must be aware of the fact that any vendor may one day
sell NPO2 devices.

> >> Allowing non power of 2 zone size may prevent applications running today
> >> to run properly on these non power of 2 zone size devices. *not* nice.
> > 
> > Applications which want to support ZNS have to take into consideration
> > that NPO2 is posisble and there existing users of that world today.
> 
> Which is really an ugly approach.

Ugly is relative and subjective. NAND does not force PO2.

> The kernel

<etc> And back you go to kernel talk. I thought you wanted to
focus on applications.

> Applications correctly designed for SMR can thus also run on ZNS too.

That seems to be an incorrect assumption given ZNS drives exist
with NPO2. So you can probably say that some SMR applications can work
with PO2 ZNS drives. That is a more correct statement.

> With this in mind, the spectrum of applications that would break on non
> power of 2 ZNS devices is suddenly much larger.

We already determined you cannot identify any ZNS specific application
which would break.

SMR != ZNS

If you really want to use SMR applications for ZNS, that seems to be
a bit beyond the scope of this discussion, but it seems to me that those
SMR applications should simply learn that if a device is ZNS, NPO2 can
be expected.

As technologies evolve so do specifications.

> This has always been my concern from the start: allowing non power of 2
> zone size fragments userspace support and has the potential to
> complicate things for application developers.

It's a reality though. Devices exist, and so do users. And they're
carrying their own delta to support NPO2 ZNS today on Linux.

> > You cannot negate their existance.
> > 
> >> I have yet to see any convincing argument proving that this is not an issue.
> > 
> > You are just saying things can break but not clarifying exactly what.
> > And you have not taken a position to say WD will not ever support NPO2
> > on ZNS. And so, you can't negate the prospect of that implied path for
> > support as a possibility, even if it means work towards the ecosystem
> > today.
> 
> Please do not bring in corporate strategy aspects in this discussion.
> This is a technical discussion and I am not talking as a representative
> of my employer nor should we ever dicsuss business plans on a public
> mailing list. I am a kernel developer and maintainer. Keep it technical
> please.

This conversation is about the reality that NPO2 ZNS devices exist and how
best to support them. You seem to want to negate that reality and that
support on Linux without even considering what the changes to support
ZNS NPO2 would look like.

As a maintainer, I think we need to *evaluate* supporting users as best
as possible. Not deny their existence. Even if it pains us.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  1:24                                                 ` Luis Chamberlain
@ 2022-03-16  1:44                                                   ` Damien Le Moal
  2022-03-16  2:13                                                     ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-16  1:44 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/16/22 10:24, Luis Chamberlain wrote:
> On Wed, Mar 16, 2022 at 09:46:44AM +0900, Damien Le Moal wrote:
>> On 3/16/22 09:23, Luis Chamberlain wrote:
>>> What applications? ZNS does not incur a PO2 requirement. So I really
>>> want to know what applications make this assumption and would break
>>> because all of a sudden say NPO2 is supported.
>>
>> Exactly. What applications ? For ZNS, I cannot say as devices have not
>> been available for long. But neither can you.
> 
> I can tell you we there is an existing NPO2 ZNS customer which chimed on
> the discussion and they described having to carry a delta to support
> NPO2 ZNS. So if you cannot tell me of a ZNS application which is going to
> break to add NPO2 support then your original point is not valid of
> suggesting that there would be a break.
> 
>>> Why would that break those ZNS applications?
>>
>> Please keep in mind that there are power of 2 zone sized ZNS devices out
>> there.
> 
> No one is saying otherwise.
> 
>> Applications designed for these devices and optimized to do bit
>> shift arithmetic using the power of 2 size property will break.
> 
> They must not be ZNS. So they can continue to chug on.
> 
>> What the
>> plan for that case ? How will you address these users complaints ?
> 
> They are not ZNS so they don't have to worry about ZNS.
> 
> ZNS applications must be aware of that fact that NPO2 can exist.
> ZNS applications must be aware of that fact that any vendor may one day
> sell NPO2 devices.
> 
>>>> Allowing non power of 2 zone size may prevent applications running today
>>>> to run properly on these non power of 2 zone size devices. *not* nice.
>>>
>>> Applications which want to support ZNS have to take into consideration
>>> that NPO2 is posisble and there existing users of that world today.
>>
>> Which is really an ugly approach.
> 
> Ugly is relative and subjective. NAND does not force PO2.
> 
>> The kernel
> 
> <etc> And back you go to kernel talk. I thought you wanted to
> focus on applications.
> 
>> Applications correctly designed for SMR can thus also run on ZNS too.
> 
> That seems to be an incorrect assumption given ZNS drives exist
> with NPO2. So you can probably say that some SMR applications can work
> with PO2 ZNS drives. That is a more correct statement.
> 
>> With this in mind, the spectrum of applications that would break on non
>> power of 2 ZNS devices is suddenly much larger.
> 
> We already determined you cannot identify any ZNS specific application
> which would break.
> 
> SMR != ZNS

Not for the block layer nor for any in-kernel users above it today. We
should not drive toward differentiating device types but unify them
under a common interface that works for everything, including
applications. That is why we have zone append emulation in the scsi disk
driver.

Considering the zone size requirement problem in the context of ZNS only
is thus far from ideal in my opinion, to say the least.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  1:44                                                   ` Damien Le Moal
@ 2022-03-16  2:13                                                     ` Luis Chamberlain
  0 siblings, 0 replies; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-16  2:13 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Wed, Mar 16, 2022 at 10:44:56AM +0900, Damien Le Moal wrote:
> On 3/16/22 10:24, Luis Chamberlain wrote:
> > SMR != ZNS
> 
> Not for the block layer nor for any in-kernel <etc>

Back to kernel talk; I thought you wanted to focus on applications.

> Considering the zone size requirement problem in the context of ZNS only
> is thus far from ideal in my opinion, to say the least.

It's the reality for ZNS though.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:23                                             ` Luis Chamberlain
  2022-03-16  0:46                                               ` Damien Le Moal
@ 2022-03-16  2:27                                               ` Martin K. Petersen
  2022-03-16  2:41                                                 ` Luis Chamberlain
  2022-03-16  8:44                                                 ` Javier González
  1 sibling, 2 replies; 83+ messages in thread
From: Martin K. Petersen @ 2022-03-16  2:27 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Damien Le Moal, Christoph Hellwig, Javier González,
	Matias Bjørling, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme


Luis,

> Applications which want to support ZNS have to take into consideration
> that NPO2 is posisble and there existing users of that world today.

Every time a new technology comes along vendors inevitably introduce
first gen devices that are implemented with little consideration for the
OS stacks they need to work with. This has happened for pretty much
every technology I have been involved with over the years. So the fact
that NPO2 devices exist is no argument. There are tons of devices out
there that Linux does not support and never will.

In early engagements SSD drive vendors proposed all sorts of weird NPO2
block sizes and alignments that it was argued were *incontestable*
requirements for building NAND devices. And yet a generation or two
later every SSD transparently handled 512-byte or 4096-byte logical
blocks just fine. Imagine if we had re-engineered the entire I/O stack
to accommodate these awful designs?

Similarly, many proponents suggested oddball NPO2 sizes for SMR
zones. And yet the market very quickly settled on PO2 once things
started shipping in volume.

Simplicity and long term maintainability of the kernel should always
take precedence as far as I'm concerned.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  2:27                                               ` Martin K. Petersen
@ 2022-03-16  2:41                                                 ` Luis Chamberlain
  2022-03-16  8:44                                                 ` Javier González
  1 sibling, 0 replies; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-16  2:41 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Damien Le Moal, Christoph Hellwig, Javier González,
	Matias Bjørling, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

On Tue, Mar 15, 2022 at 10:27:32PM -0400, Martin K. Petersen wrote:
> Simplicity and long term maintainability of the kernel should always
> take precedence as far as I'm concerned.

No one is arguing against that. It is not even clear what all the changes are.
So to argue that the sky will fall seems a bit too early without seeing
patches, don't you think?

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 18:51                                             ` Pankaj Raghav
@ 2022-03-16  8:37                                               ` Johannes Thumshirn
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Thumshirn @ 2022-03-16  8:37 UTC (permalink / raw)
  To: Pankaj Raghav, Javier González, Christoph Hellwig
  Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
	Keith Busch, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme, linux-btrfs @ vger . kernel . org

On 15/03/2022 19:51, Pankaj Raghav wrote:
>> ck-groups (and thus block-groups not aligned to the stripe size).
>>
> I agree with your point that we risk not aligning to the stripe size
> when we move to npo2 zone sizes; I believe the minimum stripe size is
> 64K (please correct me if I am wrong). As David Sterba mentioned in his
> email, we could agree on some reasonable alignment, which I believe
> would be the minimum stripe size of 64k, to avoid adding complexity to
> the existing btrfs zoned support. And that is a much milder constraint,
> one that most devices can naturally adhere to, compared to the po2 zone
> size requirement.
> 

What could be done is rounding a zone down to the next po2 (64k aligned),
but then we need to explicitly finish the zones.
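
Just to illustrate the rounding being discussed here, below is a small
standalone sketch (not code from this patch set; the helper name and the
96M example zone are made up) of rounding a zone size down to the largest
power of two, which is trivially 64k aligned:

#include <stdint.h>
#include <stdio.h>

/* Largest power of two less than or equal to n (n > 0). */
static uint64_t rounddown_pow_of_two(uint64_t n)
{
	uint64_t p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

int main(void)
{
	uint64_t zone_size = 96ULL * 2048;	/* 96M zone in 512B sectors */
	uint64_t usable = rounddown_pow_of_two(zone_size);

	/*
	 * 96M rounds down to 64M. The remaining 32M per zone would go
	 * unused, and the zone would have to be explicitly finished once
	 * the usable part is full.
	 */
	printf("usable sectors per zone: %llu\n", (unsigned long long)usable);
	return 0;
}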

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  2:27                                               ` Martin K. Petersen
  2022-03-16  2:41                                                 ` Luis Chamberlain
@ 2022-03-16  8:44                                                 ` Javier González
  1 sibling, 0 replies; 83+ messages in thread
From: Javier González @ 2022-03-16  8:44 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Luis Chamberlain, Damien Le Moal, Christoph Hellwig,
	Matias Bjørling, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

On 15.03.2022 22:27, Martin K. Petersen wrote:
>
>Luis,
>
>> Applications which want to support ZNS have to take into consideration
>> that NPO2 is possible and that there are existing users of it in the
>> world today.
>
>Every time a new technology comes along vendors inevitably introduce
>first gen devices that are implemented with little consideration for the
>OS stacks they need to work with. This has happened for pretty much
>every technology I have been involved with over the years. So the fact
>that NPO2 devices exist is no argument. There are tons of devices out
>there that Linux does not support and never will.
>
>In early engagements SSD drive vendors proposed all sorts of weird NPO2
>block sizes and alignments that it was argued were *incontestable*
>requirements for building NAND devices. And yet a generation or two
>later every SSD transparently handled 512-byte or 4096-byte logical
>blocks just fine. Imagine if we had re-engineered the entire I/O stack
>to accommodate these awful designs?
>
>Similarly, many proponents suggested oddball NPO2 sizes for SMR
>zones. And yet the market very quickly settled on PO2 once things
>started shipping in volume.
>
>Simplicity and long term maintainability of the kernel should always
>take precedence as far as I'm concerned.

Martin, you are absolutely right.

The argument is not that there is available HW. The argument is that as
we tried to retrofit ZNS into the zoned block device, the gap between
zone size and capacity has brought adoption issues for some customers.

I would still like to give it some time to get feedback on the plan I
proposed yesterday before we post patches. At this point, I would very
much like to hear your opinion on how the changes would incur a
maintainability problem. Nobody wants that.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:00                                   ` Damien Le Moal
@ 2022-03-16  8:57                                     ` Javier González
  2022-03-16 16:18                                     ` Pankaj Raghav
  1 sibling, 0 replies; 83+ messages in thread
From: Javier González @ 2022-03-16  8:57 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Matias Bjørling, Christoph Hellwig, Luis Chamberlain,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 16.03.2022 09:00, Damien Le Moal wrote:
>On 3/15/22 22:05, Javier González wrote:
>>>> The main constraint for (1) PO2 is removed in the block layer, we
>>>> have (2) Linux hosts stating that unmapped LBAs are a problem,
>>>> and we have (3) HW supporting size=capacity.
>>>>
>>>> I would be happy to hear what else you would like to see for this
>>>> to be of use to the kernel community.
>>>
>>> (Added numbers to your paragraph above)
>>>
>>> 1. The sysfs chunksize attribute was "misused" to also represent
>>> zone size. What has changed is that RAID controllers can now use an
>>> NPO2 chunk size. This wasn't meant to naturally extend to zones,
>>> which, as shown in the currently posted patchset, is a lot more work.
>>
>> True. But this was the main constraint for PO2.
>
>And as I said, users asked for it.

Now users are asking for arbitrary zone sizes.

[...]

>>> 3. I'm happy to hear that. However, I'd like to reiterate the
>>> point that the PO2 requirement has been known for years. That
>>> there's a drive doing NPO2 zones is great, but a decision was made
>>> by the SSD implementors not to support the Linux kernel given its
>>> current implementation.
>>
>> Zoned devices have been supported for years with SMR, and I agree this
>> is a strong argument. However, ZNS is still very new and customers have
>> several requirements. I do not believe that an HDD stack should have
>> such an impact on NVMe.
>>
>> Also, we will see new interfaces adding support for zoned devices in
>> the future.
>>
>> We should think about the future and not the past.
>
>Backward compatibility ? We must not break userspace...

This is not a user API change. If making changes to applications to
adopt new features and technologies is breaking user-space, then the
zoned block device already broke that when we introduced zone capacity.
Any existing zoned application _will have to_ make changes to support
ZNS.
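
To make the zone size vs. capacity point concrete, here is a trivial
standalone sketch of the per-zone LBA gap that applications already have
to skip on a PO2 device whose capacity is smaller than its zone size (the
128M/96M geometry is only an example):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Example geometry only: PO2 zone size 128M, zone capacity 96M,
	 * both expressed in 512-byte sectors. */
	uint64_t zone_size = 128ULL * 2048;
	uint64_t zone_cap  = 96ULL * 2048;
	uint64_t gap = zone_size - zone_cap;	/* unmapped LBAs per zone */

	/*
	 * A zoned application must stop writing at zone_start + zone_cap
	 * and jump to zone_start + zone_size for the next zone, i.e. it
	 * already has to handle this gap today.
	 */
	printf("gap per zone: %llu sectors (%llu MiB)\n",
	       (unsigned long long)gap, (unsigned long long)(gap / 2048));
	return 0;
}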

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:00                                   ` Damien Le Moal
  2022-03-16  8:57                                     ` Javier González
@ 2022-03-16 16:18                                     ` Pankaj Raghav
  1 sibling, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-16 16:18 UTC (permalink / raw)
  To: Damien Le Moal, Javier González, Matias Bjørling
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme


Hi Damien,

On 2022-03-16 01:00, Damien Le Moal wrote:
>> As for F2FS and dm-zoned, I do not think these are targets at the 
>> moment. If this is the path we follow, these will bail out at mkfs
>> time.
> 
> And what makes you think that this is acceptable ? What guarantees do
> you have that this will not be a problem for users out there ?
> 
As you know, the architecture of F2FS ATM requires PO2 segments;
therefore, it might not be possible to support non-PO2 ZNS drives.

So we could continue supporting PO2 ZNS drives for F2FS and bail out at
mkfs time if it is a non-PO2 ZNS drive (this is the current behavior as
well). This way we are not really breaking anything for ZNS drives that
have already been deployed for F2FS users.

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 20:51           ` Keith Busch
  2022-03-11 21:04             ` Luis Chamberlain
  2022-03-11 22:23             ` Adam Manzanares
@ 2022-03-21 16:21             ` Jonathan Derrick
  2022-03-21 16:44               ` Keith Busch
  2 siblings, 1 reply; 83+ messages in thread
From: Jonathan Derrick @ 2022-03-21 16:21 UTC (permalink / raw)
  To: Keith Busch, Luis Chamberlain
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal



On 3/11/2022 1:51 PM, Keith Busch wrote:
> On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
>> NAND has no PO2 requirement. The emulation effort was only done to help
>> add support for !PO2 devices because there is no alternative. If we
>> however are ready instead to go down the avenue of removing those
>> restrictions well let's go there then instead. If that's not even
>> something we are willing to consider I'd really like folks who stand
>> behind the PO2 requirement to stick their necks out and clearly say that
>> their hw/fw teams are happy to deal with this requirement forever on ZNS.
> 
> Regardless of the merits of the current OS requirement, it's a trivial
> matter for firmware to round up their reported zone size to the next
> power of 2. This does not create a significant burden on their part, as
> far as I know.

I sure wonder why !PO2 keeps coming up if it's so trivial to fix in firmware as
you claim. I actually find the hubris of the Linux community wrt the whole PO2
requirement pretty exhausting.

Consider that some SSD manufacturers are having to deal with a NAND shortage and
existing ASIC architecture limitations that may define the sizes of their erase
blocks and write units. A !PO2 implementation in the Linux kernel would give
consumers more options to choose from in the marketplace for their Linux ZNS
applications.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-21 16:21             ` Jonathan Derrick
@ 2022-03-21 16:44               ` Keith Busch
  0 siblings, 0 replies; 83+ messages in thread
From: Keith Busch @ 2022-03-21 16:44 UTC (permalink / raw)
  To: Jonathan Derrick
  Cc: Luis Chamberlain, Christoph Hellwig, Pankaj Raghav,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Matias Bjørling,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	Damien Le Moal

On Mon, Mar 21, 2022 at 10:21:36AM -0600, Jonathan Derrick wrote:
> 
> 
> On 3/11/2022 1:51 PM, Keith Busch wrote:
> > On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
> > > NAND has no PO2 requirement. The emulation effort was only done to help
> > > add support for !PO2 devices because there is no alternative. If we
> > > however are ready instead to go down the avenue of removing those
> > > restrictions well let's go there then instead. If that's not even
> > > something we are willing to consider I'd really like folks who stand
> > > behind the PO2 requirement to stick their necks out and clearly say that
> > > their hw/fw teams are happy to deal with this requirement forever on ZNS.
> > 
> > Regardless of the merits of the current OS requirement, it's a trivial
> > matter for firmware to round up their reported zone size to the next
> > power of 2. This does not create a significant burden on their part, as
> > far as I know.
> 
> Sure wonder why !PO2 keeps coming up if it's so trivial to fix in firmware as you claim.

The triviality to adjust alignment in firmware has nothing to do with
some users' desire to not see gaps in LBA space.

> I actually find the hubris of the Linux community wrt the whole PO2 requirement
> pretty exhausting.
>
> Consider that some SSD manufacturers are having to rely on a NAND shortage and
> existing ASIC architecture limitations that may define the sizes of their erase blocks
> and write units. A !PO2 implementation in the Linux kernel would enable consumers
> to be able to choose more options in the marketplace for their Linux ZNS application.

All zoned block devices in the Linux kernel go through a common abstraction
interface. Users expect that you can swap out one zoned device for another and
all the features they previously used will continue to work. That does not
necessarily hold if the long-standing zone alignment is relaxed. Fragmenting
use cases harms adoption, so this discussion seems appropriate.

^ permalink raw reply	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2022-03-21 16:44 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20220308165414eucas1p106df0bd6a901931215cfab81660a4564@eucas1p1.samsung.com>
2022-03-08 16:53 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Pankaj Raghav
     [not found]   ` <CGME20220308165421eucas1p20575444f59702cd5478cb35fce8b72cd@eucas1p2.samsung.com>
2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
2022-03-08 17:14       ` Keith Busch
2022-03-08 17:43         ` Pankaj Raghav
2022-03-09  3:40       ` Damien Le Moal
2022-03-09 13:19         ` Pankaj Raghav
2022-03-09  3:44       ` Damien Le Moal
2022-03-09 13:35         ` Pankaj Raghav
     [not found]   ` <CGME20220308165428eucas1p14ea0a38eef47055c4fa41d695c5a249d@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops Pankaj Raghav
2022-03-09  3:46       ` Damien Le Moal
2022-03-09 14:02         ` Pankaj Raghav
     [not found]   ` <CGME20220308165432eucas1p18b36a238ef3f5a812ee7f9b0e52599a5@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 3/6] block: add a bool member to request_queue for power_of_2 emulation Pankaj Raghav
     [not found]   ` <CGME20220308165436eucas1p1b76f3cb5b4fa1f7d78b51a3b1b44d160@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices Pankaj Raghav
2022-03-09  4:04       ` Damien Le Moal
2022-03-09 14:33         ` Pankaj Raghav
2022-03-09 21:43           ` Damien Le Moal
2022-03-10 20:35             ` Luis Chamberlain
2022-03-10 23:50               ` Damien Le Moal
2022-03-11  0:56                 ` Luis Chamberlain
     [not found]   ` <CGME20220308165443eucas1p17e61670a5057f21a6c073711b284bfeb@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 5/6] null_blk: forward the sector value from null_handle_memory_backend Pankaj Raghav
     [not found]   ` <CGME20220308165448eucas1p12c7c302a4b239db64b49d54cc3c1f0ac@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device Pankaj Raghav
2022-03-09  4:09       ` Damien Le Moal
2022-03-09 14:42         ` Pankaj Raghav
2022-03-10  9:47   ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Christoph Hellwig
2022-03-10 12:57     ` Pankaj Raghav
2022-03-10 13:07       ` Matias Bjørling
2022-03-10 13:14         ` Javier González
2022-03-10 14:58           ` Matias Bjørling
2022-03-10 15:07             ` Keith Busch
2022-03-10 15:16               ` Javier González
2022-03-10 23:44                 ` Damien Le Moal
2022-03-10 15:13             ` Javier González
2022-03-10 14:44       ` Christoph Hellwig
2022-03-11 20:19         ` Luis Chamberlain
2022-03-11 20:51           ` Keith Busch
2022-03-11 21:04             ` Luis Chamberlain
2022-03-11 21:31               ` Keith Busch
2022-03-11 22:24                 ` Luis Chamberlain
2022-03-12  7:58                   ` Damien Le Moal
2022-03-14  7:35                     ` Christoph Hellwig
2022-03-14  7:45                       ` Damien Le Moal
2022-03-14  7:58                         ` Christoph Hellwig
2022-03-14 10:49                         ` Javier González
2022-03-14 14:16                           ` Matias Bjørling
2022-03-14 16:23                             ` Luis Chamberlain
2022-03-14 19:30                               ` Matias Bjørling
2022-03-14 19:51                                 ` Luis Chamberlain
2022-03-15 10:45                                   ` Matias Bjørling
2022-03-14 19:55                             ` Javier González
2022-03-15 12:32                               ` Matias Bjørling
2022-03-15 13:05                                 ` Javier González
2022-03-15 13:14                                   ` Matias Bjørling
2022-03-15 13:26                                     ` Javier González
2022-03-15 13:30                                       ` Christoph Hellwig
2022-03-15 13:52                                         ` Javier González
2022-03-15 14:03                                           ` Matias Bjørling
2022-03-15 14:14                                           ` Johannes Thumshirn
2022-03-15 14:27                                             ` David Sterba
2022-03-15 19:56                                               ` Pankaj Raghav
2022-03-15 15:11                                             ` Javier González
2022-03-15 18:51                                             ` Pankaj Raghav
2022-03-16  8:37                                               ` Johannes Thumshirn
2022-03-15 17:00                                         ` Luis Chamberlain
2022-03-16  0:07                                           ` Damien Le Moal
2022-03-16  0:23                                             ` Luis Chamberlain
2022-03-16  0:46                                               ` Damien Le Moal
2022-03-16  1:24                                                 ` Luis Chamberlain
2022-03-16  1:44                                                   ` Damien Le Moal
2022-03-16  2:13                                                     ` Luis Chamberlain
2022-03-16  2:27                                               ` Martin K. Petersen
2022-03-16  2:41                                                 ` Luis Chamberlain
2022-03-16  8:44                                                 ` Javier González
2022-03-15 13:39                                       ` Matias Bjørling
2022-03-16  0:00                                   ` Damien Le Moal
2022-03-16  8:57                                     ` Javier González
2022-03-16 16:18                                     ` Pankaj Raghav
2022-03-14  8:36                     ` Matias Bjørling
2022-03-11 22:23             ` Adam Manzanares
2022-03-11 22:30               ` Keith Busch
2022-03-21 16:21             ` Jonathan Derrick
2022-03-21 16:44               ` Keith Busch
2022-03-10 17:38     ` Adam Manzanares
2022-03-14  7:36       ` Christoph Hellwig
