linux-block.vger.kernel.org archive mirror
* [PATCH 0/6] ZNS: Extra features for current patches
@ 2020-06-25 12:21 Javier González
  2020-06-25 12:21 ` [PATCH 1/6] block: introduce IOCTL for zone mgmt Javier González
                   ` (6 more replies)
  0 siblings, 7 replies; 70+ messages in thread
From: Javier González @ 2020-06-25 12:21 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, hch, kbusch, sagi, axboe, Javier González

From: Javier González <javier.gonz@samsung.com>

This patchset extends zoned device functionality on top of the existing
v3 ZNS patchset that Keith sent last week.

Patches 1-5 are zoned block interface and IOCTL additions that expose ZNS
values to user space. One major change is a new zone management IOCTL
that allows zone management commands to be extended with flags. I recall
a conversation on the mailing list from early this year where Matias
proposed a similar approach, but it never made it upstream. We extended
the IOCTL here to align with the comments in that thread. We are happy to
add sign-offs from anyone who contributed to that thread - just comment
here or on the patch.

Patch 6 is nvme-only and adds an extra check to the ZNS report code to
ensure consistency on the zone count.

The patches apply on top of Jens' block-5.8 + Keith's V3 ZNS patches.

Thanks,
Javier

Javier González (6):
  block: introduce IOCTL for zone mgmt
  block: add support for selecting all zones
  block: add support for zone offline transition
  block: introduce IOCTL to report dev properties
  block: add zone attr. to zone mgmt IOCTL struct
  nvme: Add consistency check for zone count

 block/blk-core.c              |   2 +
 block/blk-zoned.c             | 108 +++++++++++++++++++++++++++++++++-
 block/ioctl.c                 |   4 ++
 drivers/nvme/host/core.c      |   5 ++
 drivers/nvme/host/nvme.h      |  11 ++++
 drivers/nvme/host/zns.c       |  69 ++++++++++++++++++++++
 include/linux/blk_types.h     |   6 +-
 include/linux/blkdev.h        |  19 +++++-
 include/uapi/linux/blkzoned.h |  69 +++++++++++++++++++++-
 9 files changed, 289 insertions(+), 4 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 1/6] block: introduce IOCTL for zone mgmt
  2020-06-25 12:21 [PATCH 0/6] ZNS: Extra features for current patches Javier González
@ 2020-06-25 12:21 ` Javier González
  2020-06-26  1:17   ` Damien Le Moal
  2020-06-25 12:21 ` [PATCH 2/6] block: add support for selecting all zones Javier González
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-25 12:21 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

From: Javier González <javier.gonz@samsung.com>

The current IOCTL interface for zone management is limited by struct
blk_zone_range, which is unfortunately not extensible. In particular, the
lack of flags is problematic for ZNS zoned devices.

This new IOCTL is designed to be a superset of the current one, with
support for flags and bits for extensibility.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/blk-zoned.c             | 56 ++++++++++++++++++++++++++++++++++-
 block/ioctl.c                 |  2 ++
 include/linux/blkdev.h        |  9 ++++++
 include/uapi/linux/blkzoned.h | 33 +++++++++++++++++++++
 4 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 81152a260354..e87c60004dc5 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -322,7 +322,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
  * BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE ioctl processing.
  * Called from blkdev_ioctl.
  */
-int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
+int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
 			   unsigned int cmd, unsigned long arg)
 {
 	void __user *argp = (void __user *)arg;
@@ -370,6 +370,60 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
 				GFP_KERNEL);
 }
 
+/*
+ * Zone management ioctl processing. Extension of blkdev_zone_ops_ioctl(), with
+ * blk_zone_mgmt structure.
+ *
+ * Called from blkdev_ioctl.
+ */
+int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
+			   unsigned int cmd, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct request_queue *q;
+	struct blk_zone_mgmt zmgmt;
+	enum req_opf op;
+
+	if (!argp)
+		return -EINVAL;
+
+	q = bdev_get_queue(bdev);
+	if (!q)
+		return -ENXIO;
+
+	if (!blk_queue_is_zoned(q))
+		return -ENOTTY;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(&zmgmt, argp, sizeof(struct blk_zone_mgmt)))
+		return -EFAULT;
+
+	switch (zmgmt.action) {
+	case BLK_ZONE_MGMT_CLOSE:
+		op = REQ_OP_ZONE_CLOSE;
+		break;
+	case BLK_ZONE_MGMT_FINISH:
+		op = REQ_OP_ZONE_FINISH;
+		break;
+	case BLK_ZONE_MGMT_OPEN:
+		op = REQ_OP_ZONE_OPEN;
+		break;
+	case BLK_ZONE_MGMT_RESET:
+		op = REQ_OP_ZONE_RESET;
+		break;
+	default:
+		return -ENOTTY;
+	}
+
+	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
+				GFP_KERNEL);
+}
+
 static inline unsigned long *blk_alloc_zone_bitmap(int node,
 						   unsigned int nr_zones)
 {
diff --git a/block/ioctl.c b/block/ioctl.c
index bdb3bbb253d9..0ea29754e7dd 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -514,6 +514,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
 	case BLKOPENZONE:
 	case BLKCLOSEZONE:
 	case BLKFINISHZONE:
+		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
+	case BLKMGMTZONE:
 		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
 	case BLKGETZONESZ:
 		return put_uint(argp, bdev_zone_sectors(bdev));
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8fd900998b4e..bd8521f94dc4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -368,6 +368,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 
 extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 				     unsigned int cmd, unsigned long arg);
+extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
+				  unsigned int cmd, unsigned long arg);
 extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
 				  unsigned int cmd, unsigned long arg);
 
@@ -385,6 +387,13 @@ static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
 	return -ENOTTY;
 }
 
+
+static inline int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
+					unsigned int cmd, unsigned long arg)
+{
+	return -ENOTTY;
+}
+
 static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
 					 fmode_t mode, unsigned int cmd,
 					 unsigned long arg)
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index 42c3366cc25f..07b5fde21d9f 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -142,6 +142,38 @@ struct blk_zone_range {
 	__u64		nr_sectors;
 };
 
+/**
+ * enum blk_zone_action - Zone state transitions managed from user-space
+ *
+ * @BLK_ZONE_MGMT_CLOSE: Transition to Closed state
+ * @BLK_ZONE_MGMT_FINISH: Transition to Full state
+ * @BLK_ZONE_MGMT_OPEN: Transition to Open state
+ * @BLK_ZONE_MGMT_RESET: Transition to Empty state
+ */
+enum blk_zone_action {
+	BLK_ZONE_MGMT_CLOSE	= 0x1,
+	BLK_ZONE_MGMT_FINISH	= 0x2,
+	BLK_ZONE_MGMT_OPEN	= 0x3,
+	BLK_ZONE_MGMT_RESET	= 0x4,
+};
+
+/**
+ * struct blk_zone_mgmt - Extended zoned management
+ *
+ * @action: Zone action as described in enum blk_zone_action
+ * @flags: Flags for the action
+ * @sector: Starting sector of the first zone to operate on
+ * @nr_sectors: Total number of sectors of all zones to operate on
+ */
+struct blk_zone_mgmt {
+	__u8		action;
+	__u8		resv3[3];
+	__u32		flags;
+	__u64		sector;
+	__u64		nr_sectors;
+	__u64		resv31;
+};
+
 /**
  * Zoned block device ioctl's:
  *
@@ -166,5 +198,6 @@ struct blk_zone_range {
 #define BLKOPENZONE	_IOW(0x12, 134, struct blk_zone_range)
 #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
 #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
+#define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
 
 #endif /* _UAPI_BLKZONED_H */
-- 
2.17.1
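
A minimal user-space sketch of the request layout from patch 1/6.
BLKMGMTZONE and struct blk_zone_mgmt exist only in this proposed series,
so the definitions below are local mirrors of the UAPI hunk above, and
zone_mgmt_req() is a hypothetical helper that is not part of the patch.
The size check illustrates that the explicit reserved fields leave no
implicit padding in the structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the proposed UAPI; not taken from released headers. */
enum blk_zone_action {
	BLK_ZONE_MGMT_CLOSE	= 0x1,
	BLK_ZONE_MGMT_FINISH	= 0x2,
	BLK_ZONE_MGMT_OPEN	= 0x3,
	BLK_ZONE_MGMT_RESET	= 0x4,
};

struct blk_zone_mgmt {
	uint8_t		action;
	uint8_t		resv3[3];
	uint32_t	flags;
	uint64_t	sector;
	uint64_t	nr_sectors;
	uint64_t	resv31;
};

/* 1 + 3 + 4 + 8 + 8 + 8 bytes, with every field naturally aligned. */
static_assert(sizeof(struct blk_zone_mgmt) == 32, "unexpected padding");

/* Hypothetical helper: build a request with all reserved bytes zeroed. */
static struct blk_zone_mgmt zone_mgmt_req(uint8_t action, uint32_t flags,
					  uint64_t sector, uint64_t nr_sectors)
{
	struct blk_zone_mgmt req;

	memset(&req, 0, sizeof(req));
	req.action = action;
	req.flags = flags;
	req.sector = sector;
	req.nr_sectors = nr_sectors;
	return req;
}
```

On a kernel carrying this series, the filled structure would be passed
as ioctl(fd, BLKMGMTZONE, &req) on a zoned block device opened for
writing. That call is left out so the sketch builds without the patched
headers.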


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH 2/6] block: add support for selecting all zones
  2020-06-25 12:21 [PATCH 0/6] ZNS: Extra features for current patches Javier González
  2020-06-25 12:21 ` [PATCH 1/6] block: introduce IOCTL for zone mgmt Javier González
@ 2020-06-25 12:21 ` Javier González
  2020-06-26  1:27   ` Damien Le Moal
  2020-06-25 12:21 ` [PATCH 3/6] block: add support for zone offline transition Javier González
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-25 12:21 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

From: Javier González <javier.gonz@samsung.com>

Add a flag that selects all zones in a single zone management
operation.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/blk-zoned.c             | 3 +++
 include/linux/blk_types.h     | 3 ++-
 include/uapi/linux/blkzoned.h | 9 +++++++++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index e87c60004dc5..29194388a1bb 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -420,6 +420,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
 		return -ENOTTY;
 	}
 
+	if (zmgmt.flags & BLK_ZONE_SELECT_ALL)
+		op |= REQ_ZONE_ALL;
+
 	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
 				GFP_KERNEL);
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ccb895f911b1..16b57fb2b99c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -351,6 +351,7 @@ enum req_flag_bits {
 	 * work item to avoid such priority inversions.
 	 */
 	__REQ_CGROUP_PUNT,
+	__REQ_ZONE_ALL,		/* apply zone operation to all zones */
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
@@ -378,7 +379,7 @@ enum req_flag_bits {
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
-
+#define REQ_ZONE_ALL		(1ULL << __REQ_ZONE_ALL)
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index 07b5fde21d9f..a8c89fe58f97 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -157,6 +157,15 @@ enum blk_zone_action {
 	BLK_ZONE_MGMT_RESET	= 0x4,
 };
 
+/**
+ * enum blk_zone_mgmt_flags - Flags for blk_zone_mgmt
+ *
+ * @BLK_ZONE_SELECT_ALL: Select all zones for the current zone action
+ */
+enum blk_zone_mgmt_flags {
+	BLK_ZONE_SELECT_ALL	= 1 << 0,
+};
+
 /**
  * struct blk_zone_mgmt - Extended zoned management
  *
-- 
2.17.1
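
The "op |= REQ_ZONE_ALL" line in patch 2/6 relies on the block layer
convention that the operation code sits in the low REQ_OP_BITS (8) bits
of the op word and request flags sit above it. The sketch below models
that composition in plain C. The bit index used for the new flag is a
placeholder chosen for illustration, since the real index follows from
its position in enum req_flag_bits; the value 12 for the finish
operation is the one shown in the blk_types.h context of this series.

```c
#include <stdint.h>

#define REQ_OP_BITS	8
#define REQ_OP_MASK	((1u << REQ_OP_BITS) - 1)

/* Value shown in the series' blk_types.h context. */
#define SKETCH_REQ_OP_ZONE_FINISH	12u

/* Placeholder bit index, an assumption made only for this sketch. */
#define SKETCH_REQ_ZONE_ALL_BIT		20
#define SKETCH_REQ_ZONE_ALL		(1u << SKETCH_REQ_ZONE_ALL_BIT)

/* Mirror of the proposed UAPI flag. */
#define BLK_ZONE_SELECT_ALL		(1u << 0)

/* Translate the user-visible flag into the request flag, as the ioctl does. */
static uint32_t apply_zone_flags(uint32_t op, uint32_t uapi_flags)
{
	if (uapi_flags & BLK_ZONE_SELECT_ALL)
		op |= SKETCH_REQ_ZONE_ALL;
	return op;
}

/* Recover the operation code from a combined op/flags word. */
static uint32_t req_op(uint32_t opf)
{
	return opf & REQ_OP_MASK;
}
```

Because the flag lives outside the op mask, a driver can still switch on
the operation code and test the flag separately.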



* [PATCH 3/6] block: add support for zone offline transition
  2020-06-25 12:21 [PATCH 0/6] ZNS: Extra features for current patches Javier González
  2020-06-25 12:21 ` [PATCH 1/6] block: introduce IOCTL for zone mgmt Javier González
  2020-06-25 12:21 ` [PATCH 2/6] block: add support for selecting all zones Javier González
@ 2020-06-25 12:21 ` Javier González
  2020-06-25 14:12   ` Matias Bjørling
  2020-06-26  1:34   ` Damien Le Moal
  2020-06-25 12:21 ` [PATCH 4/6] block: introduce IOCTL to report dev properties Javier González
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 70+ messages in thread
From: Javier González @ 2020-06-25 12:21 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

From: Javier González <javier.gonz@samsung.com>

Add support for the offline transition on zoned block devices using the
new zone management IOCTL.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/blk-core.c              | 2 ++
 block/blk-zoned.c             | 3 +++
 drivers/nvme/host/core.c      | 3 +++
 include/linux/blk_types.h     | 3 +++
 include/linux/blkdev.h        | 1 -
 include/uapi/linux/blkzoned.h | 1 +
 6 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03252af8c82c..589cbdacc5ec 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_CLOSE),
 	REQ_OP_NAME(ZONE_FINISH),
 	REQ_OP_NAME(ZONE_APPEND),
+	REQ_OP_NAME(ZONE_OFFLINE),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(SCSI_IN),
@@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
 	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_OFFLINE:
 		if (!blk_queue_is_zoned(q))
 			goto not_supported;
 		break;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 29194388a1bb..704fc15813d1 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
 	case BLK_ZONE_MGMT_RESET:
 		op = REQ_OP_ZONE_RESET;
 		break;
+	case BLK_ZONE_MGMT_OFFLINE:
+		op = REQ_OP_ZONE_OFFLINE;
+		break;
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f1215523792b..5b95c81d2a2d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 	case REQ_OP_ZONE_FINISH:
 		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
 		break;
+	case REQ_OP_ZONE_OFFLINE:
+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
+		break;
 	case REQ_OP_WRITE_ZEROES:
 		ret = nvme_setup_write_zeroes(ns, req, cmd);
 		break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 16b57fb2b99c..b3921263c3dd 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -316,6 +316,8 @@ enum req_opf {
 	REQ_OP_ZONE_FINISH	= 12,
 	/* write data at the current zone write pointer */
 	REQ_OP_ZONE_APPEND	= 13,
+	/* Transition a zone to offline */
+	REQ_OP_ZONE_OFFLINE	= 14,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
@@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
 	case REQ_OP_ZONE_FINISH:
+	case REQ_OP_ZONE_OFFLINE:
 		return true;
 	default:
 		return false;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bd8521f94dc4..8308d8a3720b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
 				  unsigned int cmd, unsigned long arg);
 extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
 				  unsigned int cmd, unsigned long arg);
-
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index a8c89fe58f97..d0978ee10fc7 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -155,6 +155,7 @@ enum blk_zone_action {
 	BLK_ZONE_MGMT_FINISH	= 0x2,
 	BLK_ZONE_MGMT_OPEN	= 0x3,
 	BLK_ZONE_MGMT_RESET	= 0x4,
+	BLK_ZONE_MGMT_OFFLINE	= 0x5,
 };
 
 /**
-- 
2.17.1
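
With patch 3/6 the action enum gains a fifth value. A user-space tool
that drives the proposed IOCTL would need to map a command name to one
of these values; the sketch below shows one way to do that. The enum is
a local mirror of the UAPI hunks in this series, while
parse_zone_action() and the lower-case command names are hypothetical
and not part of any patch.

```c
#include <stddef.h>
#include <string.h>

/* Local mirror of the proposed UAPI enum after patch 3/6. */
enum blk_zone_action {
	BLK_ZONE_MGMT_CLOSE	= 0x1,
	BLK_ZONE_MGMT_FINISH	= 0x2,
	BLK_ZONE_MGMT_OPEN	= 0x3,
	BLK_ZONE_MGMT_RESET	= 0x4,
	BLK_ZONE_MGMT_OFFLINE	= 0x5,
};

/* Hypothetical helper: return the action for a name, or -1 if unknown. */
static int parse_zone_action(const char *name)
{
	static const struct {
		const char *name;
		int action;
	} tbl[] = {
		{ "close",	BLK_ZONE_MGMT_CLOSE },
		{ "finish",	BLK_ZONE_MGMT_FINISH },
		{ "open",	BLK_ZONE_MGMT_OPEN },
		{ "reset",	BLK_ZONE_MGMT_RESET },
		{ "offline",	BLK_ZONE_MGMT_OFFLINE },
	};
	size_t i;

	for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
		if (strcmp(name, tbl[i].name) == 0)
			return tbl[i].action;
	}
	return -1;
}
```

Rejecting unknown names in user space mirrors the kernel side, where the
switch statement returns -ENOTTY for an unrecognized action.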



* [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 12:21 [PATCH 0/6] ZNS: Extra features for current patches Javier González
                   ` (2 preceding siblings ...)
  2020-06-25 12:21 ` [PATCH 3/6] block: add support for zone offline transition Javier González
@ 2020-06-25 12:21 ` Javier González
  2020-06-25 13:10   ` Matias Bjørling
  2020-06-26  1:38   ` Damien Le Moal
  2020-06-25 12:21 ` [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct Javier González
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 70+ messages in thread
From: Javier González @ 2020-06-25 12:21 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

From: Javier González <javier.gonz@samsung.com>

With the addition of ZNS, a new set of properties has been added to the
zoned block device. This patch introduces a new IOCTL to expose these
properties to user space.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
 block/ioctl.c                 |  2 ++
 drivers/nvme/host/core.c      |  2 ++
 drivers/nvme/host/nvme.h      | 11 +++++++
 drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h        |  9 ++++++
 include/uapi/linux/blkzoned.h | 13 ++++++++
 7 files changed, 144 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 704fc15813d1..39ec72af9537 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL_GPL(blkdev_report_zones);
 
+static int blkdev_report_zonedev_prop(struct block_device *bdev,
+				      struct blk_zone_dev *zprop)
+{
+	struct gendisk *disk = bdev->bd_disk;
+
+	if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
+		return -EOPNOTSUPP;
+
+	return disk->fops->report_zone_p(disk, zprop);
+}
+
 static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
 						sector_t sector,
 						sector_t nr_sectors)
@@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
 				GFP_KERNEL);
 }
 
+int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
+			unsigned int cmd, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct request_queue *q;
+	struct blk_zone_dev zprop;
+	int ret;
+
+	if (!argp)
+		return -EINVAL;
+
+	q = bdev_get_queue(bdev);
+	if (!q)
+		return -ENXIO;
+
+	if (!blk_queue_is_zoned(q))
+		return -ENOTTY;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	ret = blkdev_report_zonedev_prop(bdev, &zprop);
+	if (ret)
+		goto out;
+
+	if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
+		return -EFAULT;
+
+out:
+	return ret;
+}
+
 static inline unsigned long *blk_alloc_zone_bitmap(int node,
 						   unsigned int nr_zones)
 {
diff --git a/block/ioctl.c b/block/ioctl.c
index 0ea29754e7dd..f7b4e0f2dd4c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
 		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
 	case BLKMGMTZONE:
 		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
+	case BLKZONEDEVPROP:
+		return blkdev_zonedev_prop(bdev, mode, cmd, arg);
 	case BLKGETZONESZ:
 		return put_uint(argp, bdev_zone_sectors(bdev));
 	case BLKGETNRZONES:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5b95c81d2a2d..a32c909a915f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2254,6 +2254,7 @@ static const struct block_device_operations nvme_fops = {
 	.getgeo		= nvme_getgeo,
 	.revalidate_disk= nvme_revalidate_disk,
 	.report_zones	= nvme_report_zones,
+	.report_zone_p	= nvme_report_zone_prop,
 	.pr_ops		= &nvme_pr_ops,
 };
 
@@ -2280,6 +2281,7 @@ const struct block_device_operations nvme_ns_head_ops = {
 	.compat_ioctl	= nvme_compat_ioctl,
 	.getgeo		= nvme_getgeo,
 	.report_zones	= nvme_report_zones,
+	.report_zone_p	= nvme_report_zone_prop,
 	.pr_ops		= &nvme_pr_ops,
 };
 #endif /* CONFIG_NVME_MULTIPATH */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ecf443efdf91..172e0531f37f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -407,6 +407,14 @@ struct nvme_ns {
 	u8 pi_type;
 #ifdef CONFIG_BLK_DEV_ZONED
 	u64 zsze;
+
+	u32 nr_zones;
+	u32 mar;
+	u32 mor;
+	u32 rrl;
+	u32 frl;
+	u16 zoc;
+	u16 ozcs;
 #endif
 	unsigned long features;
 	unsigned long flags;
@@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
 int nvme_report_zones(struct gendisk *disk, sector_t sector,
 		      unsigned int nr_zones, report_zones_cb cb, void *data);
 
+int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop);
+
 blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
 				       struct nvme_command *cmnd,
 				       enum nvme_zone_mgmt_action action);
 #else
 #define nvme_report_zones NULL
+#define nvme_report_zone_prop NULL
 
 static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 2e6512ac6f01..258d03610cc0 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -32,6 +32,28 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
 	return 0;
 }
 
+static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
+{
+	struct nvme_command c = { };
+	struct nvme_zone_report report;
+	int buflen = sizeof(struct nvme_zone_report);
+	int ret;
+
+	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
+	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
+	c.zmr.slba = cpu_to_le64(0);
+	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
+	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
+	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
+	c.zmr.pr = 0;
+
+	ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
+	if (ret)
+		return ret;
+
+	return le64_to_cpu(report.nr_zones);
+}
+
 int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
 			  unsigned lbaf)
 {
@@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
 		goto free_data;
 	}
 
+	ns->nr_zones = nvme_zns_nr_zones(ns);
+	ns->mar = le32_to_cpu(id->mar);
+	ns->mor = le32_to_cpu(id->mor);
+	ns->rrl = le32_to_cpu(id->rrl);
+	ns->frl = le32_to_cpu(id->frl);
+	ns->zoc = le16_to_cpu(id->zoc);
+
 	q->limits.zoned = BLK_ZONED_HM;
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
 free_data:
@@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
 	return ret;
 }
 
+static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct blk_zone_dev *zprop)
+{
+	zprop->nr_zones = ns->nr_zones;
+	zprop->zoc = ns->zoc;
+	zprop->ozcs = ns->ozcs;
+	zprop->mar = ns->mar;
+	zprop->mor = ns->mor;
+	zprop->rrl = ns->rrl;
+	zprop->frl = ns->frl;
+
+	return 0;
+}
+
+int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop)
+{
+	struct nvme_ns_head *head = NULL;
+	struct nvme_ns *ns;
+	int srcu_idx, ret;
+
+	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
+	if (unlikely(!ns))
+		return -EWOULDBLOCK;
+
+	if (ns->head->ids.csi == NVME_CSI_ZNS)
+		ret = nvme_ns_report_zone_prop(ns, zprop);
+	else
+		ret = -EINVAL;
+	nvme_put_ns_from_disk(head, srcu_idx);
+
+	return ret;
+}
+
 blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *c, enum nvme_zone_mgmt_action action)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8308d8a3720b..0c0faa58b7f4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
 				  unsigned int cmd, unsigned long arg);
 extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
 				  unsigned int cmd, unsigned long arg);
+extern int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
+			unsigned int cmd, unsigned long arg);
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
@@ -400,6 +402,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
 	return -ENOTTY;
 }
 
+static inline int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
+				      unsigned int cmd, unsigned long arg)
+{
+	return -ENOTTY;
+}
+
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 struct request_queue {
@@ -1770,6 +1778,7 @@ struct block_device_operations {
 	int (*report_zones)(struct gendisk *, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
 	char *(*devnode)(struct gendisk *disk, umode_t *mode);
+	int (*report_zone_p)(struct gendisk *disk, struct blk_zone_dev *zprop);
 	struct module *owner;
 	const struct pr_ops *pr_ops;
 };
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index d0978ee10fc7..0c49a4b2ce5d 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -142,6 +142,18 @@ struct blk_zone_range {
 	__u64		nr_sectors;
 };
 
+struct blk_zone_dev {
+	__u32	nr_zones;
+	__u32	mar;
+	__u32	mor;
+	__u32	rrl;
+	__u32	frl;
+	__u16	zoc;
+	__u16	ozcs;
+	__u32	rsv31[2];
+	__u64	rsv63[4];
+};
+
 /**
  * enum blk_zone_action - Zone state transitions managed from user-space
  *
@@ -209,5 +221,6 @@ struct blk_zone_mgmt {
 #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
 #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
 #define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
+#define BLKZONEDEVPROP	_IOR(0x12, 138, struct blk_zone_dev)
 
 #endif /* _UAPI_BLKZONED_H */
-- 
2.17.1
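
A sketch of how user space might interpret the properties reported by
the IOCTL proposed in patch 4/6. The struct is a local mirror of the
UAPI hunk above. The helper zns_resource_limit() is hypothetical and
assumes the ZNS convention that MAR and MOR are 0's based values, with
FFFFFFFFh meaning "no limit"; the patch itself copies the raw values
from the identify data without conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the proposed UAPI; not taken from released headers. */
struct blk_zone_dev {
	uint32_t	nr_zones;
	uint32_t	mar;
	uint32_t	mor;
	uint32_t	rrl;
	uint32_t	frl;
	uint16_t	zoc;
	uint16_t	ozcs;
	uint32_t	rsv31[2];
	uint64_t	rsv63[4];
};

/* 5 * 4 + 2 * 2 + 2 * 4 + 4 * 8 bytes, with no implicit padding. */
static_assert(sizeof(struct blk_zone_dev) == 64, "unexpected padding");

/*
 * Hypothetical helper: convert a raw 0's based MAR or MOR value into a
 * count, where 0 means "no limit". The unsigned addition wraps
 * FFFFFFFFh to 0, which is well defined in C.
 */
static uint32_t zns_resource_limit(uint32_t raw)
{
	return raw + 1u;
}
```

An application could compare the converted MOR value against the number
of zones it intends to keep open before issuing writes.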



* [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-25 12:21 [PATCH 0/6] ZNS: Extra features for current patches Javier González
                   ` (3 preceding siblings ...)
  2020-06-25 12:21 ` [PATCH 4/6] block: introduce IOCTL to report dev properties Javier González
@ 2020-06-25 12:21 ` Javier González
  2020-06-25 15:13   ` Matias Bjørling
                     ` (2 more replies)
  2020-06-25 12:21 ` [PATCH 6/6] nvme: Add consistency check for zone count Javier González
  2020-06-25 13:04 ` [PATCH 0/6] ZNS: Extra features for current patches Matias Bjørling
  6 siblings, 3 replies; 70+ messages in thread
From: Javier González @ 2020-06-25 12:21 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

From: Javier González <javier.gonz@samsung.com>

Add a zone attributes field to the blk_zone structure. Use the ZNS
attributes as the base for zoned block devices in general.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/nvme/host/zns.c       |  1 +
 include/uapi/linux/blkzoned.h | 13 ++++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 258d03610cc0..7d8381fe7665 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -195,6 +195,7 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
 	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
 	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
 	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
+	zone.attr = entry->za;
 
 	return cb(&zone, idx, data);
 }
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index 0c49a4b2ce5d..2e43a00e3425 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -82,6 +82,16 @@ enum blk_zone_report_flags {
 	BLK_ZONE_REP_CAPACITY	= (1 << 0),
 };
 
+/**
+ * Zone Attributes
+ */
+enum blk_zone_attr {
+	BLK_ZONE_ATTR_ZFC	= 1 << 0,
+	BLK_ZONE_ATTR_FZR	= 1 << 1,
+	BLK_ZONE_ATTR_RZR	= 1 << 2,
+	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
+};
+
 /**
  * struct blk_zone - Zone descriptor for BLKREPORTZONE ioctl.
  *
@@ -108,7 +118,8 @@ struct blk_zone {
 	__u8	cond;		/* Zone condition */
 	__u8	non_seq;	/* Non-sequential write resources active */
 	__u8	reset;		/* Reset write pointer recommended */
-	__u8	resv[4];
+	__u8	attr;		/* Zone attributes */
+	__u8	resv[3];
 	__u64	capacity;	/* Zone capacity in number of sectors */
 	__u8	reserved[24];
 };
-- 
2.17.1
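
The attribute bits added in patch 5/6 could be decoded in user space
along the following lines. The enum is a local mirror of the UAPI hunk
above; zone_attr_str() and the short names it prints are hypothetical
and chosen to match the enum suffixes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the proposed UAPI enum. */
enum blk_zone_attr {
	BLK_ZONE_ATTR_ZFC	= 1 << 0,
	BLK_ZONE_ATTR_FZR	= 1 << 1,
	BLK_ZONE_ATTR_RZR	= 1 << 2,
	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
};

/* Hypothetical helper: write the names of the set bits, space separated. */
static void zone_attr_str(uint8_t attr, char *buf, size_t len)
{
	static const struct {
		uint8_t bit;
		const char *name;
	} tbl[] = {
		{ BLK_ZONE_ATTR_ZFC,	"ZFC" },
		{ BLK_ZONE_ATTR_FZR,	"FZR" },
		{ BLK_ZONE_ATTR_RZR,	"RZR" },
		{ BLK_ZONE_ATTR_ZDEV,	"ZDEV" },
	};
	size_t i, off = 0;

	if (len == 0)
		return;
	buf[0] = '\0';

	for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
		int n;

		/* Skip clear bits, and stop appending once the buffer is full. */
		if (!(attr & tbl[i].bit) || off >= len)
			continue;
		n = snprintf(buf + off, len - off, "%s%s",
			     off ? " " : "", tbl[i].name);
		if (n < 0)
			break;
		off += (size_t)n;
	}
}
```

The input would be the attr byte of a struct blk_zone entry returned by
BLKREPORTZONE on a kernel carrying this patch.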



* [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-25 12:21 [PATCH 0/6] ZNS: Extra features for current patches Javier González
                   ` (4 preceding siblings ...)
  2020-06-25 12:21 ` [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct Javier González
@ 2020-06-25 12:21 ` Javier González
  2020-06-25 13:16   ` Matias Bjørling
                     ` (2 more replies)
  2020-06-25 13:04 ` [PATCH 0/6] ZNS: Extra features for current patches Matias Bjørling
  6 siblings, 3 replies; 70+ messages in thread
From: Javier González @ 2020-06-25 12:21 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

From: Javier González <javier.gonz@samsung.com>

The number of zones is calculated from the reported device capacity, but
the ZNS specification also allows the device to report its total number
of zones. Add an extra check to guarantee consistency between the device
and the kernel.

Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 drivers/nvme/host/zns.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 7d8381fe7665..de806788a184 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 		sector += ns->zsze * nz;
 	}
 
+	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
+		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
+				zone_idx, ns->nr_zones);
+		ret = -EINVAL;
+		goto out_free;
+	}
+
 	ret = zone_idx;
 out_free:
 	kvfree(report);
-- 
2.17.1
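
The idea behind patch 6/6 can be illustrated with plain arithmetic. The
patch itself compares the zone index accumulated while parsing the
report against ns->nr_zones; the sketch below instead shows the
underlying comparison between a zone count derived from capacity and a
count reported by the device. Both function names are hypothetical, and
counting a partial last zone as a full zone is an assumption of this
sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Derive a zone count from capacity, rounding up a partial last zone. */
static uint64_t zones_from_capacity(uint64_t capacity, uint64_t zone_size)
{
	if (zone_size == 0)
		return 0;
	return capacity / zone_size + (capacity % zone_size != 0);
}

/* True when the derived count matches the count the device reported. */
static bool zone_count_consistent(uint64_t capacity, uint64_t zone_size,
				  uint64_t reported)
{
	return zone_size != 0 &&
	       zones_from_capacity(capacity, zone_size) == reported;
}
```

Capacity and zone size only need to share a unit, for example 512-byte
sectors.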



* Re: [PATCH 0/6] ZNS: Extra features for current patches
  2020-06-25 12:21 [PATCH 0/6] ZNS: Extra features for current patches Javier González
                   ` (5 preceding siblings ...)
  2020-06-25 12:21 ` [PATCH 6/6] nvme: Add consistency check for zone count Javier González
@ 2020-06-25 13:04 ` Matias Bjørling
  2020-06-25 14:48   ` Matias Bjørling
  6 siblings, 1 reply; 70+ messages in thread
From: Matias Bjørling @ 2020-06-25 13:04 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González

On 25/06/2020 14.21, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
>
> This patchset extends zoned device functionality on top of the existing
> v3 ZNS patchset that Keith sent last week.
>
> Patches 1-5 are zoned block interface and IOCTL additions to expose ZNS
> values to user-space. One major change is the addition of a new zone
> management IOCTL that allows to extend zone management commands with
> flags. I recall a conversation in the mailing list from early this year
> where a similar approach was proposed by Matias, but never made it
> upstream. We extended the IOCTL here to align with the comments in that
> thread. Here, we are happy to get sign-offs by anyone that contributed
> to the thread - just comment here or on the patch.

The original patchset is available here: https://lkml.org/lkml/2019/6/21/419

We wanted to wait to post our updated patches until the base patches
were upstream. I guess the cat is out of the bag. :)

For the open/finish/reset patch, you'll want to take a look at the 
original patchset, and apply the feedback from that thread to your 
patch. Please also consider the users of these operations, e.g., dm, 
scsi, null_blk, etc. The original patchset has patches for that.





>
> Patch 6 is nvme-only and adds an extra check to the ZNS report code to
> ensure consistency on the zone count.
>
> The patches apply on top of Jens' block-5.8 + Keith's V3 ZNS patches.
>
> Thanks,
> Javier
>
> Javier González (6):
>    block: introduce IOCTL for zone mgmt
>    block: add support for selecting all zones
>    block: add support for zone offline transition
>    block: introduce IOCTL to report dev properties
>    block: add zone attr. to zone mgmt IOCTL struct
>    nvme: Add consistency check for zone count
>
>   block/blk-core.c              |   2 +
>   block/blk-zoned.c             | 108 +++++++++++++++++++++++++++++++++-
>   block/ioctl.c                 |   4 ++
>   drivers/nvme/host/core.c      |   5 ++
>   drivers/nvme/host/nvme.h      |  11 ++++
>   drivers/nvme/host/zns.c       |  69 ++++++++++++++++++++++
>   include/linux/blk_types.h     |   6 +-
>   include/linux/blkdev.h        |  19 +++++-
>   include/uapi/linux/blkzoned.h |  69 +++++++++++++++++++++-
>   9 files changed, 289 insertions(+), 4 deletions(-)
>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 12:21 ` [PATCH 4/6] block: introduce IOCTL to report dev properties Javier González
@ 2020-06-25 13:10   ` Matias Bjørling
  2020-06-25 19:42     ` Javier González
  2020-06-26  1:38   ` Damien Le Moal
  1 sibling, 1 reply; 70+ messages in thread
From: Matias Bjørling @ 2020-06-25 13:10 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 25/06/2020 14.21, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
>
> With the addition of ZNS, a new set of properties has been added to the
> zoned block device. This patch introduces a new IOCTL to expose these
> properties to user space.
>
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>   block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
>   block/ioctl.c                 |  2 ++
>   drivers/nvme/host/core.c      |  2 ++
>   drivers/nvme/host/nvme.h      | 11 +++++++
>   drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
>   include/linux/blkdev.h        |  9 ++++++
>   include/uapi/linux/blkzoned.h | 13 ++++++++
>   7 files changed, 144 insertions(+)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 704fc15813d1..39ec72af9537 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>   }
>   EXPORT_SYMBOL_GPL(blkdev_report_zones);
>   
> +static int blkdev_report_zonedev_prop(struct block_device *bdev,
> +				      struct blk_zone_dev *zprop)
> +{
> +	struct gendisk *disk = bdev->bd_disk;
> +
> +	if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
> +		return -EOPNOTSUPP;
> +
> +	return disk->fops->report_zone_p(disk, zprop);
> +}
> +
>   static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
>   						sector_t sector,
>   						sector_t nr_sectors)
> @@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>   				GFP_KERNEL);
>   }
>   
> +int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
> +			unsigned int cmd, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	struct request_queue *q;
> +	struct blk_zone_dev zprop;
> +	int ret;
> +
> +	if (!argp)
> +		return -EINVAL;
> +
> +	q = bdev_get_queue(bdev);
> +	if (!q)
> +		return -ENXIO;
> +
> +	if (!blk_queue_is_zoned(q))
> +		return -ENOTTY;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +
> +	if (!(mode & FMODE_WRITE))
> +		return -EBADF;
> +
> +	ret = blkdev_report_zonedev_prop(bdev, &zprop);
> +	if (ret)
> +		goto out;
> +
> +	if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
> +		return -EFAULT;
> +
> +out:
> +	return ret;
> +}
> +
>   static inline unsigned long *blk_alloc_zone_bitmap(int node,
>   						   unsigned int nr_zones)
>   {
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 0ea29754e7dd..f7b4e0f2dd4c 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>   		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>   	case BLKMGMTZONE:
>   		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
> +	case BLKZONEDEVPROP:
> +		return blkdev_zonedev_prop(bdev, mode, cmd, arg);
>   	case BLKGETZONESZ:
>   		return put_uint(argp, bdev_zone_sectors(bdev));
>   	case BLKGETNRZONES:
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 5b95c81d2a2d..a32c909a915f 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2254,6 +2254,7 @@ static const struct block_device_operations nvme_fops = {
>   	.getgeo		= nvme_getgeo,
>   	.revalidate_disk= nvme_revalidate_disk,
>   	.report_zones	= nvme_report_zones,
> +	.report_zone_p	= nvme_report_zone_prop,
>   	.pr_ops		= &nvme_pr_ops,
>   };
>   
> @@ -2280,6 +2281,7 @@ const struct block_device_operations nvme_ns_head_ops = {
>   	.compat_ioctl	= nvme_compat_ioctl,
>   	.getgeo		= nvme_getgeo,
>   	.report_zones	= nvme_report_zones,
> +	.report_zone_p	= nvme_report_zone_prop,
>   	.pr_ops		= &nvme_pr_ops,
>   };
>   #endif /* CONFIG_NVME_MULTIPATH */
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index ecf443efdf91..172e0531f37f 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -407,6 +407,14 @@ struct nvme_ns {
>   	u8 pi_type;
>   #ifdef CONFIG_BLK_DEV_ZONED
>   	u64 zsze;
> +
> +	u32 nr_zones;
> +	u32 mar;
> +	u32 mor;
> +	u32 rrl;
> +	u32 frl;
> +	u16 zoc;
> +	u16 ozcs;
>   #endif
>   	unsigned long features;
>   	unsigned long flags;
> @@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>   int nvme_report_zones(struct gendisk *disk, sector_t sector,
>   		      unsigned int nr_zones, report_zones_cb cb, void *data);
>   
> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop);
> +
>   blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>   				       struct nvme_command *cmnd,
>   				       enum nvme_zone_mgmt_action action);
>   #else
>   #define nvme_report_zones NULL
> +#define nvme_report_zone_prop NULL
>   
>   static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>   		struct request *req, struct nvme_command *cmnd,
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 2e6512ac6f01..258d03610cc0 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -32,6 +32,28 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>   	return 0;
>   }
>   
> +static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
> +{
> +	struct nvme_command c = { };
> +	struct nvme_zone_report report;
> +	int buflen = sizeof(struct nvme_zone_report);
> +	int ret;
> +
> +	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
> +	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
> +	c.zmr.slba = cpu_to_le64(0);
> +	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
> +	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
> +	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
> +	c.zmr.pr = 0;
> +
> +	ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
> +	if (ret)
> +		return ret;
> +
> +	return le64_to_cpu(report.nr_zones);
> +}
> +
>   int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>   			  unsigned lbaf)
>   {
> @@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>   		goto free_data;
>   	}
>   
> +	ns->nr_zones = nvme_zns_nr_zones(ns);
> +	ns->mar = le32_to_cpu(id->mar);
> +	ns->mor = le32_to_cpu(id->mor);
> +	ns->rrl = le32_to_cpu(id->rrl);
> +	ns->frl = le32_to_cpu(id->frl);
> +	ns->zoc = le16_to_cpu(id->zoc);
> +
>   	q->limits.zoned = BLK_ZONED_HM;
>   	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>   free_data:
> @@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
>   	return ret;
>   }
>   
> +static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct blk_zone_dev *zprop)
> +{
> +	zprop->nr_zones = ns->nr_zones;
> +	zprop->zoc = ns->zoc;
> +	zprop->ozcs = ns->ozcs;
> +	zprop->mar = ns->mar;
> +	zprop->mor = ns->mor;
> +	zprop->rrl = ns->rrl;
> +	zprop->frl = ns->frl;
> +
> +	return 0;
> +}
> +
> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop)
> +{
> +	struct nvme_ns_head *head = NULL;
> +	struct nvme_ns *ns;
> +	int srcu_idx, ret;
> +
> +	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
> +	if (unlikely(!ns))
> +		return -EWOULDBLOCK;
> +
> +	if (ns->head->ids.csi == NVME_CSI_ZNS)
> +		ret = nvme_ns_report_zone_prop(ns, zprop);
> +	else
> +		ret = -EINVAL;
> +	nvme_put_ns_from_disk(head, srcu_idx);
> +
> +	return ret;
> +}
> +
>   blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>   		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>   {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8308d8a3720b..0c0faa58b7f4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>   				  unsigned int cmd, unsigned long arg);
>   extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>   				  unsigned int cmd, unsigned long arg);
> +extern int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
> +			unsigned int cmd, unsigned long arg);
>   #else /* CONFIG_BLK_DEV_ZONED */
>   
>   static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
> @@ -400,6 +402,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>   	return -ENOTTY;
>   }
>   
> +static inline int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
> +				      unsigned int cmd, unsigned long arg)
> +{
> +	return -ENOTTY;
> +}
> +
>   #endif /* CONFIG_BLK_DEV_ZONED */
>   
>   struct request_queue {
> @@ -1770,6 +1778,7 @@ struct block_device_operations {
>   	int (*report_zones)(struct gendisk *, sector_t sector,
>   			unsigned int nr_zones, report_zones_cb cb, void *data);
>   	char *(*devnode)(struct gendisk *disk, umode_t *mode);
> +	int (*report_zone_p)(struct gendisk *disk, struct blk_zone_dev *zprop);
>   	struct module *owner;
>   	const struct pr_ops *pr_ops;
>   };
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index d0978ee10fc7..0c49a4b2ce5d 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -142,6 +142,18 @@ struct blk_zone_range {
>   	__u64		nr_sectors;
>   };
>   
> +struct blk_zone_dev {
> +	__u32	nr_zones;
> +	__u32	mar;
> +	__u32	mor;
> +	__u32	rrl;
> +	__u32	frl;
> +	__u16	zoc;
> +	__u16	ozcs;
> +	__u32	rsv31[2];
> +	__u64	rsv63[4];
> +};
> +
>   /**
>    * enum blk_zone_action - Zone state transitions managed from user-space
>    *
> @@ -209,5 +221,6 @@ struct blk_zone_mgmt {
>   #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>   #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>   #define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
> +#define BLKZONEDEVPROP	_IOR(0x12, 138, struct blk_zone_dev)
>   
>   #endif /* _UAPI_BLKZONED_H */

Nak. These properties can already be retrieved using the nvme ioctl
passthru command, and support has also been added to nvme-cli.
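
As a rough illustration of the passthru route mentioned above, the sketch
below decodes the same fields from a raw ZNS Identify Namespace buffer in
user space. It assumes the buffer was already fetched (for example with an
admin passthru Identify command) and that the fields sit at the byte
offsets used here, which are taken from my reading of the ZNS
specification and are not shown in this thread. The names zns_props,
zns_props_decode and ZNS_PROPS_LEN are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space copy of the fields the proposed ioctl reports. */
struct zns_props {
	uint16_t zoc;
	uint16_t ozcs;
	uint32_t mar;
	uint32_t mor;
	uint32_t rrl;
	uint32_t frl;
};

/* Bytes needed to cover zoc..frl (assumed layout, see lead-in). */
#define ZNS_PROPS_LEN 20

/* Read a little-endian 16-bit value regardless of host endianness. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a little-endian 32-bit value regardless of host endianness. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the leading fields of the buffer; returns 0 on success, -1 on bad input. */
int zns_props_decode(const uint8_t *buf, size_t len, struct zns_props *out)
{
	if (!buf || !out || len < ZNS_PROPS_LEN)
		return -1;

	out->zoc  = get_le16(buf + 0);
	out->ozcs = get_le16(buf + 2);
	out->mar  = get_le32(buf + 4);
	out->mor  = get_le32(buf + 8);
	out->rrl  = get_le32(buf + 12);
	out->frl  = get_le32(buf + 16);
	return 0;
}
```

The decoder deliberately takes a plain byte buffer so it can be exercised
without a device; how the buffer is obtained is left to the caller.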


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-25 12:21 ` [PATCH 6/6] nvme: Add consistency check for zone count Javier González
@ 2020-06-25 13:16   ` Matias Bjørling
  2020-06-25 19:45     ` Javier González
  2020-06-25 21:49   ` Keith Busch
  2020-06-26  9:16   ` Christoph Hellwig
  2 siblings, 1 reply; 70+ messages in thread
From: Matias Bjørling @ 2020-06-25 13:16 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 25/06/2020 14.21, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
>
> Since the number of zones is calculated through the reported device
> capacity and the ZNS specification allows reporting the total number of
> zones in the device, add an extra check to guarantee consistency between
> the device and the kernel.
>
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>   drivers/nvme/host/zns.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 7d8381fe7665..de806788a184 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>   		sector += ns->zsze * nz;
>   	}
>   
> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
> +				zone_idx, ns->nr_zones);
> +		ret = -EINVAL;
> +		goto out_free;
> +	}
> +
>   	ret = zone_idx;
>   out_free:
>   	kvfree(report);

Sounds like a check for a broken implementation. For implementations in
the wild that exhibit this behavior, a quirk can be added. This kind of
check is generally not needed: it can easily be covered by a test case
in a validation suite. The kernel should not have to check for it.
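
A validation-suite style check along these lines could look like the
sketch below. It is only an illustration: the function names are
hypothetical, and it assumes the expected count is the capacity divided
by the zone size, rounded up so that a smaller last zone still counts as
one zone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Zones expected for a given capacity; a partial last zone counts as one. */
uint64_t expected_zone_count(uint64_t capacity_sectors, uint64_t zone_sectors)
{
	if (zone_sectors == 0)
		return 0;
	return capacity_sectors / zone_sectors +
	       (capacity_sectors % zone_sectors != 0);
}

/* True when the device-reported zone count matches the derived one. */
bool zone_count_consistent(uint64_t capacity_sectors, uint64_t zone_sectors,
			   uint64_t reported_zones)
{
	if (zone_sectors == 0)
		return false;
	return expected_zone_count(capacity_sectors, zone_sectors) ==
	       reported_zones;
}
```

A test harness would feed it the capacity and zone size it read from the
device together with the zone count returned by a full zone report.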


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-25 12:21 ` [PATCH 3/6] block: add support for zone offline transition Javier González
@ 2020-06-25 14:12   ` Matias Bjørling
  2020-06-25 19:48     ` Javier González
  2020-06-26  9:07     ` Christoph Hellwig
  2020-06-26  1:34   ` Damien Le Moal
  1 sibling, 2 replies; 70+ messages in thread
From: Matias Bjørling @ 2020-06-25 14:12 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 25/06/2020 14.21, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
>
> Add support for offline transition on the zoned block device using the
> new zone management IOCTL
>
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>   block/blk-core.c              | 2 ++
>   block/blk-zoned.c             | 3 +++
>   drivers/nvme/host/core.c      | 3 +++
>   include/linux/blk_types.h     | 3 +++
>   include/linux/blkdev.h        | 1 -
>   include/uapi/linux/blkzoned.h | 1 +
>   6 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 03252af8c82c..589cbdacc5ec 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>   	REQ_OP_NAME(ZONE_CLOSE),
>   	REQ_OP_NAME(ZONE_FINISH),
>   	REQ_OP_NAME(ZONE_APPEND),
> +	REQ_OP_NAME(ZONE_OFFLINE),
>   	REQ_OP_NAME(WRITE_SAME),
>   	REQ_OP_NAME(WRITE_ZEROES),
>   	REQ_OP_NAME(SCSI_IN),
> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>   	case REQ_OP_ZONE_OPEN:
>   	case REQ_OP_ZONE_CLOSE:
>   	case REQ_OP_ZONE_FINISH:
> +	case REQ_OP_ZONE_OFFLINE:
>   		if (!blk_queue_is_zoned(q))
>   			goto not_supported;
>   		break;
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 29194388a1bb..704fc15813d1 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>   	case BLK_ZONE_MGMT_RESET:
>   		op = REQ_OP_ZONE_RESET;
>   		break;
> +	case BLK_ZONE_MGMT_OFFLINE:
> +		op = REQ_OP_ZONE_OFFLINE;
> +		break;
>   	default:
>   		return -ENOTTY;
>   	}
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f1215523792b..5b95c81d2a2d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>   	case REQ_OP_ZONE_FINISH:
>   		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>   		break;
> +	case REQ_OP_ZONE_OFFLINE:
> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
> +		break;
>   	case REQ_OP_WRITE_ZEROES:
>   		ret = nvme_setup_write_zeroes(ns, req, cmd);
>   		break;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 16b57fb2b99c..b3921263c3dd 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -316,6 +316,8 @@ enum req_opf {
>   	REQ_OP_ZONE_FINISH	= 12,
>   	/* write data at the current zone write pointer */
>   	REQ_OP_ZONE_APPEND	= 13,
> +	/* Transition a zone to offline */
> +	REQ_OP_ZONE_OFFLINE	= 14,
>   
>   	/* SCSI passthrough using struct scsi_request */
>   	REQ_OP_SCSI_IN		= 32,
> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>   	case REQ_OP_ZONE_OPEN:
>   	case REQ_OP_ZONE_CLOSE:
>   	case REQ_OP_ZONE_FINISH:
> +	case REQ_OP_ZONE_OFFLINE:
>   		return true;
>   	default:
>   		return false;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bd8521f94dc4..8308d8a3720b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>   				  unsigned int cmd, unsigned long arg);
>   extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>   				  unsigned int cmd, unsigned long arg);
> -
>   #else /* CONFIG_BLK_DEV_ZONED */
>   
>   static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index a8c89fe58f97..d0978ee10fc7 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -155,6 +155,7 @@ enum blk_zone_action {
>   	BLK_ZONE_MGMT_FINISH	= 0x2,
>   	BLK_ZONE_MGMT_OPEN	= 0x3,
>   	BLK_ZONE_MGMT_RESET	= 0x4,
> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>   };
>   
>   /**

I am not sure this makes sense to expose through the kernel zone API.
One of the goals of the kernel zone API is to be a layer that provides
a unified zone model across SMR HDDs and ZNS SSDs. The offline zone
operation, as defined in the ZNS specification, does not have an
equivalent in SMR HDDs (ZAC/ZBC).

This is different from the Zone Capacity change, where the zone capacity
simply was the zone size for SMR HDDs, making it easy to support. The
same does not hold here: ZAC/ZBC does not offer an offline operation to
transition zones in the read-only state to the offline state.
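
The distinction can be written down as a small sketch. The enum below is
a local stand-in for the BLK_ZONE_MGMT_* actions of this series, with
hypothetical names; the value for close is assumed, since it does not
appear in the quoted hunk. The function only encodes the point made
above: open, close, finish and reset exist for both ZAC/ZBC and ZNS,
while offline is ZNS-only.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the proposed zone management actions (hypothetical names). */
enum zone_action {
	ZONE_ACT_CLOSE   = 0x1,	/* assumed value, not shown in the quoted hunk */
	ZONE_ACT_FINISH  = 0x2,
	ZONE_ACT_OPEN    = 0x3,
	ZONE_ACT_RESET   = 0x4,
	ZONE_ACT_OFFLINE = 0x5,
};

/* True when the action has an equivalent on both SMR HDDs (ZAC/ZBC) and ZNS SSDs. */
bool zone_action_is_generic(enum zone_action act)
{
	switch (act) {
	case ZONE_ACT_CLOSE:
	case ZONE_ACT_FINISH:
	case ZONE_ACT_OPEN:
	case ZONE_ACT_RESET:
		return true;
	case ZONE_ACT_OFFLINE:
	default:
		return false;
	}
}
```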


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/6] ZNS: Extra features for current patches
  2020-06-25 13:04 ` [PATCH 0/6] ZNS: Extra features for current patches Matias Bjørling
@ 2020-06-25 14:48   ` Matias Bjørling
  2020-06-25 19:39     ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Matias Bjørling @ 2020-06-25 14:48 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González

On 25/06/2020 15.04, Matias Bjørling wrote:
> On 25/06/2020 14.21, Javier González wrote:
>> From: Javier González <javier.gonz@samsung.com>
>>
>> This patchset extends zoned device functionality on top of the existing
>> v3 ZNS patchset that Keith sent last week.
>>
>> Patches 1-5 are zoned block interface and IOCTL additions to expose ZNS
>> values to user-space. One major change is the addition of a new zone
>> management IOCTL that allows to extend zone management commands with
>> flags. I recall a conversation in the mailing list from early this year
>> where a similar approach was proposed by Matias, but never made it
>> upstream. We extended the IOCTL here to align with the comments in that
>> thread. Here, we are happy to get sign-offs by anyone that contributed
>> to the thread - just comment here or on the patch.
>
> The original patchset is available here: 
> https://lkml.org/lkml/2019/6/21/419
>
> We wanted to wait posting our updated patches until the base patches 
> were upstream. I guess the cat is out of the bag. :)
>
> For the open/finish/reset patch, you'll want to take a look at the 
> original patchset, and apply the feedback from that thread to your 
> patch. Please also consider the users of these operations, e.g., dm, 
> scsi, null_blk, etc. The original patchset has patches for that.
>
Please disregard the above - I forgot that the original patchset 
actually went upstream.

You're right that we discussed (I at least discussed it internally with
Damien, but I can't find the mail) having one mgmt ioctl issuing the
commands. We didn't go ahead and add it at that point due to ZNS still
being in a fluffy state.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-25 12:21 ` [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct Javier González
@ 2020-06-25 15:13   ` Matias Bjørling
  2020-06-25 19:51     ` Javier González
  2020-06-26  1:45   ` Damien Le Moal
  2020-06-26  9:14   ` Christoph Hellwig
  2 siblings, 1 reply; 70+ messages in thread
From: Matias Bjørling @ 2020-06-25 15:13 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty, Damien Le Moal

On 25/06/2020 14.21, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
>
> Add zone attributes field to the blk_zone structure. Use ZNS attributes
> as base for zoned block devices in general.
>
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>   drivers/nvme/host/zns.c       |  1 +
>   include/uapi/linux/blkzoned.h | 13 ++++++++++++-
>   2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 258d03610cc0..7d8381fe7665 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -195,6 +195,7 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>   	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>   	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>   	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
> +	zone.attr = entry->za;
>   
>   	return cb(&zone, idx, data);
>   }
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index 0c49a4b2ce5d..2e43a00e3425 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -82,6 +82,16 @@ enum blk_zone_report_flags {
>   	BLK_ZONE_REP_CAPACITY	= (1 << 0),
>   };
>   
> +/**
> + * Zone Attributes
> + */
> +enum blk_zone_attr {
> +	BLK_ZONE_ATTR_ZFC	= 1 << 0,
> +	BLK_ZONE_ATTR_FZR	= 1 << 1,
> +	BLK_ZONE_ATTR_RZR	= 1 << 2,

The RZR bit is equivalent to the RESET bit accessible through the reset 
byte in struct blk_zone.

ZFC is tied to Zone Active Excursions, which the kernel zone model does
not support. This should probably not be added until there is generic
support.

Damien, could we overload the struct blk_zone reset variable for FZR? I
don't believe we can due to backward compatibility. Is that your
understanding as well?

> +	BLK_ZONE_ATTR_ZDEV	= 1 << 7,

There is currently no equivalent for zone descriptor extensions in
ZAC/ZBC. The addition of this field should probably be in a patch that
adds generic support for zone descriptor extensions.

> +};
> +
>   /**
>    * struct blk_zone - Zone descriptor for BLKREPORTZONE ioctl.
>    *
> @@ -108,7 +118,8 @@ struct blk_zone {
>   	__u8	cond;		/* Zone condition */
>   	__u8	non_seq;	/* Non-sequential write resources active */
>   	__u8	reset;		/* Reset write pointer recommended */
> -	__u8	resv[4];
> +	__u8	attr;		/* Zone attributes */
> +	__u8	resv[3];
>   	__u64	capacity;	/* Zone capacity in number of sectors */
>   	__u8	reserved[24];
>   };
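
For readers following the attribute discussion, here is a minimal sketch
of how user space might test the proposed attr byte. The bit positions
are the ones in the quoted enum; the ZONE_ATTR_* names, the helper names
and the treatment of all other bits as reserved are assumptions made for
this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the proposed attribute bits (same positions as the quoted enum). */
enum zone_attr {
	ZONE_ATTR_ZFC  = 1 << 0,
	ZONE_ATTR_FZR  = 1 << 1,
	ZONE_ATTR_RZR  = 1 << 2,
	ZONE_ATTR_ZDEV = 1 << 7,
};

/* All bits defined above; anything else is treated as reserved here. */
#define ZONE_ATTR_KNOWN \
	(ZONE_ATTR_ZFC | ZONE_ATTR_FZR | ZONE_ATTR_RZR | ZONE_ATTR_ZDEV)

/* True when the given attribute bit is set in the zone's attr byte. */
bool zone_attr_test(uint8_t attr, enum zone_attr bit)
{
	return (attr & bit) != 0;
}

/* Returns the bits of attr that this sketch does not know about. */
uint8_t zone_attr_reserved_bits(uint8_t attr)
{
	return (uint8_t)(attr & (uint8_t)~ZONE_ATTR_KNOWN);
}
```

A caller that wants to stay forward compatible could ignore, rather than
reject, a non-zero result from zone_attr_reserved_bits().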



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/6] ZNS: Extra features for current patches
  2020-06-25 14:48   ` Matias Bjørling
@ 2020-06-25 19:39     ` Javier González
  2020-06-25 19:53       ` Matias Bjørling
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-25 19:39 UTC (permalink / raw)
  To: Matias Bjørling; +Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe

On 25.06.2020 16:48, Matias Bjørling wrote:
>On 25/06/2020 15.04, Matias Bjørling wrote:
>>On 25/06/2020 14.21, Javier González wrote:
>>>From: Javier González <javier.gonz@samsung.com>
>>>
>>>This patchset extends zoned device functionality on top of the existing
>>>v3 ZNS patchset that Keith sent last week.
>>>
>>>Patches 1-5 are zoned block interface and IOCTL additions to expose ZNS
>>>values to user-space. One major change is the addition of a new zone
>>>management IOCTL that allows to extend zone management commands with
>>>flags. I recall a conversation in the mailing list from early this year
>>>where a similar approach was proposed by Matias, but never made it
>>>upstream. We extended the IOCTL here to align with the comments in that
>>>thread. Here, we are happy to get sign-offs by anyone that contributed
>>>to the thread - just comment here or on the patch.
>>
>>The original patchset is available here: 
>>https://lkml.org/lkml/2019/6/21/419
>>
>>We wanted to wait posting our updated patches until the base patches 
>>were upstream. I guess the cat is out of the bag. :)
>>
>>For the open/finish/reset patch, you'll want to take a look at the 
>>original patchset, and apply the feedback from that thread to your 
>>patch. Please also consider the users of these operations, e.g., dm, 
>>scsi, null_blk, etc. The original patchset has patches for that.
>>
>Please disregard the above - I forgot that the original patchset 
>actually went upstream.
>
>You're right that we discussed (I at least discussed it internally 
>with Damien, but I can't find the mail) having one mgmt issuing the 
>commands. We didn't go ahead and added it at that point due to ZNS 
>still being in a fluffy state.
>

Does the proposed IOCTL align with the use cases you have in mind? I'm
happy to take it in a different series if you want to add patches to it
for other drivers (scsi, null_blk, etc.).

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 13:10   ` Matias Bjørling
@ 2020-06-25 19:42     ` Javier González
  2020-06-25 19:58       ` Matias Bjørling
                         ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Javier González @ 2020-06-25 19:42 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 25.06.2020 15:10, Matias Bjørling wrote:
>On 25/06/2020 14.21, Javier González wrote:
>>From: Javier González <javier.gonz@samsung.com>
>>
>>With the addition of ZNS, a new set of properties has been added to the
>>zoned block device. This patch introduces a new IOCTL to expose these
>>properties to user space.
>>
>>Signed-off-by: Javier González <javier.gonz@samsung.com>
>>Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>---
>>  block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
>>  block/ioctl.c                 |  2 ++
>>  drivers/nvme/host/core.c      |  2 ++
>>  drivers/nvme/host/nvme.h      | 11 +++++++
>>  drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
>>  include/linux/blkdev.h        |  9 ++++++
>>  include/uapi/linux/blkzoned.h | 13 ++++++++
>>  7 files changed, 144 insertions(+)
>>
>>diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>index 704fc15813d1..39ec72af9537 100644
>>--- a/block/blk-zoned.c
>>+++ b/block/blk-zoned.c
>>@@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>  }
>>  EXPORT_SYMBOL_GPL(blkdev_report_zones);
>>+static int blkdev_report_zonedev_prop(struct block_device *bdev,
>>+				      struct blk_zone_dev *zprop)
>>+{
>>+	struct gendisk *disk = bdev->bd_disk;
>>+
>>+	if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
>>+		return -EOPNOTSUPP;
>>+
>>+	return disk->fops->report_zone_p(disk, zprop);
>>+}
>>+
>>  static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
>>  						sector_t sector,
>>  						sector_t nr_sectors)
>>@@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  				GFP_KERNEL);
>>  }
>>+int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>+			unsigned int cmd, unsigned long arg)
>>+{
>>+	void __user *argp = (void __user *)arg;
>>+	struct request_queue *q;
>>+	struct blk_zone_dev zprop;
>>+	int ret;
>>+
>>+	if (!argp)
>>+		return -EINVAL;
>>+
>>+	q = bdev_get_queue(bdev);
>>+	if (!q)
>>+		return -ENXIO;
>>+
>>+	if (!blk_queue_is_zoned(q))
>>+		return -ENOTTY;
>>+
>>+	if (!capable(CAP_SYS_ADMIN))
>>+		return -EACCES;
>>+
>>+	if (!(mode & FMODE_WRITE))
>>+		return -EBADF;
>>+
>>+	ret = blkdev_report_zonedev_prop(bdev, &zprop);
>>+	if (ret)
>>+		goto out;
>>+
>>+	if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
>>+		return -EFAULT;
>>+
>>+out:
>>+	return ret;
>>+}
>>+
>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>  						   unsigned int nr_zones)
>>  {
>>diff --git a/block/ioctl.c b/block/ioctl.c
>>index 0ea29754e7dd..f7b4e0f2dd4c 100644
>>--- a/block/ioctl.c
>>+++ b/block/ioctl.c
>>@@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>  		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>  	case BLKMGMTZONE:
>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>+	case BLKZONEDEVPROP:
>>+		return blkdev_zonedev_prop(bdev, mode, cmd, arg);
>>  	case BLKGETZONESZ:
>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>>  	case BLKGETNRZONES:
>>diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>index 5b95c81d2a2d..a32c909a915f 100644
>>--- a/drivers/nvme/host/core.c
>>+++ b/drivers/nvme/host/core.c
>>@@ -2254,6 +2254,7 @@ static const struct block_device_operations nvme_fops = {
>>  	.getgeo		= nvme_getgeo,
>>  	.revalidate_disk= nvme_revalidate_disk,
>>  	.report_zones	= nvme_report_zones,
>>+	.report_zone_p	= nvme_report_zone_prop,
>>  	.pr_ops		= &nvme_pr_ops,
>>  };
>>@@ -2280,6 +2281,7 @@ const struct block_device_operations nvme_ns_head_ops = {
>>  	.compat_ioctl	= nvme_compat_ioctl,
>>  	.getgeo		= nvme_getgeo,
>>  	.report_zones	= nvme_report_zones,
>>+	.report_zone_p	= nvme_report_zone_prop,
>>  	.pr_ops		= &nvme_pr_ops,
>>  };
>>  #endif /* CONFIG_NVME_MULTIPATH */
>>diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>index ecf443efdf91..172e0531f37f 100644
>>--- a/drivers/nvme/host/nvme.h
>>+++ b/drivers/nvme/host/nvme.h
>>@@ -407,6 +407,14 @@ struct nvme_ns {
>>  	u8 pi_type;
>>  #ifdef CONFIG_BLK_DEV_ZONED
>>  	u64 zsze;
>>+
>>+	u32 nr_zones;
>>+	u32 mar;
>>+	u32 mor;
>>+	u32 rrl;
>>+	u32 frl;
>>+	u16 zoc;
>>+	u16 ozcs;
>>  #endif
>>  	unsigned long features;
>>  	unsigned long flags;
>>@@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>  int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>  		      unsigned int nr_zones, report_zones_cb cb, void *data);
>>+int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop);
>>+
>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  				       struct nvme_command *cmnd,
>>  				       enum nvme_zone_mgmt_action action);
>>  #else
>>  #define nvme_report_zones NULL
>>+#define nvme_report_zone_prop NULL
>>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>>  		struct request *req, struct nvme_command *cmnd,
>>diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>index 2e6512ac6f01..258d03610cc0 100644
>>--- a/drivers/nvme/host/zns.c
>>+++ b/drivers/nvme/host/zns.c
>>@@ -32,6 +32,28 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>>  	return 0;
>>  }
>>+static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
>>+{
>>+	struct nvme_command c = { };
>>+	struct nvme_zone_report report;
>>+	int buflen = sizeof(struct nvme_zone_report);
>>+	int ret;
>>+
>>+	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
>>+	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
>>+	c.zmr.slba = cpu_to_le64(0);
>>+	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
>>+	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>>+	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>+	c.zmr.pr = 0;
>>+
>>+	ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
>>+	if (ret)
>>+		return ret;
>>+
>>+	return le64_to_cpu(report.nr_zones);
>>+}
>>+
>>  int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>  			  unsigned lbaf)
>>  {
>>@@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>  		goto free_data;
>>  	}
>>+	ns->nr_zones = nvme_zns_nr_zones(ns);
>>+	ns->mar = le32_to_cpu(id->mar);
>>+	ns->mor = le32_to_cpu(id->mor);
>>+	ns->rrl = le32_to_cpu(id->rrl);
>>+	ns->frl = le32_to_cpu(id->frl);
>>+	ns->zoc = le16_to_cpu(id->zoc);
>>+
>>  	q->limits.zoned = BLK_ZONED_HM;
>>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>>  free_data:
>>@@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>  	return ret;
>>  }
>>+static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct blk_zone_dev *zprop)
>>+{
>>+	zprop->nr_zones = ns->nr_zones;
>>+	zprop->zoc = ns->zoc;
>>+	zprop->ozcs = ns->ozcs;
>>+	zprop->mar = ns->mar;
>>+	zprop->mor = ns->mor;
>>+	zprop->rrl = ns->rrl;
>>+	zprop->frl = ns->frl;
>>+
>>+	return 0;
>>+}
>>+
>>+int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop)
>>+{
>>+	struct nvme_ns_head *head = NULL;
>>+	struct nvme_ns *ns;
>>+	int srcu_idx, ret;
>>+
>>+	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>>+	if (unlikely(!ns))
>>+		return -EWOULDBLOCK;
>>+
>>+	if (ns->head->ids.csi == NVME_CSI_ZNS)
>>+		ret = nvme_ns_report_zone_prop(ns, zprop);
>>+	else
>>+		ret = -EINVAL;
>>+	nvme_put_ns_from_disk(head, srcu_idx);
>>+
>>+	return ret;
>>+}
>>+
>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>  {
>>diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>index 8308d8a3720b..0c0faa58b7f4 100644
>>--- a/include/linux/blkdev.h
>>+++ b/include/linux/blkdev.h
>>@@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>>+extern int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>+			unsigned int cmd, unsigned long arg);
>>  #else /* CONFIG_BLK_DEV_ZONED */
>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>@@ -400,6 +402,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>  	return -ENOTTY;
>>  }
>>+static inline int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>+				      unsigned int cmd, unsigned long arg)
>>+{
>>+	return -ENOTTY;
>>+}
>>+
>>  #endif /* CONFIG_BLK_DEV_ZONED */
>>  struct request_queue {
>>@@ -1770,6 +1778,7 @@ struct block_device_operations {
>>  	int (*report_zones)(struct gendisk *, sector_t sector,
>>  			unsigned int nr_zones, report_zones_cb cb, void *data);
>>  	char *(*devnode)(struct gendisk *disk, umode_t *mode);
>>+	int (*report_zone_p)(struct gendisk *disk, struct blk_zone_dev *zprop);
>>  	struct module *owner;
>>  	const struct pr_ops *pr_ops;
>>  };
>>diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>index d0978ee10fc7..0c49a4b2ce5d 100644
>>--- a/include/uapi/linux/blkzoned.h
>>+++ b/include/uapi/linux/blkzoned.h
>>@@ -142,6 +142,18 @@ struct blk_zone_range {
>>  	__u64		nr_sectors;
>>  };
>>+struct blk_zone_dev {
>>+	__u32	nr_zones;
>>+	__u32	mar;
>>+	__u32	mor;
>>+	__u32	rrl;
>>+	__u32	frl;
>>+	__u16	zoc;
>>+	__u16	ozcs;
>>+	__u32	rsv31[2];
>>+	__u64	rsv63[4];
>>+};
>>+
>>  /**
>>   * enum blk_zone_action - Zone state transitions managed from user-space
>>   *
>>@@ -209,5 +221,6 @@ struct blk_zone_mgmt {
>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>>  #define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>>+#define BLKZONEDEVPROP	_IOR(0x12, 138, struct blk_zone_dev)
>>  #endif /* _UAPI_BLKZONED_H */
>
>Nak. These properties can already be retrieved using the nvme ioctl
>passthru command, and support has also been added to nvme-cli.
>

These properties are intended to be consumed by an application, so
nvme-cli is not of much use. I would also like to avoid sysfs variables.

We can use nvme passthru, but this bypasses the zoned block abstraction.
Why not represent ZNS features in the standard zoned block API? I am
happy to iterate on the actual implementation if you have feedback.

Javier
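
To make the proposed ABI concrete, here is a minimal user-space sketch that
mirrors the layout of struct blk_zone_dev from the quoted patch. The name
zone_dev_props and the fixed-width types are stand-ins (assumptions, not part
of the patch); the point is only that the fields, in the order quoted, pack
into 64 bytes with no implicit padding, so the reserved tail leaves room for
later additions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space mirror of the proposed struct blk_zone_dev.
 * Field order follows the quoted patch; uint32_t/uint16_t/uint64_t stand
 * in for __u32/__u16/__u64. This structure is a proposal, not upstream ABI.
 */
struct zone_dev_props {
	uint32_t nr_zones;	/* total number of zones */
	uint32_t mar;		/* maximum active resources */
	uint32_t mor;		/* maximum open resources */
	uint32_t rrl;		/* reset recommended limit */
	uint32_t frl;		/* finish recommended limit */
	uint16_t zoc;		/* zone operation characteristics */
	uint16_t ozcs;		/* optional zoned command support */
	uint32_t rsv31[2];	/* reserved */
	uint64_t rsv63[4];	/* reserved */
};

/* Compile-time layout checks: 24 bytes of fields, 8 + 32 bytes reserved. */
static_assert(sizeof(struct zone_dev_props) == 64,
	      "layout should be 64 bytes without padding");
static_assert(offsetof(struct zone_dev_props, rsv63) == 32,
	      "64-bit reserved area should start on an 8-byte boundary");
```

A caller would normally zero the structure before handing it to an ioctl so
that the reserved fields have a defined value.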



* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-25 13:16   ` Matias Bjørling
@ 2020-06-25 19:45     ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-25 19:45 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 25.06.2020 15:16, Matias Bjørling wrote:
>On 25/06/2020 14.21, Javier González wrote:
>>From: Javier González <javier.gonz@samsung.com>
>>
>>Since the number of zones is calculated from the reported device
>>capacity and the ZNS specification allows the device to report the total
>>number of zones, add an extra check to guarantee consistency between
>>the device and the kernel.
>>
>>Signed-off-by: Javier González <javier.gonz@samsung.com>
>>Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>---
>>  drivers/nvme/host/zns.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>>diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>index 7d8381fe7665..de806788a184 100644
>>--- a/drivers/nvme/host/zns.c
>>+++ b/drivers/nvme/host/zns.c
>>@@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  		sector += ns->zsze * nz;
>>  	}
>>+	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>+		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>+				zone_idx, ns->nr_zones);
>>+		ret = -EINVAL;
>>+		goto out_free;
>>+	}
>>+
>>  	ret = zone_idx;
>>  out_free:
>>  	kvfree(report);
>
>Sounds like a check for a broken implementation. For implementations
>in the wild that exhibit this behavior, a quirk can be added. This
>kind of check is generally not needed. It can easily be covered by
>a test case in a validation suite. The kernel should not have
>to check for it.
>

I don't believe it hurts to validate, as ZNS provides a method to
retrieve the actual number of zones. It can help people detect an
issue that might otherwise stay hidden for some time.

If the general opinion is that this belongs in a test suite, we can add
it to blktests (we already have it there internally). We can also have it
in both places.

Javier
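
For reference, the comparison being discussed can be sketched in a few lines
of plain C. This is only an illustration under stated assumptions: the helper
names are made up, capacity and zone size are given in the same unit, and a
trailing partial zone is assumed to count as one zone.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Number of zones implied by the device capacity and the zone size.
 * Assumption: a trailing partial zone counts as a zone. The division is
 * written to avoid overflow in capacity + zone_size - 1.
 */
static uint64_t zones_from_capacity(uint64_t capacity, uint64_t zone_size)
{
	if (zone_size == 0)
		return 0;
	return capacity / zone_size + (capacity % zone_size != 0);
}

/* True when the device-reported zone count matches the derived count. */
static bool zone_count_consistent(uint64_t capacity, uint64_t zone_size,
				  uint64_t reported_nr_zones)
{
	return zones_from_capacity(capacity, zone_size) == reported_nr_zones;
}
```

The same function works equally well in a validation suite or in a driver,
which is essentially the placement question raised above.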


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-25 14:12   ` Matias Bjørling
@ 2020-06-25 19:48     ` Javier González
  2020-06-26  1:14       ` Damien Le Moal
  2020-06-26  9:07     ` Christoph Hellwig
  1 sibling, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-25 19:48 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 25.06.2020 16:12, Matias Bjørling wrote:
>On 25/06/2020 14.21, Javier González wrote:
>>From: Javier González <javier.gonz@samsung.com>
>>
>>Add support for offline transition on the zoned block device using the
>>new zone management IOCTL
>>
>>Signed-off-by: Javier González <javier.gonz@samsung.com>
>>Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>---
>>  block/blk-core.c              | 2 ++
>>  block/blk-zoned.c             | 3 +++
>>  drivers/nvme/host/core.c      | 3 +++
>>  include/linux/blk_types.h     | 3 +++
>>  include/linux/blkdev.h        | 1 -
>>  include/uapi/linux/blkzoned.h | 1 +
>>  6 files changed, 12 insertions(+), 1 deletion(-)
>>
>>diff --git a/block/blk-core.c b/block/blk-core.c
>>index 03252af8c82c..589cbdacc5ec 100644
>>--- a/block/blk-core.c
>>+++ b/block/blk-core.c
>>@@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>>  	REQ_OP_NAME(ZONE_CLOSE),
>>  	REQ_OP_NAME(ZONE_FINISH),
>>  	REQ_OP_NAME(ZONE_APPEND),
>>+	REQ_OP_NAME(ZONE_OFFLINE),
>>  	REQ_OP_NAME(WRITE_SAME),
>>  	REQ_OP_NAME(WRITE_ZEROES),
>>  	REQ_OP_NAME(SCSI_IN),
>>@@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>>  	case REQ_OP_ZONE_OPEN:
>>  	case REQ_OP_ZONE_CLOSE:
>>  	case REQ_OP_ZONE_FINISH:
>>+	case REQ_OP_ZONE_OFFLINE:
>>  		if (!blk_queue_is_zoned(q))
>>  			goto not_supported;
>>  		break;
>>diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>index 29194388a1bb..704fc15813d1 100644
>>--- a/block/blk-zoned.c
>>+++ b/block/blk-zoned.c
>>@@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  	case BLK_ZONE_MGMT_RESET:
>>  		op = REQ_OP_ZONE_RESET;
>>  		break;
>>+	case BLK_ZONE_MGMT_OFFLINE:
>>+		op = REQ_OP_ZONE_OFFLINE;
>>+		break;
>>  	default:
>>  		return -ENOTTY;
>>  	}
>>diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>index f1215523792b..5b95c81d2a2d 100644
>>--- a/drivers/nvme/host/core.c
>>+++ b/drivers/nvme/host/core.c
>>@@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>  	case REQ_OP_ZONE_FINISH:
>>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>  		break;
>>+	case REQ_OP_ZONE_OFFLINE:
>>+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
>>+		break;
>>  	case REQ_OP_WRITE_ZEROES:
>>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>>  		break;
>>diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>index 16b57fb2b99c..b3921263c3dd 100644
>>--- a/include/linux/blk_types.h
>>+++ b/include/linux/blk_types.h
>>@@ -316,6 +316,8 @@ enum req_opf {
>>  	REQ_OP_ZONE_FINISH	= 12,
>>  	/* write data at the current zone write pointer */
>>  	REQ_OP_ZONE_APPEND	= 13,
>>+	/* Transition a zone to offline */
>>+	REQ_OP_ZONE_OFFLINE	= 14,
>>  	/* SCSI passthrough using struct scsi_request */
>>  	REQ_OP_SCSI_IN		= 32,
>>@@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>>  	case REQ_OP_ZONE_OPEN:
>>  	case REQ_OP_ZONE_CLOSE:
>>  	case REQ_OP_ZONE_FINISH:
>>+	case REQ_OP_ZONE_OFFLINE:
>>  		return true;
>>  	default:
>>  		return false;
>>diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>index bd8521f94dc4..8308d8a3720b 100644
>>--- a/include/linux/blkdev.h
>>+++ b/include/linux/blkdev.h
>>@@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>>-
>>  #else /* CONFIG_BLK_DEV_ZONED */
>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>index a8c89fe58f97..d0978ee10fc7 100644
>>--- a/include/uapi/linux/blkzoned.h
>>+++ b/include/uapi/linux/blkzoned.h
>>@@ -155,6 +155,7 @@ enum blk_zone_action {
>>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>+	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>>  };
>>  /**
>
>I am not sure this makes sense to expose through the kernel zone API.
>One of the goals of the kernel zone API is to be a layer that provides
>a unified zone model across SMR HDDs and ZNS SSDs. The offline zone
>operation, as defined in the ZNS specification, does not have an
>equivalent in SMR HDDs (ZAC/ZBC).
>
>This is different from the Zone Capacity change, where the zone
>capacity was simply the zone size for SMR HDDs, which made it easy to
>support. That is not the case here: ZAC/ZBC does not offer an offline
>operation to transition zones from the read-only state to the offline state.

I agree that a unified interface is desirable. However, the truth is
that ZAC/ZBC and ZNS are different, and will differ more and more as time
goes by. We can deal with the differences at the driver level or with checks
at the API level, but limiting ZNS to what ZAC/ZBC offers is a hard constraint.

Note too that I chose to only support this particular transition on the
new management IOCTL to avoid confusion for existing ZAC/ZBC users.

It would be good to clarify what the plan is for kernel APIs moving
forward, as I believe there is a general desire to support new ZNS
features, which will not necessarily be replicated in SMR drives.

Javier
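
As an illustration of what "checks at the API level" could look like, here is
a minimal sketch in plain C. Everything in it is hypothetical: the enum names
and the helper are not kernel code, the action values are just the ones visible
in the quoted hunk, and the choice of error codes is an assumption.

```c
#include <errno.h>

/* Hypothetical mirror of the actions shown in the quoted uapi hunk. */
enum zone_action {
	ZA_FINISH  = 0x2,
	ZA_OPEN    = 0x3,
	ZA_RESET   = 0x4,
	ZA_OFFLINE = 0x5,
};

/* Hypothetical device model tag: SMR (ZBC/ZAC) versus ZNS. */
enum zone_model {
	MODEL_ZBC_ZAC,
	MODEL_ZNS,
};

/*
 * Accept the actions common to both models everywhere, and gate the
 * ZNS-only offline transition on the device model. Returns 0 if the
 * action may be issued, or a negative errno value otherwise.
 */
static int check_zone_action(enum zone_model model, enum zone_action action)
{
	switch (action) {
	case ZA_FINISH:
	case ZA_OPEN:
	case ZA_RESET:
		return 0;
	case ZA_OFFLINE:
		return model == MODEL_ZNS ? 0 : -EOPNOTSUPP;
	default:
		return -ENOTTY;
	}
}
```

With a gate like this, existing ZAC/ZBC users never see the new transition
succeed, while ZNS users get it through the same entry point.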


* Re: [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-25 15:13   ` Matias Bjørling
@ 2020-06-25 19:51     ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-25 19:51 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty, Damien Le Moal

On 25.06.2020 17:13, Matias Bjørling wrote:
>On 25/06/2020 14.21, Javier González wrote:
>>From: Javier González <javier.gonz@samsung.com>
>>
>>Add zone attributes field to the blk_zone structure. Use ZNS attributes
>>as base for zoned block devices in general.
>>
>>Signed-off-by: Javier González <javier.gonz@samsung.com>
>>Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>---
>>  drivers/nvme/host/zns.c       |  1 +
>>  include/uapi/linux/blkzoned.h | 13 ++++++++++++-
>>  2 files changed, 13 insertions(+), 1 deletion(-)
>>
>>diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>index 258d03610cc0..7d8381fe7665 100644
>>--- a/drivers/nvme/host/zns.c
>>+++ b/drivers/nvme/host/zns.c
>>@@ -195,6 +195,7 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>>  	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>>+	zone.attr = entry->za;
>>  	return cb(&zone, idx, data);
>>  }
>>diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>index 0c49a4b2ce5d..2e43a00e3425 100644
>>--- a/include/uapi/linux/blkzoned.h
>>+++ b/include/uapi/linux/blkzoned.h
>>@@ -82,6 +82,16 @@ enum blk_zone_report_flags {
>>  	BLK_ZONE_REP_CAPACITY	= (1 << 0),
>>  };
>>+/**
>>+ * Zone Attributes
>>+ */
>>+enum blk_zone_attr {
>>+	BLK_ZONE_ATTR_ZFC	= 1 << 0,
>>+	BLK_ZONE_ATTR_FZR	= 1 << 1,
>>+	BLK_ZONE_ATTR_RZR	= 1 << 2,
>
>The RZR bit is equivalent to the RESET bit accessible through the
>reset byte in struct blk_zone.
>
>ZFC is tied to Zone Active Excursions, which the kernel zone model
>does not support. It should probably not be added until there is
>generic support.
>
>Damien, could we overload the struct blk_zone reset variable for FZR?
>I don't believe we can due to backward compatibility. Is that your
>understanding as well?
>
>>+	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
>
>There is not currently an equivalent for zone descriptor extensions in 
>ZAC/ZBC. The addition of this field should probably be in a patch that 
>adds generic support for zone descriptor extensions.

The intention was to just report the values on the IOCTL and make sure
that changes to struct blk_zone are OK.

>>+};
>>+
>>  /**
>>   * struct blk_zone - Zone descriptor for BLKREPORTZONE ioctl.
>>   *
>>@@ -108,7 +118,8 @@ struct blk_zone {
>>  	__u8	cond;		/* Zone condition */
>>  	__u8	non_seq;	/* Non-sequential write resources active */
>>  	__u8	reset;		/* Reset write pointer recommended */
>>-	__u8	resv[4];
>>+	__u8	attr;		/* Zone attributes */
>>+	__u8	resv[3];
>>  	__u64	capacity;	/* Zone capacity in number of sectors */
>>  	__u8	reserved[24];
>>  };
>
>
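
For readers following the attribute discussion, a minimal sketch of how user
space might decode the proposed attr byte. The bit positions are the ones in
the quoted enum blk_zone_attr; the struct and helper names are hypothetical,
and the expansions in the comments follow my reading of the ZNS terminology.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as proposed in the quoted enum blk_zone_attr. */
enum zone_attr_bits {
	ZONE_ATTR_ZFC  = 1 << 0,	/* zone finished by controller */
	ZONE_ATTR_FZR  = 1 << 1,	/* finish zone recommended */
	ZONE_ATTR_RZR  = 1 << 2,	/* reset zone recommended */
	ZONE_ATTR_ZDEV = 1 << 7,	/* zone descriptor extension valid */
};

/* Hypothetical decoded view of the attribute byte. */
struct zone_attr_flags {
	bool zfc;
	bool fzr;
	bool rzr;
	bool zdev;
};

/* Split the raw attribute byte into individual flags. */
static struct zone_attr_flags decode_zone_attr(uint8_t attr)
{
	struct zone_attr_flags f = {
		.zfc  = (attr & ZONE_ATTR_ZFC) != 0,
		.fzr  = (attr & ZONE_ATTR_FZR) != 0,
		.rzr  = (attr & ZONE_ATTR_RZR) != 0,
		.zdev = (attr & ZONE_ATTR_ZDEV) != 0,
	};
	return f;
}
```

Note that the rzr flag here carries the same information as the existing
reset byte in struct blk_zone, which is the overlap pointed out above.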


* Re: [PATCH 0/6] ZNS: Extra features for current patches
  2020-06-25 19:39     ` Javier González
@ 2020-06-25 19:53       ` Matias Bjørling
  2020-06-26  6:26         ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Matias Bjørling @ 2020-06-25 19:53 UTC (permalink / raw)
  To: Javier González; +Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe

On 25/06/2020 21.39, Javier González wrote:
> On 25.06.2020 16:48, Matias Bjørling wrote:
>> On 25/06/2020 15.04, Matias Bjørling wrote:
>>> On 25/06/2020 14.21, Javier González wrote:
>>>> From: Javier González <javier.gonz@samsung.com>
>>>>
>>>> This patchset extends zoned device functionality on top of the
>>>> existing v3 ZNS patchset that Keith sent last week.
>>>>
>>>> Patches 1-5 are zoned block interface and IOCTL additions to expose
>>>> ZNS values to user-space. One major change is the addition of a new
>>>> zone management IOCTL that allows zone management commands to be
>>>> extended with flags. I recall a conversation on the mailing list from
>>>> early this year where a similar approach was proposed by Matias, but
>>>> it never made it upstream. We extended the IOCTL here to align with
>>>> the comments in that thread. We are happy to get sign-offs from
>>>> anyone who contributed to the thread - just comment here or on the
>>>> patch.
>>>
>>> The original patchset is available here: 
>>> https://lkml.org/lkml/2019/6/21/419
>>>
>>> We wanted to hold off posting our updated patches until the base
>>> patches were upstream. I guess the cat is out of the bag. :)
>>>
>>> For the open/finish/reset patch, you'll want to take a look at the 
>>> original patchset, and apply the feedback from that thread to your 
>>> patch. Please also consider the users of these operations, e.g., dm, 
>>> scsi, null_blk, etc. The original patchset has patches for that.
>>>
>> Please disregard the above - I forgot that the original patchset 
>> actually went upstream.
>>
>> You're right that we discussed (I at least discussed it internally
>> with Damien, but I can't find the mail) having one mgmt issuing the
>> commands. We didn't go ahead and add it at that point due to ZNS
>> still being in a fluffy state.
>>
>
> Does the proposed IOCTL align with the use cases you have in mind? I'm
> happy to take it in a different series if you want to add patches to it
> for other drivers (scsi, null_blk, etc.).

I think the ioctl makes sense. I wanted to have it like that originally.
I'm still thinking through whether it covers the short-term cases for the
upcoming TPs.



* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 19:42     ` Javier González
@ 2020-06-25 19:58       ` Matias Bjørling
  2020-06-26  6:24         ` Javier González
  2020-06-25 20:25       ` Keith Busch
  2020-06-26  0:57       ` Damien Le Moal
  2 siblings, 1 reply; 70+ messages in thread
From: Matias Bjørling @ 2020-06-25 19:58 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 25/06/2020 21.42, Javier González wrote:
> On 25.06.2020 15:10, Matias Bjørling wrote:
>> On 25/06/2020 14.21, Javier González wrote:
>>> From: Javier González <javier.gonz@samsung.com>
>>>
>>> With the addition of ZNS, a new set of properties has been added to
>>> the zoned block device. This patch introduces a new IOCTL to expose
>>> these properties to user space.
>>>
>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>>  block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
>>>  block/ioctl.c                 |  2 ++
>>>  drivers/nvme/host/core.c      |  2 ++
>>>  drivers/nvme/host/nvme.h      | 11 +++++++
>>>  drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
>>>  include/linux/blkdev.h        |  9 ++++++
>>>  include/uapi/linux/blkzoned.h | 13 ++++++++
>>>  7 files changed, 144 insertions(+)
>>>
>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>> index 704fc15813d1..39ec72af9537 100644
>>> --- a/block/blk-zoned.c
>>> +++ b/block/blk-zoned.c
>>> @@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device 
>>> *bdev, sector_t sector,
>>>  }
>>>  EXPORT_SYMBOL_GPL(blkdev_report_zones);
>>> +static int blkdev_report_zonedev_prop(struct block_device *bdev,
>>> +                      struct blk_zone_dev *zprop)
>>> +{
>>> +    struct gendisk *disk = bdev->bd_disk;
>>> +
>>> +    if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
>>> +        return -EOPNOTSUPP;
>>> +
>>> +    return disk->fops->report_zone_p(disk, zprop);
>>> +}
>>> +
>>>  static inline bool blkdev_allow_reset_all_zones(struct block_device 
>>> *bdev,
>>>                          sector_t sector,
>>>                          sector_t nr_sectors)
>>> @@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct block_device 
>>> *bdev, fmode_t mode,
>>>                  GFP_KERNEL);
>>>  }
>>> +int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>> +            unsigned int cmd, unsigned long arg)
>>> +{
>>> +    void __user *argp = (void __user *)arg;
>>> +    struct request_queue *q;
>>> +    struct blk_zone_dev zprop;
>>> +    int ret;
>>> +
>>> +    if (!argp)
>>> +        return -EINVAL;
>>> +
>>> +    q = bdev_get_queue(bdev);
>>> +    if (!q)
>>> +        return -ENXIO;
>>> +
>>> +    if (!blk_queue_is_zoned(q))
>>> +        return -ENOTTY;
>>> +
>>> +    if (!capable(CAP_SYS_ADMIN))
>>> +        return -EACCES;
>>> +
>>> +    if (!(mode & FMODE_WRITE))
>>> +        return -EBADF;
>>> +
>>> +    ret = blkdev_report_zonedev_prop(bdev, &zprop);
>>> +    if (ret)
>>> +        goto out;
>>> +
>>> +    if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
>>> +        return -EFAULT;
>>> +
>>> +out:
>>> +    return ret;
>>> +}
>>> +
>>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>>                             unsigned int nr_zones)
>>>  {
>>> diff --git a/block/ioctl.c b/block/ioctl.c
>>> index 0ea29754e7dd..f7b4e0f2dd4c 100644
>>> --- a/block/ioctl.c
>>> +++ b/block/ioctl.c
>>> @@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct 
>>> block_device *bdev, fmode_t mode,
>>>          return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>>      case BLKMGMTZONE:
>>>          return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>> +    case BLKZONEDEVPROP:
>>> +        return blkdev_zonedev_prop(bdev, mode, cmd, arg);
>>>      case BLKGETZONESZ:
>>>          return put_uint(argp, bdev_zone_sectors(bdev));
>>>      case BLKGETNRZONES:
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 5b95c81d2a2d..a32c909a915f 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -2254,6 +2254,7 @@ static const struct block_device_operations 
>>> nvme_fops = {
>>>      .getgeo        = nvme_getgeo,
>>>      .revalidate_disk= nvme_revalidate_disk,
>>>      .report_zones    = nvme_report_zones,
>>> +    .report_zone_p    = nvme_report_zone_prop,
>>>      .pr_ops        = &nvme_pr_ops,
>>>  };
>>> @@ -2280,6 +2281,7 @@ const struct block_device_operations 
>>> nvme_ns_head_ops = {
>>>      .compat_ioctl    = nvme_compat_ioctl,
>>>      .getgeo        = nvme_getgeo,
>>>      .report_zones    = nvme_report_zones,
>>> +    .report_zone_p    = nvme_report_zone_prop,
>>>      .pr_ops        = &nvme_pr_ops,
>>>  };
>>>  #endif /* CONFIG_NVME_MULTIPATH */
>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>> index ecf443efdf91..172e0531f37f 100644
>>> --- a/drivers/nvme/host/nvme.h
>>> +++ b/drivers/nvme/host/nvme.h
>>> @@ -407,6 +407,14 @@ struct nvme_ns {
>>>      u8 pi_type;
>>>  #ifdef CONFIG_BLK_DEV_ZONED
>>>      u64 zsze;
>>> +
>>> +    u32 nr_zones;
>>> +    u32 mar;
>>> +    u32 mor;
>>> +    u32 rrl;
>>> +    u32 frl;
>>> +    u16 zoc;
>>> +    u16 ozcs;
>>>  #endif
>>>      unsigned long features;
>>>      unsigned long flags;
>>> @@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk 
>>> *disk, struct nvme_ns *ns,
>>>  int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>                unsigned int nr_zones, report_zones_cb cb, void *data);
>>> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev 
>>> *zprop);
>>> +
>>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct 
>>> request *req,
>>>                         struct nvme_command *cmnd,
>>>                         enum nvme_zone_mgmt_action action);
>>>  #else
>>>  #define nvme_report_zones NULL
>>> +#define nvme_report_zone_prop NULL
>>>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns 
>>> *ns,
>>>          struct request *req, struct nvme_command *cmnd,
>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>> index 2e6512ac6f01..258d03610cc0 100644
>>> --- a/drivers/nvme/host/zns.c
>>> +++ b/drivers/nvme/host/zns.c
>>> @@ -32,6 +32,28 @@ static int nvme_set_max_append(struct nvme_ctrl 
>>> *ctrl)
>>>      return 0;
>>>  }
>>> +static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
>>> +{
>>> +    struct nvme_command c = { };
>>> +    struct nvme_zone_report report;
>>> +    int buflen = sizeof(struct nvme_zone_report);
>>> +    int ret;
>>> +
>>> +    c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
>>> +    c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
>>> +    c.zmr.slba = cpu_to_le64(0);
>>> +    c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
>>> +    c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>>> +    c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>> +    c.zmr.pr = 0;
>>> +
>>> +    ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    return le64_to_cpu(report.nr_zones);
>>> +}
>>> +
>>>  int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>                unsigned lbaf)
>>>  {
>>> @@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk *disk, 
>>> struct nvme_ns *ns,
>>>          goto free_data;
>>>      }
>>> +    ns->nr_zones = nvme_zns_nr_zones(ns);
>>> +    ns->mar = le32_to_cpu(id->mar);
>>> +    ns->mor = le32_to_cpu(id->mor);
>>> +    ns->rrl = le32_to_cpu(id->rrl);
>>> +    ns->frl = le32_to_cpu(id->frl);
>>> +    ns->zoc = le16_to_cpu(id->zoc);
>>> +
>>>      q->limits.zoned = BLK_ZONED_HM;
>>>      blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>>>  free_data:
>>> @@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, 
>>> sector_t sector,
>>>      return ret;
>>>  }
>>> +static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct 
>>> blk_zone_dev *zprop)
>>> +{
>>> +    zprop->nr_zones = ns->nr_zones;
>>> +    zprop->zoc = ns->zoc;
>>> +    zprop->ozcs = ns->ozcs;
>>> +    zprop->mar = ns->mar;
>>> +    zprop->mor = ns->mor;
>>> +    zprop->rrl = ns->rrl;
>>> +    zprop->frl = ns->frl;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev 
>>> *zprop)
>>> +{
>>> +    struct nvme_ns_head *head = NULL;
>>> +    struct nvme_ns *ns;
>>> +    int srcu_idx, ret;
>>> +
>>> +    ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>>> +    if (unlikely(!ns))
>>> +        return -EWOULDBLOCK;
>>> +
>>> +    if (ns->head->ids.csi == NVME_CSI_ZNS)
>>> +        ret = nvme_ns_report_zone_prop(ns, zprop);
>>> +    else
>>> +        ret = -EINVAL;
>>> +    nvme_put_ns_from_disk(head, srcu_idx);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct 
>>> request *req,
>>>          struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>>  {
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 8308d8a3720b..0c0faa58b7f4 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct 
>>> block_device *bdev, fmode_t mode,
>>>                    unsigned int cmd, unsigned long arg);
>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, 
>>> fmode_t mode,
>>>                    unsigned int cmd, unsigned long arg);
>>> +extern int blkdev_zonedev_prop(struct block_device *bdev, fmode_t 
>>> mode,
>>> +            unsigned int cmd, unsigned long arg);
>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>> @@ -400,6 +402,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct 
>>> block_device *bdev,
>>>      return -ENOTTY;
>>>  }
>>> +static inline int blkdev_zonedev_prop(struct block_device *bdev, 
>>> fmode_t mode,
>>> +                      unsigned int cmd, unsigned long arg)
>>> +{
>>> +    return -ENOTTY;
>>> +}
>>> +
>>>  #endif /* CONFIG_BLK_DEV_ZONED */
>>>  struct request_queue {
>>> @@ -1770,6 +1778,7 @@ struct block_device_operations {
>>>      int (*report_zones)(struct gendisk *, sector_t sector,
>>>              unsigned int nr_zones, report_zones_cb cb, void *data);
>>>      char *(*devnode)(struct gendisk *disk, umode_t *mode);
>>> +    int (*report_zone_p)(struct gendisk *disk, struct blk_zone_dev 
>>> *zprop);
>>>      struct module *owner;
>>>      const struct pr_ops *pr_ops;
>>>  };
>>> diff --git a/include/uapi/linux/blkzoned.h 
>>> b/include/uapi/linux/blkzoned.h
>>> index d0978ee10fc7..0c49a4b2ce5d 100644
>>> --- a/include/uapi/linux/blkzoned.h
>>> +++ b/include/uapi/linux/blkzoned.h
>>> @@ -142,6 +142,18 @@ struct blk_zone_range {
>>>      __u64        nr_sectors;
>>>  };
>>> +struct blk_zone_dev {
>>> +    __u32    nr_zones;
>>> +    __u32    mar;
>>> +    __u32    mor;
>>> +    __u32    rrl;
>>> +    __u32    frl;
>>> +    __u16    zoc;
>>> +    __u16    ozcs;
>>> +    __u32    rsv31[2];
>>> +    __u64    rsv63[4];
>>> +};
>>> +
>>>  /**
>>>   * enum blk_zone_action - Zone state transitions managed from user-space
>>>   *
>>> @@ -209,5 +221,6 @@ struct blk_zone_mgmt {
>>>  #define BLKCLOSEZONE    _IOW(0x12, 135, struct blk_zone_range)
>>>  #define BLKFINISHZONE    _IOW(0x12, 136, struct blk_zone_range)
>>>  #define BLKMGMTZONE    _IOR(0x12, 137, struct blk_zone_mgmt)
>>> +#define BLKZONEDEVPROP    _IOR(0x12, 138, struct blk_zone_dev)
>>>  #endif /* _UAPI_BLKZONED_H */
>>
>> Nak. These properties can already be retrieved using the nvme ioctl
>> passthru command, and support has also been added to nvme-cli.
>>
>
> These properties are intended to be consumed by an application, so
> nvme-cli is not of much use. I would also like to avoid sysfs variables.
>
I can recommend libnvme https://github.com/linux-nvme/libnvme

It provides an easy way to retrieve the options.

> We can use nvme passthru, but this bypasses the zoned block abstraction.
> Why not represent ZNS features in the standard zoned block API? I am
> happy to iterate on the actual implementation if you have feedback.
>
> Javier
>



* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 19:42     ` Javier González
  2020-06-25 19:58       ` Matias Bjørling
@ 2020-06-25 20:25       ` Keith Busch
  2020-06-26  6:28         ` Javier González
  2020-06-26  0:57       ` Damien Le Moal
  2 siblings, 1 reply; 70+ messages in thread
From: Keith Busch @ 2020-06-25 20:25 UTC (permalink / raw)
  To: Javier González
  Cc: Matias Bjørling, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On Thu, Jun 25, 2020 at 09:42:48PM +0200, Javier González wrote:
> We can use nvme passthru, but this bypasses the zoned block abstraction.
> Why not represent ZNS features in the standard zoned block API?

This looks too nvme zns specific to want the block layer in the middle.
Just use the driver's passthrough interface.
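
To illustrate the passthrough route, here is a minimal userspace sketch,
not taken from the patchset. It assumes the Identify command with CNS 05h
and CSI 02h returns the ZNS namespace data with ZOC, OZCS, MAR, MOR, RRL
and FRL in the first 20 bytes, and that fd refers to an open NVMe device
node. The names zns_props, zns_parse_id_ns() and zns_identify_ns() are
made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

/* Host-endian copy of the leading fields of the ZNS Identify Namespace data. */
struct zns_props {
	uint16_t zoc;
	uint16_t ozcs;
	uint32_t mar;	/* raw value, 0's based per the ZNS specification */
	uint32_t mor;	/* raw value, 0's based per the ZNS specification */
	uint32_t rrl;
	uint32_t frl;
};

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the little-endian fields from a 4096-byte identify buffer. */
void zns_parse_id_ns(const uint8_t *buf, struct zns_props *p)
{
	assert(buf && p);
	p->zoc  = get_le16(buf + 0);
	p->ozcs = get_le16(buf + 2);
	p->mar  = get_le32(buf + 4);
	p->mor  = get_le32(buf + 8);
	p->rrl  = get_le32(buf + 12);
	p->frl  = get_le32(buf + 16);
}

/*
 * Issue Identify through the admin passthrough ioctl.
 * Returns 0 on success, -1 with errno set, or a positive NVMe status.
 */
int zns_identify_ns(int fd, uint32_t nsid, struct zns_props *p)
{
	uint8_t buf[4096];
	struct nvme_admin_cmd cmd;
	int ret;

	memset(buf, 0, sizeof(buf));
	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x06;			/* Identify */
	cmd.nsid = nsid;
	cmd.addr = (uint64_t)(uintptr_t)buf;
	cmd.data_len = sizeof(buf);
	cmd.cdw10 = 0x05;			/* CNS: command set specific namespace */
	cmd.cdw11 = 0x02u << 24;		/* CSI: zoned namespace command set */

	ret = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
	if (ret)
		return ret;

	zns_parse_id_ns(buf, p);
	return 0;
}
```

An application would call zns_identify_ns() once after opening the device
and cache the result. The trade-off raised in this thread still applies:
this path is NVMe-specific and does not go through the zoned block API.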


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-25 12:21 ` [PATCH 6/6] nvme: Add consistency check for zone count Javier González
  2020-06-25 13:16   ` Matias Bjørling
@ 2020-06-25 21:49   ` Keith Busch
  2020-06-26  0:04     ` Damien Le Moal
  2020-06-26  9:16   ` Christoph Hellwig
  2 siblings, 1 reply; 70+ messages in thread
From: Keith Busch @ 2020-06-25 21:49 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>  drivers/nvme/host/zns.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 7d8381fe7665..de806788a184 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  		sector += ns->zsze * nz;
>  	}
>  
> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
> +				zone_idx, ns->nr_zones);
> +		ret = -EINVAL;
> +		goto out_free;
> +	}
> +
>  	ret = zone_idx;

nr_zones is unsigned, so it's never < 0.

The API we're providing doesn't require zone_idx to equal the namespace's
nr_zones at the end, though. A subset of the total number of zones can
be requested here.
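
As a small standalone illustration of the first point (plain C, nothing
kernel-specific; to_unsigned() is a name made up for the example): a
negative value passed through an unsigned int parameter arrives as a
large positive number, so a "< 0" test on it can never be true.

```c
#include <assert.h>
#include <limits.h>

/* Convert the way an implicit int -> unsigned int argument conversion does. */
unsigned int to_unsigned(int v)
{
	return (unsigned int)v;
}

/* -1 becomes UINT_MAX, which is greater than zero like any other nonzero value. */
void check_conversion(void)
{
	assert(to_unsigned(-1) == UINT_MAX);
	assert(to_unsigned(-1) > 0u);
}
```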


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-25 21:49   ` Keith Busch
@ 2020-06-26  0:04     ` Damien Le Moal
  2020-06-26  6:13       ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  0:04 UTC (permalink / raw)
  To: Keith Busch, Javier González
  Cc: linux-nvme, linux-block, hch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/26 6:49, Keith Busch wrote:
> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>  drivers/nvme/host/zns.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index 7d8381fe7665..de806788a184 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  		sector += ns->zsze * nz;
>>  	}
>>  
>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>> +				zone_idx, ns->nr_zones);
>> +		ret = -EINVAL;
>> +		goto out_free;
>> +	}
>> +
>>  	ret = zone_idx;
> 
> nr_zones is unsigned, so it's never < 0.
> 
> The API we're providing doesn't require zone_idx to equal the namespace's
> nr_zones at the end, though. A subset of the total number of zones can
> be requested here.
> 

Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
reported zone descriptor in the current report range requested by the user,
which is not necessarily for the entire drive (i.e., provided nr zones is less
than the total number of zones of the disk and/or start sector is > 0). So
zone_idx indicates the actual number of zones reported, it is not the total
number of zones of the drive.
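
A toy model of this, as a sketch only: it assumes equally sized zones and
does not talk to any device, and model_report_zones() is a name invented
here. It counts the descriptors a report would return for a given start
sector and requested zone count.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the zone descriptors returned for a report that starts at
 * start_sector and asks for at most max_zones zones.
 */
unsigned int model_report_zones(unsigned int total_zones, uint64_t zone_sectors,
				uint64_t start_sector, unsigned int max_zones)
{
	unsigned int zone_idx = 0;	/* index within this report, not a zone number */
	uint64_t first;

	assert(zone_sectors > 0);
	first = start_sector / zone_sectors;	/* absolute number of the first zone */

	while (zone_idx < max_zones && first + zone_idx < total_zones)
		zone_idx++;

	return zone_idx;
}
```

With 100 zones of 1024 sectors, a report starting at sector 90 * 1024
returns 10 descriptors no matter how many were asked for, and a report of
8 zones from sector 0 returns 8. In both cases the final zone_idx differs
from the total zone count without anything being inconsistent.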

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 19:42     ` Javier González
  2020-06-25 19:58       ` Matias Bjørling
  2020-06-25 20:25       ` Keith Busch
@ 2020-06-26  0:57       ` Damien Le Moal
  2020-06-26  6:27         ` Javier González
  2 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  0:57 UTC (permalink / raw)
  To: Javier González, Matias Bjørling
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 4:42, Javier González wrote:
> On 25.06.2020 15:10, Matias Bjørling wrote:
>> On 25/06/2020 14.21, Javier González wrote:
>>> From: Javier González <javier.gonz@samsung.com>
>>>
>>> With the addition of ZNS, a new set of properties has been added to the
>>> zoned block device. This patch introduces a new IOCTL to expose these
>>> properties to user space.
>>>
>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>>  block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
>>>  block/ioctl.c                 |  2 ++
>>>  drivers/nvme/host/core.c      |  2 ++
>>>  drivers/nvme/host/nvme.h      | 11 +++++++
>>>  drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
>>>  include/linux/blkdev.h        |  9 ++++++
>>>  include/uapi/linux/blkzoned.h | 13 ++++++++
>>>  7 files changed, 144 insertions(+)
>>>
>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>> index 704fc15813d1..39ec72af9537 100644
>>> --- a/block/blk-zoned.c
>>> +++ b/block/blk-zoned.c
>>> @@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>>  }
>>>  EXPORT_SYMBOL_GPL(blkdev_report_zones);
>>> +static int blkdev_report_zonedev_prop(struct block_device *bdev,
>>> +				      struct blk_zone_dev *zprop)
>>> +{
>>> +	struct gendisk *disk = bdev->bd_disk;
>>> +
>>> +	if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	return disk->fops->report_zone_p(disk, zprop);
>>> +}
>>> +
>>>  static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
>>>  						sector_t sector,
>>>  						sector_t nr_sectors)
>>> @@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				GFP_KERNEL);
>>>  }
>>> +int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>> +			unsigned int cmd, unsigned long arg)
>>> +{
>>> +	void __user *argp = (void __user *)arg;
>>> +	struct request_queue *q;
>>> +	struct blk_zone_dev zprop;
>>> +	int ret;
>>> +
>>> +	if (!argp)
>>> +		return -EINVAL;
>>> +
>>> +	q = bdev_get_queue(bdev);
>>> +	if (!q)
>>> +		return -ENXIO;
>>> +
>>> +	if (!blk_queue_is_zoned(q))
>>> +		return -ENOTTY;
>>> +
>>> +	if (!capable(CAP_SYS_ADMIN))
>>> +		return -EACCES;
>>> +
>>> +	if (!(mode & FMODE_WRITE))
>>> +		return -EBADF;
>>> +
>>> +	ret = blkdev_report_zonedev_prop(bdev, &zprop);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
>>> +		return -EFAULT;
>>> +
>>> +out:
>>> +	return ret;
>>> +}
>>> +
>>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>>  						   unsigned int nr_zones)
>>>  {
>>> diff --git a/block/ioctl.c b/block/ioctl.c
>>> index 0ea29754e7dd..f7b4e0f2dd4c 100644
>>> --- a/block/ioctl.c
>>> +++ b/block/ioctl.c
>>> @@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>>  		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>>  	case BLKMGMTZONE:
>>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>> +	case BLKZONEDEVPROP:
>>> +		return blkdev_zonedev_prop(bdev, mode, cmd, arg);
>>>  	case BLKGETZONESZ:
>>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>>>  	case BLKGETNRZONES:
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 5b95c81d2a2d..a32c909a915f 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -2254,6 +2254,7 @@ static const struct block_device_operations nvme_fops = {
>>>  	.getgeo		= nvme_getgeo,
>>>  	.revalidate_disk= nvme_revalidate_disk,
>>>  	.report_zones	= nvme_report_zones,
>>> +	.report_zone_p	= nvme_report_zone_prop,
>>>  	.pr_ops		= &nvme_pr_ops,
>>>  };
>>> @@ -2280,6 +2281,7 @@ const struct block_device_operations nvme_ns_head_ops = {
>>>  	.compat_ioctl	= nvme_compat_ioctl,
>>>  	.getgeo		= nvme_getgeo,
>>>  	.report_zones	= nvme_report_zones,
>>> +	.report_zone_p	= nvme_report_zone_prop,
>>>  	.pr_ops		= &nvme_pr_ops,
>>>  };
>>>  #endif /* CONFIG_NVME_MULTIPATH */
>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>> index ecf443efdf91..172e0531f37f 100644
>>> --- a/drivers/nvme/host/nvme.h
>>> +++ b/drivers/nvme/host/nvme.h
>>> @@ -407,6 +407,14 @@ struct nvme_ns {
>>>  	u8 pi_type;
>>>  #ifdef CONFIG_BLK_DEV_ZONED
>>>  	u64 zsze;
>>> +
>>> +	u32 nr_zones;
>>> +	u32 mar;
>>> +	u32 mor;
>>> +	u32 rrl;
>>> +	u32 frl;
>>> +	u16 zoc;
>>> +	u16 ozcs;
>>>  #endif
>>>  	unsigned long features;
>>>  	unsigned long flags;
>>> @@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>  int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>  		      unsigned int nr_zones, report_zones_cb cb, void *data);
>>> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop);
>>> +
>>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>>  				       struct nvme_command *cmnd,
>>>  				       enum nvme_zone_mgmt_action action);
>>>  #else
>>>  #define nvme_report_zones NULL
>>> +#define nvme_report_zone_prop NULL
>>>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>>>  		struct request *req, struct nvme_command *cmnd,
>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>> index 2e6512ac6f01..258d03610cc0 100644
>>> --- a/drivers/nvme/host/zns.c
>>> +++ b/drivers/nvme/host/zns.c
>>> @@ -32,6 +32,28 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>>>  	return 0;
>>>  }
>>> +static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
>>> +{
>>> +	struct nvme_command c = { };
>>> +	struct nvme_zone_report report;
>>> +	int buflen = sizeof(struct nvme_zone_report);
>>> +	int ret;
>>> +
>>> +	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
>>> +	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
>>> +	c.zmr.slba = cpu_to_le64(0);
>>> +	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
>>> +	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>>> +	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>> +	c.zmr.pr = 0;
>>> +
>>> +	ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	return le64_to_cpu(report.nr_zones);
>>> +}
>>> +
>>>  int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>  			  unsigned lbaf)
>>>  {
>>> @@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>  		goto free_data;
>>>  	}
>>> +	ns->nr_zones = nvme_zns_nr_zones(ns);
>>> +	ns->mar = le32_to_cpu(id->mar);
>>> +	ns->mor = le32_to_cpu(id->mor);
>>> +	ns->rrl = le32_to_cpu(id->rrl);
>>> +	ns->frl = le32_to_cpu(id->frl);
>>> +	ns->zoc = le16_to_cpu(id->zoc);
>>> +
>>>  	q->limits.zoned = BLK_ZONED_HM;
>>>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>>>  free_data:
>>> @@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>  	return ret;
>>>  }
>>> +static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct blk_zone_dev *zprop)
>>> +{
>>> +	zprop->nr_zones = ns->nr_zones;
>>> +	zprop->zoc = ns->zoc;
>>> +	zprop->ozcs = ns->ozcs;
>>> +	zprop->mar = ns->mar;
>>> +	zprop->mor = ns->mor;
>>> +	zprop->rrl = ns->rrl;
>>> +	zprop->frl = ns->frl;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop)
>>> +{
>>> +	struct nvme_ns_head *head = NULL;
>>> +	struct nvme_ns *ns;
>>> +	int srcu_idx, ret;
>>> +
>>> +	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>>> +	if (unlikely(!ns))
>>> +		return -EWOULDBLOCK;
>>> +
>>> +	if (ns->head->ids.csi == NVME_CSI_ZNS)
>>> +		ret = nvme_ns_report_zone_prop(ns, zprop);
>>> +	else
>>> +		ret = -EINVAL;
>>> +	nvme_put_ns_from_disk(head, srcu_idx);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>>  {
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 8308d8a3720b..0c0faa58b7f4 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				  unsigned int cmd, unsigned long arg);
>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				  unsigned int cmd, unsigned long arg);
>>> +extern int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>> +			unsigned int cmd, unsigned long arg);
>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>> @@ -400,6 +402,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>>  	return -ENOTTY;
>>>  }
>>> +static inline int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>> +				      unsigned int cmd, unsigned long arg)
>>> +{
>>> +	return -ENOTTY;
>>> +}
>>> +
>>>  #endif /* CONFIG_BLK_DEV_ZONED */
>>>  struct request_queue {
>>> @@ -1770,6 +1778,7 @@ struct block_device_operations {
>>>  	int (*report_zones)(struct gendisk *, sector_t sector,
>>>  			unsigned int nr_zones, report_zones_cb cb, void *data);
>>>  	char *(*devnode)(struct gendisk *disk, umode_t *mode);
>>> +	int (*report_zone_p)(struct gendisk *disk, struct blk_zone_dev *zprop);
>>>  	struct module *owner;
>>>  	const struct pr_ops *pr_ops;
>>>  };
>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>> index d0978ee10fc7..0c49a4b2ce5d 100644
>>> --- a/include/uapi/linux/blkzoned.h
>>> +++ b/include/uapi/linux/blkzoned.h
>>> @@ -142,6 +142,18 @@ struct blk_zone_range {
>>>  	__u64		nr_sectors;
>>>  };
>>> +struct blk_zone_dev {
>>> +	__u32	nr_zones;
>>> +	__u32	mar;
>>> +	__u32	mor;
>>> +	__u32	rrl;
>>> +	__u32	frl;
>>> +	__u16	zoc;
>>> +	__u16	ozcs;
>>> +	__u32	rsv31[2];
>>> +	__u64	rsv63[4];
>>> +};
>>> +
>>>  /**
>>>   * enum blk_zone_action - Zone state transitions managed from user-space
>>>   *
>>> @@ -209,5 +221,6 @@ struct blk_zone_mgmt {
>>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>>>  #define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>>> +#define BLKZONEDEVPROP	_IOR(0x12, 138, struct blk_zone_dev)
>>>  #endif /* _UAPI_BLKZONED_H */
>>
>> Nak. These properties can already be retrieved using the nvme ioctl
>> passthru command, and support has also been added to nvme-cli.
>>
> 
> These properties are intended to be consumed by an application, so
> nvme-cli is not of much use. I would also like to avoid sysfs variables.

Why not sysfs ? These are device properties, they can be defined as sysfs device
attributes. If there is an equivalent for ZBC/ZAC drives, you could even
consider defining them as queue attributes as long as you also patch sd.c, but
that may be pushing things too far.

In any case, sysfs seems a much better approach to me as that would be limited
to the NVMe driver rather than all this additional code in the block layer.
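
From the application side, consuming such attributes is cheap. A minimal
sketch, assuming one decimal value per attribute file; read_sysfs_u32()
is a helper invented for this example, and any attribute exposing MAR or
MOR would be a new, hypothetical addition.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success or a negative errno value.
 */
int read_sysfs_u32(const char *path, unsigned int *val)
{
	FILE *f;
	int ret = 0;

	assert(path && val);

	f = fopen(path, "r");
	if (!f)
		return -errno;

	if (fscanf(f, "%u", val) != 1)
		ret = -EINVAL;

	fclose(f);
	return ret;
}
```

For example, read_sysfs_u32("/sys/block/nvme0n1/queue/nr_zones", &n)
would fetch the zone count on a kernel that exposes that queue attribute;
the device name is only a placeholder.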

> 
> We can use nvme passthru, but this bypasses the zoned block abstraction.
> Why not represent ZNS features in the standard zoned block API? I am
> happy to iterate on the actual implementation if you have feedback.
> 
> Javier
> 
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-25 19:48     ` Javier González
@ 2020-06-26  1:14       ` Damien Le Moal
  2020-06-26  6:18         ` Javier González
  2020-06-26  9:11         ` hch
  0 siblings, 2 replies; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  1:14 UTC (permalink / raw)
  To: Javier González, Matias Bjørling
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 4:48, Javier González wrote:
> On 25.06.2020 16:12, Matias Bjørling wrote:
>> On 25/06/2020 14.21, Javier González wrote:
>>> From: Javier González <javier.gonz@samsung.com>
>>>
>>> Add support for offline transition on the zoned block device using the
>>> new zone management IOCTL
>>>
>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>>  block/blk-core.c              | 2 ++
>>>  block/blk-zoned.c             | 3 +++
>>>  drivers/nvme/host/core.c      | 3 +++
>>>  include/linux/blk_types.h     | 3 +++
>>>  include/linux/blkdev.h        | 1 -
>>>  include/uapi/linux/blkzoned.h | 1 +
>>>  6 files changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 03252af8c82c..589cbdacc5ec 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>>>  	REQ_OP_NAME(ZONE_CLOSE),
>>>  	REQ_OP_NAME(ZONE_FINISH),
>>>  	REQ_OP_NAME(ZONE_APPEND),
>>> +	REQ_OP_NAME(ZONE_OFFLINE),
>>>  	REQ_OP_NAME(WRITE_SAME),
>>>  	REQ_OP_NAME(WRITE_ZEROES),
>>>  	REQ_OP_NAME(SCSI_IN),
>>> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>>>  	case REQ_OP_ZONE_OPEN:
>>>  	case REQ_OP_ZONE_CLOSE:
>>>  	case REQ_OP_ZONE_FINISH:
>>> +	case REQ_OP_ZONE_OFFLINE:
>>>  		if (!blk_queue_is_zoned(q))
>>>  			goto not_supported;
>>>  		break;
>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>> index 29194388a1bb..704fc15813d1 100644
>>> --- a/block/blk-zoned.c
>>> +++ b/block/blk-zoned.c
>>> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  	case BLK_ZONE_MGMT_RESET:
>>>  		op = REQ_OP_ZONE_RESET;
>>>  		break;
>>> +	case BLK_ZONE_MGMT_OFFLINE:
>>> +		op = REQ_OP_ZONE_OFFLINE;
>>> +		break;
>>>  	default:
>>>  		return -ENOTTY;
>>>  	}
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index f1215523792b..5b95c81d2a2d 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>>  	case REQ_OP_ZONE_FINISH:
>>>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>>  		break;
>>> +	case REQ_OP_ZONE_OFFLINE:
>>> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
>>> +		break;
>>>  	case REQ_OP_WRITE_ZEROES:
>>>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>>>  		break;
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index 16b57fb2b99c..b3921263c3dd 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -316,6 +316,8 @@ enum req_opf {
>>>  	REQ_OP_ZONE_FINISH	= 12,
>>>  	/* write data at the current zone write pointer */
>>>  	REQ_OP_ZONE_APPEND	= 13,
>>> +	/* Transition a zone to offline */
>>> +	REQ_OP_ZONE_OFFLINE	= 14,
>>>  	/* SCSI passthrough using struct scsi_request */
>>>  	REQ_OP_SCSI_IN		= 32,
>>> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>>>  	case REQ_OP_ZONE_OPEN:
>>>  	case REQ_OP_ZONE_CLOSE:
>>>  	case REQ_OP_ZONE_FINISH:
>>> +	case REQ_OP_ZONE_OFFLINE:
>>>  		return true;
>>>  	default:
>>>  		return false;
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index bd8521f94dc4..8308d8a3720b 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				  unsigned int cmd, unsigned long arg);
>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				  unsigned int cmd, unsigned long arg);
>>> -
>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>> index a8c89fe58f97..d0978ee10fc7 100644
>>> --- a/include/uapi/linux/blkzoned.h
>>> +++ b/include/uapi/linux/blkzoned.h
>>> @@ -155,6 +155,7 @@ enum blk_zone_action {
>>>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>>>  };
>>>  /**
>>
>> I am not sure this makes sense to expose through the kernel zone API.
>> One of the goals of the kernel zone API is to be a layer that provides
>> a unified zone model across SMR HDDs and ZNS SSDs. The offline zone
>> operation, as defined in the ZNS specification, does not have an
>> equivalent in SMR HDDs (ZAC/ZBC).
>>
>> This is different from the Zone Capacity change, where the zone
>> capacity was simply the zone size for SMR HDDs, making it easy to
>> support. That is not the case here: ZAC/ZBC does not offer an offline
>> operation to transition zones in read-only state to offline state.
> 
> I agree that a unified interface is desirable. However, the truth is
> that ZAC/ZBC are different from ZNS, and will differ more and more as
> time goes by. We can deal with the differences at the driver level or with
> checks at the API level, but limiting ZNS to ZAC/ZBC is a hard constraint.

As long as a ZNS namespace reports itself as being "host-managed" like
ZBC/ZAC disks, we need the consistency and the common interface. If you break
that, the meaning of the zoned model queue attribute disappears and an
application or in-kernel user can no longer rely on this model to know how the
drive will behave.

> Note too that I chose to only support this particular transition on the
> new management IOCTL to avoid confusion for existing ZAC/ZBC users.
> 
> It would be good to clarify what the plan is for kernel APIs moving
> forward, as I believe there is a general desire to support new ZNS
> features, which will not necessarily be replicated in SMR drives.

What the drive is supposed to support and how it behaves is determined by the
zoned model. The ZNS standard was written so that most things have an equivalent
in ZBC/ZAC, e.g. the zone state machine is nearly identical. Differences are
either emulated (e.g. zone append SCSI emulation) or not supported (e.g. zone
capacity change), so that the kernel follows the same pattern and maintains a
coherent behavior between device protocols for the host-managed model.

Think of a file system, or any other in-kernel user. If they have to change
their code based on the device type (NVMe vs SCSI), then the zoned block device
interface is broken. Right now, that is not the case: everything works equally
well on ZNS and SCSI, modulo the need by a user for conventional zones, which
ZNS does not define. But that is still consistent with the host-managed model
since conventional zones are optional.

For this particular patch, there is currently no in-kernel user, and it is not
clear how this will be useful to applications. Please at least clarify this. And
most likely, as with optional operations such as discard, a sysfs attribute and
an in-kernel API indicating whether the drive supports offlining zones will be
needed. Otherwise, the caller will have to play with error codes to understand
whether the drive does not support the command or supports it but the command
failed. Not nice. Better to know before issuing the command.


> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 1/6] block: introduce IOCTL for zone mgmt
  2020-06-25 12:21 ` [PATCH 1/6] block: introduce IOCTL for zone mgmt Javier González
@ 2020-06-26  1:17   ` Damien Le Moal
  2020-06-26  6:01     ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  1:17 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/25 21:22, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
> 
> The current IOCTL interface for zone management is limited by struct
> blk_zone_range, which is unfortunately not extensible. In particular, the
> lack of flags is problematic for ZNS zoned devices.
> 
> This new IOCTL is designed to be a superset of the current one, with
> support for flags and bits for extensibility.
> 
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  block/blk-zoned.c             | 56 ++++++++++++++++++++++++++++++++++-
>  block/ioctl.c                 |  2 ++
>  include/linux/blkdev.h        |  9 ++++++
>  include/uapi/linux/blkzoned.h | 33 +++++++++++++++++++++
>  4 files changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 81152a260354..e87c60004dc5 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -322,7 +322,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>   * BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE ioctl processing.
>   * Called from blkdev_ioctl.
>   */
> -int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
> +int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>  			   unsigned int cmd, unsigned long arg)
>  {
>  	void __user *argp = (void __user *)arg;
> @@ -370,6 +370,60 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>  				GFP_KERNEL);
>  }
>  
> +/*
> + * Zone management ioctl processing. Extension of blkdev_zone_ops_ioctl(), with
> + * blk_zone_mgmt structure.
> + *
> + * Called from blkdev_ioctl.
> + */
> +int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
> +			   unsigned int cmd, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	struct request_queue *q;
> +	struct blk_zone_mgmt zmgmt;
> +	enum req_opf op;
> +
> +	if (!argp)
> +		return -EINVAL;
> +
> +	q = bdev_get_queue(bdev);
> +	if (!q)
> +		return -ENXIO;
> +
> +	if (!blk_queue_is_zoned(q))
> +		return -ENOTTY;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +
> +	if (!(mode & FMODE_WRITE))
> +		return -EBADF;
> +
> +	if (copy_from_user(&zmgmt, argp, sizeof(struct blk_zone_mgmt)))
> +		return -EFAULT;
> +
> +	switch (zmgmt.action) {
> +	case BLK_ZONE_MGMT_CLOSE:
> +		op = REQ_OP_ZONE_CLOSE;
> +		break;
> +	case BLK_ZONE_MGMT_FINISH:
> +		op = REQ_OP_ZONE_FINISH;
> +		break;
> +	case BLK_ZONE_MGMT_OPEN:
> +		op = REQ_OP_ZONE_OPEN;
> +		break;
> +	case BLK_ZONE_MGMT_RESET:
> +		op = REQ_OP_ZONE_RESET;
> +		break;
> +	default:
> +		return -ENOTTY;
> +	}
> +
> +	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
> +				GFP_KERNEL);
> +}
> +
>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>  						   unsigned int nr_zones)
>  {
> diff --git a/block/ioctl.c b/block/ioctl.c
> index bdb3bbb253d9..0ea29754e7dd 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -514,6 +514,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>  	case BLKOPENZONE:
>  	case BLKCLOSEZONE:
>  	case BLKFINISHZONE:
> +		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
> +	case BLKMGMTZONE:
>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>  	case BLKGETZONESZ:
>  		return put_uint(argp, bdev_zone_sectors(bdev));
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8fd900998b4e..bd8521f94dc4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -368,6 +368,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>  
>  extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>  				     unsigned int cmd, unsigned long arg);
> +extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
> +				  unsigned int cmd, unsigned long arg);
>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>  				  unsigned int cmd, unsigned long arg);
>  
> @@ -385,6 +387,13 @@ static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
>  	return -ENOTTY;
>  }
>  
> +
> +static inline int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
> +					unsigned int cmd, unsigned long arg)
> +{
> +	return -ENOTTY;
> +}
> +
>  static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>  					 fmode_t mode, unsigned int cmd,
>  					 unsigned long arg)
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index 42c3366cc25f..07b5fde21d9f 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -142,6 +142,38 @@ struct blk_zone_range {
>  	__u64		nr_sectors;
>  };
>  
> +/**
> + * enum blk_zone_action - Zone state transitions managed from user-space
> + *
> + * @BLK_ZONE_MGMT_CLOSE: Transition to Closed state
> + * @BLK_ZONE_MGMT_FINISH: Transition to Finish state
> + * @BLK_ZONE_MGMT_OPEN: Transition to Open state
> + * @BLK_ZONE_MGMT_RESET: Transition to Reset state
> + */
> +enum blk_zone_action {
> +	BLK_ZONE_MGMT_CLOSE	= 0x1,
> +	BLK_ZONE_MGMT_FINISH	= 0x2,
> +	BLK_ZONE_MGMT_OPEN	= 0x3,
> +	BLK_ZONE_MGMT_RESET	= 0x4,
> +};
> +
> +/**
> + * struct blk_zone_mgmt - Extended zoned management
> + *
> + * @action: Zone action as in described in enum blk_zone_action
> + * @flags: Flags for the action
> + * @sector: Starting sector of the first zone to operate on
> + * @nr_sectors: Total number of sectors of all zones to operate on
> + */
> +struct blk_zone_mgmt {
> +	__u8		action;
> +	__u8		resv3[3];
> +	__u32		flags;
> +	__u64		sector;
> +	__u64		nr_sectors;
> +	__u64		resv31;
> +};
> +
>  /**
>   * Zoned block device ioctl's:
>   *
> @@ -166,5 +198,6 @@ struct blk_zone_range {
>  #define BLKOPENZONE	_IOW(0x12, 134, struct blk_zone_range)
>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
> +#define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>  
>  #endif /* _UAPI_BLKZONED_H */
> 

Without defining what the flags can be, it is hard to judge what will change
from the current distinct ioctls.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/6] block: add support for selecting all zones
  2020-06-25 12:21 ` [PATCH 2/6] block: add support for selecting all zones Javier González
@ 2020-06-26  1:27   ` Damien Le Moal
  2020-06-26  5:58     ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  1:27 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/25 21:22, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
> 
> Add flag to allow selecting all zones on a single zone management
> operation
> 
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  block/blk-zoned.c             | 3 +++
>  include/linux/blk_types.h     | 3 ++-
>  include/uapi/linux/blkzoned.h | 9 +++++++++
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index e87c60004dc5..29194388a1bb 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -420,6 +420,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>  		return -ENOTTY;
>  	}
>  
> +	if (zmgmt.flags & BLK_ZONE_SELECT_ALL)
> +		op |= REQ_ZONE_ALL;
> +
>  	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>  				GFP_KERNEL);
>  }
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index ccb895f911b1..16b57fb2b99c 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -351,6 +351,7 @@ enum req_flag_bits {
>  	 * work item to avoid such priority inversions.
>  	 */
>  	__REQ_CGROUP_PUNT,
> +	__REQ_ZONE_ALL,		/* apply zone operation to all zones */
>  
>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
> @@ -378,7 +379,7 @@ enum req_flag_bits {
>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
>  #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
> -
> +#define REQ_ZONE_ALL		(1ULL << __REQ_ZONE_ALL)
>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
>  
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index 07b5fde21d9f..a8c89fe58f97 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -157,6 +157,15 @@ enum blk_zone_action {
>  	BLK_ZONE_MGMT_RESET	= 0x4,
>  };
>  
> +/**
> + * enum blk_zone_mgmt_flags - Flags for blk_zone_mgmt
> + *
> + * BLK_ZONE_SELECT_ALL: Select all zones for current zone action
> + */
> +enum blk_zone_mgmt_flags {
> +	BLK_ZONE_SELECT_ALL	= 1 << 0,
> +};
> +
>  /**
>   * struct blk_zone_mgmt - Extended zoned management
>   *
> 

NACK.

Details:
1) REQ_OP_ZONE_RESET together with REQ_ZONE_ALL is the same as
REQ_OP_ZONE_RESET_ALL, isn't it ? You are duplicating a functionality that
already exists.
2) The patch introduces REQ_ZONE_ALL at the block layer only without defining
how it ties into SCSI and NVMe driver use of it. Is REQ_ZONE_ALL indicating that
the zone management commands are to be executed with the ALL bit set ? If yes,
that will break device-mapper. See the special code for handling
REQ_OP_ZONE_RESET_ALL. That code is in place for a reason: the target block
device may not be an entire physical device. In that case, applying a zone
management command to all zones of the physical drive is wrong.
3) REQ_ZONE_ALL seems completely equivalent to specifying a sector range of
[0 .. drive capacity]. So what is the point ? The current interface handles that.
That is how we choose between REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL right now.
4) Without any in-kernel user, I do not see the point. And for applications, I
do not see any good use case for doing open all, close all, offline all or
finish all. If you have any such good use case, please elaborate.
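
To make point 3 concrete, here is a minimal user-space model of the choice between a per-zone reset and a reset-all. It is only a sketch: struct zoned_dev_model, enum reset_kind and pick_reset_kind() are hypothetical names made up for illustration, and the logic is assumed to mirror the whole-range check that sits next to blkdev_allow_reset_all_zones() in block/blk-zoned.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few device properties this check needs. */
struct zoned_dev_model {
	uint64_t capacity_sectors;   /* size of this block device, in 512-byte sectors */
	bool     supports_reset_all; /* models QUEUE_FLAG_ZONE_RESETALL on the queue */
};

enum reset_kind { RESET_PER_ZONE, RESET_ALL };

/*
 * Pick the "reset all" form only when the queue advertises support and the
 * requested range covers the whole block device; otherwise reset per zone.
 */
static enum reset_kind pick_reset_kind(const struct zoned_dev_model *dev,
				       uint64_t sector, uint64_t nr_sectors)
{
	if (!dev->supports_reset_all)
		return RESET_PER_ZONE;
	if (sector == 0 && nr_sectors == dev->capacity_sectors)
		return RESET_ALL;
	return RESET_PER_ZONE;
}
```

The range is compared against the capacity of the block device the request was issued on, so a caller gets the "all" behaviour by passing the full range and no extra flag is involved.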


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-25 12:21 ` [PATCH 3/6] block: add support for zone offline transition Javier González
  2020-06-25 14:12   ` Matias Bjørling
@ 2020-06-26  1:34   ` Damien Le Moal
  2020-06-26  6:08     ` Javier González
  1 sibling, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  1:34 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/25 21:22, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
> 
> Add support for offline transition on the zoned block device using the
> new zone management IOCTL
> 
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  block/blk-core.c              | 2 ++
>  block/blk-zoned.c             | 3 +++
>  drivers/nvme/host/core.c      | 3 +++
>  include/linux/blk_types.h     | 3 +++
>  include/linux/blkdev.h        | 1 -
>  include/uapi/linux/blkzoned.h | 1 +
>  6 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 03252af8c82c..589cbdacc5ec 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>  	REQ_OP_NAME(ZONE_CLOSE),
>  	REQ_OP_NAME(ZONE_FINISH),
>  	REQ_OP_NAME(ZONE_APPEND),
> +	REQ_OP_NAME(ZONE_OFFLINE),
>  	REQ_OP_NAME(WRITE_SAME),
>  	REQ_OP_NAME(WRITE_ZEROES),
>  	REQ_OP_NAME(SCSI_IN),
> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>  	case REQ_OP_ZONE_OPEN:
>  	case REQ_OP_ZONE_CLOSE:
>  	case REQ_OP_ZONE_FINISH:
> +	case REQ_OP_ZONE_OFFLINE:
>  		if (!blk_queue_is_zoned(q))
>  			goto not_supported;
>  		break;
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 29194388a1bb..704fc15813d1 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>  	case BLK_ZONE_MGMT_RESET:
>  		op = REQ_OP_ZONE_RESET;
>  		break;
> +	case BLK_ZONE_MGMT_OFFLINE:
> +		op = REQ_OP_ZONE_OFFLINE;
> +		break;
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f1215523792b..5b95c81d2a2d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>  	case REQ_OP_ZONE_FINISH:
>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>  		break;
> +	case REQ_OP_ZONE_OFFLINE:
> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
> +		break;
>  	case REQ_OP_WRITE_ZEROES:
>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>  		break;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 16b57fb2b99c..b3921263c3dd 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -316,6 +316,8 @@ enum req_opf {
>  	REQ_OP_ZONE_FINISH	= 12,
>  	/* write data at the current zone write pointer */
>  	REQ_OP_ZONE_APPEND	= 13,
> +	/* Transition a zone to offline */
> +	REQ_OP_ZONE_OFFLINE	= 14,
>  
>  	/* SCSI passthrough using struct scsi_request */
>  	REQ_OP_SCSI_IN		= 32,
> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>  	case REQ_OP_ZONE_OPEN:
>  	case REQ_OP_ZONE_CLOSE:
>  	case REQ_OP_ZONE_FINISH:
> +	case REQ_OP_ZONE_OFFLINE:
>  		return true;
>  	default:
>  		return false;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bd8521f94dc4..8308d8a3720b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>  				  unsigned int cmd, unsigned long arg);
>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>  				  unsigned int cmd, unsigned long arg);
> -
>  #else /* CONFIG_BLK_DEV_ZONED */
>  
>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index a8c89fe58f97..d0978ee10fc7 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -155,6 +155,7 @@ enum blk_zone_action {
>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>  	BLK_ZONE_MGMT_RESET	= 0x4,
> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>  };
>  
>  /**
> 

As mentioned in my previous email, the usefulness of this is dubious. Please
elaborate in the commit message. Sure, NVMe ZNS defines this and we can support
it. But without a good use case, what is the point ?

The SCSI sd driver will return BLK_STS_NOTSUPP if this offlining is sent to a
ZBC/ZAC drive. Not nice. Having a sysfs attribute "max_offline_zone_sectors" or
the like to indicate whether the device supports it would be nicer.

Does offlining ALL zones make any sense ? Because this patch does not prevent the
use of the REQ_ZONE_ALL flag introduced in patch 2. Probably not a good idea to
allow offlining all zones, no ?
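
For illustration only, a check along the following lines would refuse the offline + select-all combination before any request is built. The constant values are copied from the enums proposed in patches 1-3 of this series (they are not upstream UAPI), and zmgmt_check() is a hypothetical helper. Rejecting unknown flag bits with -EINVAL is an extra assumption, not something the patches do.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values as proposed in this series; defined locally, not from a header. */
enum {
	BLK_ZONE_MGMT_CLOSE   = 0x1,
	BLK_ZONE_MGMT_FINISH  = 0x2,
	BLK_ZONE_MGMT_OPEN    = 0x3,
	BLK_ZONE_MGMT_RESET   = 0x4,
	BLK_ZONE_MGMT_OFFLINE = 0x5,
};
#define BLK_ZONE_SELECT_ALL (1u << 0)

/* Return 0 if the action/flags pair is acceptable, a negative errno if not. */
static int zmgmt_check(uint8_t action, uint32_t flags)
{
	/* Assumption: flag bits nobody has defined yet are rejected. */
	if (flags & ~BLK_ZONE_SELECT_ALL)
		return -EINVAL;

	switch (action) {
	case BLK_ZONE_MGMT_CLOSE:
	case BLK_ZONE_MGMT_FINISH:
	case BLK_ZONE_MGMT_OPEN:
	case BLK_ZONE_MGMT_RESET:
		return 0;
	case BLK_ZONE_MGMT_OFFLINE:
		/* Offlining every zone in one call is refused outright. */
		return (flags & BLK_ZONE_SELECT_ALL) ? -EINVAL : 0;
	default:
		return -ENOTTY; /* same error the patch uses for unknown actions */
	}
}
```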

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 12:21 ` [PATCH 4/6] block: introduce IOCTL to report dev properties Javier González
  2020-06-25 13:10   ` Matias Bjørling
@ 2020-06-26  1:38   ` Damien Le Moal
  2020-06-26  6:22     ` Javier González
  1 sibling, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  1:38 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/25 21:22, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
> 
> With the addition of ZNS, a new set of properties have been added to the
> zoned block device. This patch introduces a new IOCTL to expose these
> properties to user space.
> 
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
>  block/ioctl.c                 |  2 ++
>  drivers/nvme/host/core.c      |  2 ++
>  drivers/nvme/host/nvme.h      | 11 +++++++
>  drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
>  include/linux/blkdev.h        |  9 ++++++
>  include/uapi/linux/blkzoned.h | 13 ++++++++
>  7 files changed, 144 insertions(+)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 704fc15813d1..39ec72af9537 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>  }
>  EXPORT_SYMBOL_GPL(blkdev_report_zones);
>  
> +static int blkdev_report_zonedev_prop(struct block_device *bdev,
> +				      struct blk_zone_dev *zprop)
> +{
> +	struct gendisk *disk = bdev->bd_disk;
> +
> +	if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
> +		return -EOPNOTSUPP;
> +
> +	return disk->fops->report_zone_p(disk, zprop);
> +}
> +
>  static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
>  						sector_t sector,
>  						sector_t nr_sectors)
> @@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>  				GFP_KERNEL);
>  }
>  
> +int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
> +			unsigned int cmd, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	struct request_queue *q;
> +	struct blk_zone_dev zprop;
> +	int ret;
> +
> +	if (!argp)
> +		return -EINVAL;
> +
> +	q = bdev_get_queue(bdev);
> +	if (!q)
> +		return -ENXIO;
> +
> +	if (!blk_queue_is_zoned(q))
> +		return -ENOTTY;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +
> +	if (!(mode & FMODE_WRITE))
> +		return -EBADF;
> +
> +	ret = blkdev_report_zonedev_prop(bdev, &zprop);
> +	if (ret)
> +		goto out;
> +
> +	if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
> +		return -EFAULT;
> +
> +out:
> +	return ret;
> +}
> +
>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>  						   unsigned int nr_zones)
>  {
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 0ea29754e7dd..f7b4e0f2dd4c 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>  		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>  	case BLKMGMTZONE:
>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
> +	case BLKZONEDEVPROP:
> +		return blkdev_zonedev_prop(bdev, mode, cmd, arg);
>  	case BLKGETZONESZ:
>  		return put_uint(argp, bdev_zone_sectors(bdev));
>  	case BLKGETNRZONES:
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 5b95c81d2a2d..a32c909a915f 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2254,6 +2254,7 @@ static const struct block_device_operations nvme_fops = {
>  	.getgeo		= nvme_getgeo,
>  	.revalidate_disk= nvme_revalidate_disk,
>  	.report_zones	= nvme_report_zones,
> +	.report_zone_p	= nvme_report_zone_prop,
>  	.pr_ops		= &nvme_pr_ops,
>  };
>  
> @@ -2280,6 +2281,7 @@ const struct block_device_operations nvme_ns_head_ops = {
>  	.compat_ioctl	= nvme_compat_ioctl,
>  	.getgeo		= nvme_getgeo,
>  	.report_zones	= nvme_report_zones,
> +	.report_zone_p	= nvme_report_zone_prop,
>  	.pr_ops		= &nvme_pr_ops,
>  };
>  #endif /* CONFIG_NVME_MULTIPATH */
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index ecf443efdf91..172e0531f37f 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -407,6 +407,14 @@ struct nvme_ns {
>  	u8 pi_type;
>  #ifdef CONFIG_BLK_DEV_ZONED
>  	u64 zsze;
> +
> +	u32 nr_zones;
> +	u32 mar;
> +	u32 mor;
> +	u32 rrl;
> +	u32 frl;
> +	u16 zoc;
> +	u16 ozcs;
>  #endif
>  	unsigned long features;
>  	unsigned long flags;
> @@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>  int nvme_report_zones(struct gendisk *disk, sector_t sector,
>  		      unsigned int nr_zones, report_zones_cb cb, void *data);
>  
> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop);
> +
>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>  				       struct nvme_command *cmnd,
>  				       enum nvme_zone_mgmt_action action);
>  #else
>  #define nvme_report_zones NULL
> +#define nvme_report_zone_prop NULL
>  
>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>  		struct request *req, struct nvme_command *cmnd,
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 2e6512ac6f01..258d03610cc0 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -32,6 +32,28 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>  	return 0;
>  }
>  
> +static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
> +{
> +	struct nvme_command c = { };
> +	struct nvme_zone_report report;
> +	int buflen = sizeof(struct nvme_zone_report);
> +	int ret;
> +
> +	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
> +	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
> +	c.zmr.slba = cpu_to_le64(0);
> +	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
> +	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
> +	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
> +	c.zmr.pr = 0;
> +
> +	ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
> +	if (ret)
> +		return ret;
> +
> +	return le64_to_cpu(report.nr_zones);
> +}
> +
>  int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>  			  unsigned lbaf)
>  {
> @@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>  		goto free_data;
>  	}
>  
> +	ns->nr_zones = nvme_zns_nr_zones(ns);
> +	ns->mar = le32_to_cpu(id->mar);
> +	ns->mor = le32_to_cpu(id->mor);
> +	ns->rrl = le32_to_cpu(id->rrl);
> +	ns->frl = le32_to_cpu(id->frl);
> +	ns->zoc = le16_to_cpu(id->zoc);
> +
>  	q->limits.zoned = BLK_ZONED_HM;
>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>  free_data:
> @@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
>  	return ret;
>  }
>  
> +static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct blk_zone_dev *zprop)
> +{
> +	zprop->nr_zones = ns->nr_zones;
> +	zprop->zoc = ns->zoc;
> +	zprop->ozcs = ns->ozcs;
> +	zprop->mar = ns->mar;
> +	zprop->mor = ns->mor;
> +	zprop->rrl = ns->rrl;
> +	zprop->frl = ns->frl;
> +
> +	return 0;
> +}
> +
> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop)
> +{
> +	struct nvme_ns_head *head = NULL;
> +	struct nvme_ns *ns;
> +	int srcu_idx, ret;
> +
> +	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
> +	if (unlikely(!ns))
> +		return -EWOULDBLOCK;
> +
> +	if (ns->head->ids.csi == NVME_CSI_ZNS)
> +		ret = nvme_ns_report_zone_prop(ns, zprop);
> +	else
> +		ret = -EINVAL;
> +	nvme_put_ns_from_disk(head, srcu_idx);
> +
> +	return ret;
> +}
> +
>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>  {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8308d8a3720b..0c0faa58b7f4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>  				  unsigned int cmd, unsigned long arg);
>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>  				  unsigned int cmd, unsigned long arg);
> +extern int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
> +			unsigned int cmd, unsigned long arg);
>  #else /* CONFIG_BLK_DEV_ZONED */
>  
>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
> @@ -400,6 +402,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>  	return -ENOTTY;
>  }
>  
> +static inline int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
> +				      unsigned int cmd, unsigned long arg)
> +{
> +	return -ENOTTY;
> +}
> +
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  
>  struct request_queue {
> @@ -1770,6 +1778,7 @@ struct block_device_operations {
>  	int (*report_zones)(struct gendisk *, sector_t sector,
>  			unsigned int nr_zones, report_zones_cb cb, void *data);
>  	char *(*devnode)(struct gendisk *disk, umode_t *mode);
> +	int (*report_zone_p)(struct gendisk *disk, struct blk_zone_dev *zprop);
>  	struct module *owner;
>  	const struct pr_ops *pr_ops;
>  };
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index d0978ee10fc7..0c49a4b2ce5d 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -142,6 +142,18 @@ struct blk_zone_range {
>  	__u64		nr_sectors;
>  };
>  
> +struct blk_zone_dev {
> +	__u32	nr_zones;
> +	__u32	mar;
> +	__u32	mor;
> +	__u32	rrl;
> +	__u32	frl;
> +	__u16	zoc;
> +	__u16	ozcs;
> +	__u32	rsv31[2];
> +	__u64	rsv63[4];
> +};
> +
>  /**
>   * enum blk_zone_action - Zone state transitions managed from user-space
>   *
> @@ -209,5 +221,6 @@ struct blk_zone_mgmt {
>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>  #define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
> +#define BLKZONEDEVPROP	_IOR(0x12, 138, struct blk_zone_dev)
>  
>  #endif /* _UAPI_BLKZONED_H */
> 

As commented already, NVMe passthrough or sysfs device attributes would be much
better. See SCSI: there is no IOCTL defined to obtain every single log page or
mode page. Passthrough is the interface to do that. For frequently used
log pages giving device information, sysfs is used as a cache. See all the
"vpd_pgXX" entries under /sys/block/sdX/device. All of this is done by the
driver, not the block layer.
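
From the application side, the sysfs route needs nothing more than a small file read. A minimal sketch, assuming the attribute holds a single decimal value; read_sysfs_u64() is a hypothetical helper name:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 and stores the value in *out, or a negative errno on failure.
 */
static int read_sysfs_u64(const char *path, unsigned long long *out)
{
	char buf[64];
	char *end;
	FILE *f = fopen(path, "r");

	if (!f)
		return -errno;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	errno = 0;
	*out = strtoull(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL; /* overflow, or no digits at all */
	return 0;
}
```

For example, read_sysfs_u64("/sys/block/nvme0n1/queue/nr_zones", &n) would fetch the zone count the block layer already exports (the device name is only an example). ZNS values such as mar or mor could be exposed by the driver in the same way, but any attribute names for them would be assumptions at this point.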

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-25 12:21 ` [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct Javier González
  2020-06-25 15:13   ` Matias Bjørling
@ 2020-06-26  1:45   ` Damien Le Moal
  2020-06-26  6:03     ` Javier González
  2020-06-26  9:14   ` Christoph Hellwig
  2 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  1:45 UTC (permalink / raw)
  To: Javier González, linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Javier González,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/25 21:22, Javier González wrote:
> From: Javier González <javier.gonz@samsung.com>
> 
> Add zone attributes field to the blk_zone structure. Use ZNS attributes
> as base for zoned block devices in general.
> 
> Signed-off-by: Javier González <javier.gonz@samsung.com>
> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
>  drivers/nvme/host/zns.c       |  1 +
>  include/uapi/linux/blkzoned.h | 13 ++++++++++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 258d03610cc0..7d8381fe7665 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -195,6 +195,7 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>  	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
> +	zone.attr = entry->za;
>  
>  	return cb(&zone, idx, data);
>  }
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index 0c49a4b2ce5d..2e43a00e3425 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -82,6 +82,16 @@ enum blk_zone_report_flags {
>  	BLK_ZONE_REP_CAPACITY	= (1 << 0),
>  };
>  
> +/**
> + * Zone Attributes

This is a user interface file. Please document the meaning of each attribute.

> + */
> +enum blk_zone_attr {
> +	BLK_ZONE_ATTR_ZFC	= 1 << 0,
> +	BLK_ZONE_ATTR_FZR	= 1 << 1,
> +	BLK_ZONE_ATTR_RZR	= 1 << 2,
> +	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
These are ZNS specific, right ? Integrating the 2 ZBC/ZAC attributes in this
list would be nice, namely non_seq and reset. That will imply patching sd.c.

> +};
> +
>  /**
>   * struct blk_zone - Zone descriptor for BLKREPORTZONE ioctl.
>   *
> @@ -108,7 +118,8 @@ struct blk_zone {
>  	__u8	cond;		/* Zone condition */
>  	__u8	non_seq;	/* Non-sequential write resources active */
>  	__u8	reset;		/* Reset write pointer recommended */
> -	__u8	resv[4];
> +	__u8	attr;		/* Zone attributes */
> +	__u8	resv[3];
>  	__u64	capacity;	/* Zone capacity in number of sectors */
>  	__u8	reserved[24];
>  };
> 

You are missing a BLK_ZONE_REP_ATTR report flag to indicate to the user that the
attr field is present, used and valid.

enum blk_zone_report_flags {
 	BLK_ZONE_REP_CAPACITY	= (1 << 0),
+	BLK_ZONE_REP_ATTR	= (1 << 1),
 };

is needed, I think.
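
A sketch of how an application would be expected to use such a flag: look at the report flags first, and only then trust the per-zone field. REP_ATTR below stands for the BLK_ZONE_REP_ATTR bit suggested above, which does not exist upstream. struct zone_view and zone_attr_get() are hypothetical names, and the struct is a cut-down local mirror of struct blk_zone rather than the UAPI type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REP_CAPACITY (1u << 0) /* mirrors BLK_ZONE_REP_CAPACITY */
#define REP_ATTR     (1u << 1) /* mirrors the suggested BLK_ZONE_REP_ATTR */

/* Cut-down local view of one zone descriptor. */
struct zone_view {
	uint64_t len;      /* zone length in sectors */
	uint64_t capacity; /* zone capacity in sectors */
	uint8_t  attr;     /* zone attributes, meaningful only with REP_ATTR */
};

/*
 * Copy the attribute byte out only when the report said it is valid.
 * Returns false (leaving *attr untouched) when the driver did not fill it in.
 */
static bool zone_attr_get(uint32_t report_flags, const struct zone_view *z,
			  uint8_t *attr)
{
	if (!(report_flags & REP_ATTR))
		return false;
	*attr = z->attr;
	return true;
}
```

Without the flag, an attr value of 0 from a driver that never sets the field cannot be told apart from a zone that genuinely has no attributes set.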

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/6] block: add support for selecting all zones
  2020-06-26  1:27   ` Damien Le Moal
@ 2020-06-26  5:58     ` Javier González
  2020-06-26  6:35       ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  5:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 01:27, Damien Le Moal wrote:
>On 2020/06/25 21:22, Javier González wrote:
>> From: Javier González <javier.gonz@samsung.com>
>>
>> Add flag to allow selecting all zones on a single zone management
>> operation
>>
>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  block/blk-zoned.c             | 3 +++
>>  include/linux/blk_types.h     | 3 ++-
>>  include/uapi/linux/blkzoned.h | 9 +++++++++
>>  3 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index e87c60004dc5..29194388a1bb 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -420,6 +420,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  		return -ENOTTY;
>>  	}
>>
>> +	if (zmgmt.flags & BLK_ZONE_SELECT_ALL)
>> +		op |= REQ_ZONE_ALL;
>> +
>>  	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>>  				GFP_KERNEL);
>>  }
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index ccb895f911b1..16b57fb2b99c 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -351,6 +351,7 @@ enum req_flag_bits {
>>  	 * work item to avoid such priority inversions.
>>  	 */
>>  	__REQ_CGROUP_PUNT,
>> +	__REQ_ZONE_ALL,		/* apply zone operation to all zones */
>>
>>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
>> @@ -378,7 +379,7 @@ enum req_flag_bits {
>>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
>>  #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
>> -
>> +#define REQ_ZONE_ALL		(1ULL << __REQ_ZONE_ALL)
>>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
>>
>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>> index 07b5fde21d9f..a8c89fe58f97 100644
>> --- a/include/uapi/linux/blkzoned.h
>> +++ b/include/uapi/linux/blkzoned.h
>> @@ -157,6 +157,15 @@ enum blk_zone_action {
>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>  };
>>
>> +/**
>> + * enum blk_zone_mgmt_flags - Flags for blk_zone_mgmt
>> + *
>> + * BLK_ZONE_SELECT_ALL: Select all zones for current zone action
>> + */
>> +enum blk_zone_mgmt_flags {
>> +	BLK_ZONE_SELECT_ALL	= 1 << 0,
>> +};
>> +
>>  /**
>>   * struct blk_zone_mgmt - Extended zoned management
>>   *
>>
>
>NACK.
>
>Details:
>1) REQ_OP_ZONE_RESET together with REQ_ZONE_ALL is the same as
>REQ_OP_ZONE_RESET_ALL, isn't it ? You are duplicating a functionality that
>already exists.
>2) The patch introduces REQ_ZONE_ALL at the block layer only without defining
>how it ties into SCSI and NVMe driver use of it. Is REQ_ZONE_ALL indicating that
>the zone management commands are to be executed with the ALL bit set ? If yes,
>that will break device-mapper. See the special code for handling
>REQ_OP_ZONE_RESET_ALL. That code is in place for a reason: the target block
>device may not be an entire physical device. In that case, applying a zone
>management command to all zones of the physical drive is wrong.
>3) REQ_ZONE_ALL seems completely equivalent to specifying a sector range of [0
>.. drive capacity]. So what is the point ? The current interface handles that.
>That is how we chose between REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL right now.
>4) Without any in-kernel user, I do not see the point. And for applications, I
>do not see any good use case for doing open all, close all, offline all or
>finish all. If you have any such good use case, please elaborate.
>

The main use is reset all, but without having to loop through all zones,
as that imposes an overhead when we have a large number of zones. Having
the possibility to offload it to HW is more efficient.

I had not thought about the device mapper use case. Would it be an
option to translate this into REQ_OP_ZONE_RESET_ALL when we have a
device mapper (or any other case where this might break) and otherwise let
the bit go to the driver if it applies to the whole device?
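
For reference, this is roughly what a whole-device reset looks like from user space with the existing BLKRESETZONE ioctl, with no per-zone loop in the application. It is a minimal sketch: whole_device_range() and reset_all_zones() are hypothetical helper names, and whether the kernel turns the request into a single reset-all command depends on the kernel version and on the device.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* BLKGETSIZE64 */
#include <linux/blkzoned.h> /* BLKRESETZONE, struct blk_zone_range */

/* Build a range covering the whole device, in 512-byte sectors. */
static struct blk_zone_range whole_device_range(uint64_t size_bytes)
{
	struct blk_zone_range r = { .sector = 0, .nr_sectors = size_bytes >> 9 };

	return r;
}

/* Reset every zone of the device at devpath. Returns 0 or a negative errno. */
static int reset_all_zones(const char *devpath)
{
	struct blk_zone_range range;
	uint64_t size_bytes;
	int ret = 0;
	int fd = open(devpath, O_RDWR);

	if (fd < 0)
		return -errno;
	if (ioctl(fd, BLKGETSIZE64, &size_bytes) < 0) {
		ret = -errno;
		goto out;
	}
	range = whole_device_range(size_bytes);
	if (ioctl(fd, BLKRESETZONE, &range) < 0)
		ret = -errno;
out:
	close(fd);
	return ret;
}
```

A caller would use it as reset_all_zones("/dev/nvme0n1"), where the device path is only an example.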

Javier

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 1/6] block: introduce IOCTL for zone mgmt
  2020-06-26  1:17   ` Damien Le Moal
@ 2020-06-26  6:01     ` Javier González
  2020-06-26  6:37       ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:01 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 01:17, Damien Le Moal wrote:
>On 2020/06/25 21:22, Javier González wrote:
>> From: Javier González <javier.gonz@samsung.com>
>>
>> The current IOCTL interface for zone management is limited by struct
>> blk_zone_range, which is unfortunately not extensible. Specially, the
>> lack of flags is problematic for ZNS zoned devices.
>>
>> This new IOCTL is designed to be a superset of the current one, with
>> support for flags and bits for extensibility.
>>
>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  block/blk-zoned.c             | 56 ++++++++++++++++++++++++++++++++++-
>>  block/ioctl.c                 |  2 ++
>>  include/linux/blkdev.h        |  9 ++++++
>>  include/uapi/linux/blkzoned.h | 33 +++++++++++++++++++++
>>  4 files changed, 99 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 81152a260354..e87c60004dc5 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -322,7 +322,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>   * BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE ioctl processing.
>>   * Called from blkdev_ioctl.
>>   */
>> -int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>> +int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>  			   unsigned int cmd, unsigned long arg)
>>  {
>>  	void __user *argp = (void __user *)arg;
>> @@ -370,6 +370,60 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  				GFP_KERNEL);
>>  }
>>
>> +/*
>> + * Zone management ioctl processing. Extension of blkdev_zone_ops_ioctl(), with
>> + * blk_zone_mgmt structure.
>> + *
>> + * Called from blkdev_ioctl.
>> + */
>> +int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>> +			   unsigned int cmd, unsigned long arg)
>> +{
>> +	void __user *argp = (void __user *)arg;
>> +	struct request_queue *q;
>> +	struct blk_zone_mgmt zmgmt;
>> +	enum req_opf op;
>> +
>> +	if (!argp)
>> +		return -EINVAL;
>> +
>> +	q = bdev_get_queue(bdev);
>> +	if (!q)
>> +		return -ENXIO;
>> +
>> +	if (!blk_queue_is_zoned(q))
>> +		return -ENOTTY;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EACCES;
>> +
>> +	if (!(mode & FMODE_WRITE))
>> +		return -EBADF;
>> +
>> +	if (copy_from_user(&zmgmt, argp, sizeof(struct blk_zone_mgmt)))
>> +		return -EFAULT;
>> +
>> +	switch (zmgmt.action) {
>> +	case BLK_ZONE_MGMT_CLOSE:
>> +		op = REQ_OP_ZONE_CLOSE;
>> +		break;
>> +	case BLK_ZONE_MGMT_FINISH:
>> +		op = REQ_OP_ZONE_FINISH;
>> +		break;
>> +	case BLK_ZONE_MGMT_OPEN:
>> +		op = REQ_OP_ZONE_OPEN;
>> +		break;
>> +	case BLK_ZONE_MGMT_RESET:
>> +		op = REQ_OP_ZONE_RESET;
>> +		break;
>> +	default:
>> +		return -ENOTTY;
>> +	}
>> +
>> +	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>> +				GFP_KERNEL);
>> +}
>> +
>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>  						   unsigned int nr_zones)
>>  {
>> diff --git a/block/ioctl.c b/block/ioctl.c
>> index bdb3bbb253d9..0ea29754e7dd 100644
>> --- a/block/ioctl.c
>> +++ b/block/ioctl.c
>> @@ -514,6 +514,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>  	case BLKOPENZONE:
>>  	case BLKCLOSEZONE:
>>  	case BLKFINISHZONE:
>> +		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>> +	case BLKMGMTZONE:
>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>  	case BLKGETZONESZ:
>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 8fd900998b4e..bd8521f94dc4 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -368,6 +368,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>
>>  extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>  				     unsigned int cmd, unsigned long arg);
>> +extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>> +				  unsigned int cmd, unsigned long arg);
>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>>
>> @@ -385,6 +387,13 @@ static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
>>  	return -ENOTTY;
>>  }
>>
>> +
>> +static inline int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>> +					unsigned int cmd, unsigned long arg)
>> +{
>> +	return -ENOTTY;
>> +}
>> +
>>  static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>  					 fmode_t mode, unsigned int cmd,
>>  					 unsigned long arg)
>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>> index 42c3366cc25f..07b5fde21d9f 100644
>> --- a/include/uapi/linux/blkzoned.h
>> +++ b/include/uapi/linux/blkzoned.h
>> @@ -142,6 +142,38 @@ struct blk_zone_range {
>>  	__u64		nr_sectors;
>>  };
>>
>> +/**
>> + * enum blk_zone_action - Zone state transitions managed from user-space
>> + *
>> + * @BLK_ZONE_MGMT_CLOSE: Transition to Closed state
>> + * @BLK_ZONE_MGMT_FINISH: Transition to Finish state
>> + * @BLK_ZONE_MGMT_OPEN: Transition to Open state
>> + * @BLK_ZONE_MGMT_RESET: Transition to Reset state
>> + */
>> +enum blk_zone_action {
>> +	BLK_ZONE_MGMT_CLOSE	= 0x1,
>> +	BLK_ZONE_MGMT_FINISH	= 0x2,
>> +	BLK_ZONE_MGMT_OPEN	= 0x3,
>> +	BLK_ZONE_MGMT_RESET	= 0x4,
>> +};
>> +
>> +/**
>> + * struct blk_zone_mgmt - Extended zoned management
>> + *
>> + * @action: Zone action as in described in enum blk_zone_action
>> + * @flags: Flags for the action
>> + * @sector: Starting sector of the first zone to operate on
>> + * @nr_sectors: Total number of sectors of all zones to operate on
>> + */
>> +struct blk_zone_mgmt {
>> +	__u8		action;
>> +	__u8		resv3[3];
>> +	__u32		flags;
>> +	__u64		sector;
>> +	__u64		nr_sectors;
>> +	__u64		resv31;
>> +};
>> +
>>  /**
>>   * Zoned block device ioctl's:
>>   *
>> @@ -166,5 +198,6 @@ struct blk_zone_range {
>>  #define BLKOPENZONE	_IOW(0x12, 134, struct blk_zone_range)
>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>> +#define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>>
>>  #endif /* _UAPI_BLKZONED_H */
>>
>
>Without defining what the flags can be, it is hard to judge what will change
>from the current distinct ioctls.
>

The first flag is the one to select all zones. Down the line we have other
modifiers that make sense, but it is true that they are not public yet.

Would you like to wait until then or is it an option to revise the IOCTL
now?

Javier
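
To make the flags question above concrete, here is a minimal user-space
sketch of how a caller could fill the proposed struct blk_zone_mgmt, and
how unknown flag bits could be rejected so the interface stays extensible.
This is an illustration only: the struct is a local copy of the layout in
the quoted patch, and ZMGMT_FLAG_SELECT_ALL with its bit position is a
hypothetical name, not something defined by the posted series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the layout proposed in the patch (not a uapi header). */
struct zmgmt_sketch {
	uint8_t  action;
	uint8_t  resv3[3];
	uint32_t flags;
	uint64_t sector;
	uint64_t nr_sectors;
	uint64_t resv31;
};

/* The layout has no implicit padding: 1 + 3 + 4 + 8 + 8 + 8 bytes. */
static_assert(sizeof(struct zmgmt_sketch) == 32, "unexpected struct size");

/* Action values as listed in enum blk_zone_action in the patch. */
enum {
	ZMGMT_CLOSE  = 0x1,
	ZMGMT_FINISH = 0x2,
	ZMGMT_OPEN   = 0x3,
	ZMGMT_RESET  = 0x4,
};

/* Hypothetical flag: name and bit position are assumptions. */
#define ZMGMT_FLAG_SELECT_ALL	(1U << 0)
#define ZMGMT_KNOWN_FLAGS	(ZMGMT_FLAG_SELECT_ALL)

/* Fill the request; reject unknown actions and unknown flag bits. */
static int zmgmt_fill(struct zmgmt_sketch *z, uint8_t action, uint32_t flags,
		      uint64_t sector, uint64_t nr_sectors)
{
	if (action < ZMGMT_CLOSE || action > ZMGMT_RESET)
		return -EINVAL;
	if (flags & ~ZMGMT_KNOWN_FLAGS)
		return -EINVAL;

	memset(z, 0, sizeof(*z));	/* reserved fields must stay zero */
	z->action = action;
	z->flags = flags;
	z->sector = sector;
	z->nr_sectors = nr_sectors;
	return 0;
}
```

Rejecting unknown bits up front is what would allow new modifiers to be
added later without changing the meaning of existing callers.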

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-26  1:45   ` Damien Le Moal
@ 2020-06-26  6:03     ` Javier González
  2020-06-26  6:38       ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:03 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 01:45, Damien Le Moal wrote:
>On 2020/06/25 21:22, Javier González wrote:
>> From: Javier González <javier.gonz@samsung.com>
>>
>> Add zone attributes field to the blk_zone structure. Use ZNS attributes
>> as base for zoned block devices in general.
>>
>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  drivers/nvme/host/zns.c       |  1 +
>>  include/uapi/linux/blkzoned.h | 13 ++++++++++++-
>>  2 files changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index 258d03610cc0..7d8381fe7665 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -195,6 +195,7 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>>  	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>> +	zone.attr = entry->za;
>>
>>  	return cb(&zone, idx, data);
>>  }
>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>> index 0c49a4b2ce5d..2e43a00e3425 100644
>> --- a/include/uapi/linux/blkzoned.h
>> +++ b/include/uapi/linux/blkzoned.h
>> @@ -82,6 +82,16 @@ enum blk_zone_report_flags {
>>  	BLK_ZONE_REP_CAPACITY	= (1 << 0),
>>  };
>>
>> +/**
>> + * Zone Attributes
>
>This is a user interface file. Please document the meaning of each attribute.
>

Sure.

>> + */
>> +enum blk_zone_attr {
>> +	BLK_ZONE_ATTR_ZFC	= 1 << 0,
>> +	BLK_ZONE_ATTR_FZR	= 1 << 1,
>> +	BLK_ZONE_ATTR_RZR	= 1 << 2,
>> +	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
>These are ZNS specific, right ? Integrating the 2 ZBC/ZAC attributes in this
>list would be nice, namely non_seq and reset. That will imply patching sd.c.
>

Of course. I will look at non_seq and reset. Any other that should go
in here?

>> +};
>> +
>>  /**
>>   * struct blk_zone - Zone descriptor for BLKREPORTZONE ioctl.
>>   *
>> @@ -108,7 +118,8 @@ struct blk_zone {
>>  	__u8	cond;		/* Zone condition */
>>  	__u8	non_seq;	/* Non-sequential write resources active */
>>  	__u8	reset;		/* Reset write pointer recommended */
>> -	__u8	resv[4];
>> +	__u8	attr;		/* Zone attributes */
>> +	__u8	resv[3];
>>  	__u64	capacity;	/* Zone capacity in number of sectors */
>>  	__u8	reserved[24];
>>  };
>>
>
>You are missing a BLK_ZONE_REP_ATTR report flag to indicate to the user that the
>attr field is present, used and valid.
>
>enum blk_zone_report_flags {
> 	BLK_ZONE_REP_CAPACITY	= (1 << 0),
>+	BLK_ZONE_REP_ATTR	= (1 << 1),
> };
>
>is I think needed.

Good point. I will add that on a V2.

Javier
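
As a rough sketch of the report-flag point above, a user-space consumer
would only trust the attr byte when the report says it is valid. The
BLK_ZONE_REP_ATTR bit is the value suggested in the review, not an
existing uapi flag, and the struct below is a local copy of blk_zone with
the proposed attr field. The attribute meanings in the comments follow my
reading of the ZNS specification and should be checked against it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CAPACITY exists in the uapi header; ATTR is only proposed in this thread. */
#define REP_CAPACITY	(1U << 0)
#define REP_ATTR	(1U << 1)

/* Attribute bits as in the patch. */
#define ATTR_ZFC	(1U << 0)	/* zone finished by controller */
#define ATTR_FZR	(1U << 1)	/* finish zone recommended */
#define ATTR_RZR	(1U << 2)	/* reset zone recommended */
#define ATTR_ZDEV	(1U << 7)	/* zone descriptor extension valid */

/* Local copy of struct blk_zone with the proposed attr field. */
struct zone_sketch {
	uint64_t start;
	uint64_t len;
	uint64_t wp;
	uint8_t  type;
	uint8_t  cond;
	uint8_t  non_seq;
	uint8_t  reset;
	uint8_t  attr;
	uint8_t  resv[3];
	uint64_t capacity;
	uint8_t  reserved[24];
};

/* Taking one byte out of resv[] must not change the descriptor size. */
static_assert(sizeof(struct zone_sketch) == 64, "descriptor size changed");

/* Only look at attr when the report header says the field is valid. */
static bool zone_reset_recommended(uint32_t report_flags,
				   const struct zone_sketch *z)
{
	if (!(report_flags & REP_ATTR))
		return false;
	return (z->attr & ATTR_RZR) != 0;
}
```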

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  1:34   ` Damien Le Moal
@ 2020-06-26  6:08     ` Javier González
  2020-06-26  6:42       ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:08 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 01:34, Damien Le Moal wrote:
>On 2020/06/25 21:22, Javier González wrote:
>> From: Javier González <javier.gonz@samsung.com>
>>
>> Add support for offline transition on the zoned block device using the
>> new zone management IOCTL
>>
>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  block/blk-core.c              | 2 ++
>>  block/blk-zoned.c             | 3 +++
>>  drivers/nvme/host/core.c      | 3 +++
>>  include/linux/blk_types.h     | 3 +++
>>  include/linux/blkdev.h        | 1 -
>>  include/uapi/linux/blkzoned.h | 1 +
>>  6 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 03252af8c82c..589cbdacc5ec 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>>  	REQ_OP_NAME(ZONE_CLOSE),
>>  	REQ_OP_NAME(ZONE_FINISH),
>>  	REQ_OP_NAME(ZONE_APPEND),
>> +	REQ_OP_NAME(ZONE_OFFLINE),
>>  	REQ_OP_NAME(WRITE_SAME),
>>  	REQ_OP_NAME(WRITE_ZEROES),
>>  	REQ_OP_NAME(SCSI_IN),
>> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>>  	case REQ_OP_ZONE_OPEN:
>>  	case REQ_OP_ZONE_CLOSE:
>>  	case REQ_OP_ZONE_FINISH:
>> +	case REQ_OP_ZONE_OFFLINE:
>>  		if (!blk_queue_is_zoned(q))
>>  			goto not_supported;
>>  		break;
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 29194388a1bb..704fc15813d1 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  	case BLK_ZONE_MGMT_RESET:
>>  		op = REQ_OP_ZONE_RESET;
>>  		break;
>> +	case BLK_ZONE_MGMT_OFFLINE:
>> +		op = REQ_OP_ZONE_OFFLINE;
>> +		break;
>>  	default:
>>  		return -ENOTTY;
>>  	}
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index f1215523792b..5b95c81d2a2d 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>  	case REQ_OP_ZONE_FINISH:
>>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>  		break;
>> +	case REQ_OP_ZONE_OFFLINE:
>> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
>> +		break;
>>  	case REQ_OP_WRITE_ZEROES:
>>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>>  		break;
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 16b57fb2b99c..b3921263c3dd 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -316,6 +316,8 @@ enum req_opf {
>>  	REQ_OP_ZONE_FINISH	= 12,
>>  	/* write data at the current zone write pointer */
>>  	REQ_OP_ZONE_APPEND	= 13,
>> +	/* Transition a zone to offline */
>> +	REQ_OP_ZONE_OFFLINE	= 14,
>>
>>  	/* SCSI passthrough using struct scsi_request */
>>  	REQ_OP_SCSI_IN		= 32,
>> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>>  	case REQ_OP_ZONE_OPEN:
>>  	case REQ_OP_ZONE_CLOSE:
>>  	case REQ_OP_ZONE_FINISH:
>> +	case REQ_OP_ZONE_OFFLINE:
>>  		return true;
>>  	default:
>>  		return false;
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index bd8521f94dc4..8308d8a3720b 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>> -
>>  #else /* CONFIG_BLK_DEV_ZONED */
>>
>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>> index a8c89fe58f97..d0978ee10fc7 100644
>> --- a/include/uapi/linux/blkzoned.h
>> +++ b/include/uapi/linux/blkzoned.h
>> @@ -155,6 +155,7 @@ enum blk_zone_action {
>>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>>  };
>>
>>  /**
>>
>
>As mentioned in previous email, the usefulness of this is dubious. Please
>elaborate in the commit message. Sure NVMe ZNS defines this and we can support
>it. But without a good use case, what is the point ?

The use case is to transition zones in the read-only state to offline once
we are done moving valid data. It is easier to explicitly manage zones
that are not usable by having them all in the offline state.

>
>scsi SD driver will return BLK_STS_NOTSUPP if this offlining is sent to a
>ZBC/ZAC drive. Not nice. Having a sysfs attribute "max_offline_zone_sectors" or
>the like to indicate support by the device or not would be nicer.

We can do that.

>
>Does offlining ALL zones make any sense ? Because this patch does not prevent the
>use of the REQ_ZONE_ALL flags introduced in patch 2. Probably not a good idea to
>allow offlining all zones, no ?

AFAIK the transition to offline is only valid when coming from a
read-only state. I did think of adding a check, but I can see that other
transitions go directly to the driver and then the device, so I decided
to follow the same model. If you think it is better, we can add the
check.

Javier
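
If such a check were added, it could look roughly like the sketch below.
This is an illustration under assumptions, not code from the series: the
function name is hypothetical, and it presumes the caller already knows
the current zone condition (for example from a prior zone report). The
condition values match those in include/uapi/linux/blkzoned.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Zone condition values as defined in include/uapi/linux/blkzoned.h. */
enum zone_cond_sketch {
	ZONE_COND_READONLY = 0xD,
	ZONE_COND_FULL     = 0xE,
	ZONE_COND_OFFLINE  = 0xF,
};

/*
 * Hypothetical pre-check for an offline request:
 *  - refuse "select all", since offlining every zone is unlikely to be
 *    what anyone wants;
 *  - only allow the transition for a zone that is currently read-only.
 */
static int zone_offline_check(unsigned int cond, bool select_all)
{
	assert(cond <= 0xF);	/* zone conditions are 4-bit values */

	if (select_all)
		return -EINVAL;
	if (cond != ZONE_COND_READONLY)
		return -EINVAL;
	return 0;
}
```

The alternative, as noted above, is to keep following the existing model
and let the device reject invalid transitions itself.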

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-26  0:04     ` Damien Le Moal
@ 2020-06-26  6:13       ` Javier González
  2020-06-26  6:49         ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:13 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Keith Busch, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 26.06.2020 00:04, Damien Le Moal wrote:
>On 2020/06/26 6:49, Keith Busch wrote:
>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>> index 7d8381fe7665..de806788a184 100644
>>> --- a/drivers/nvme/host/zns.c
>>> +++ b/drivers/nvme/host/zns.c
>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>  		sector += ns->zsze * nz;
>>>  	}
>>>
>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>> +				zone_idx, ns->nr_zones);
>>> +		ret = -EINVAL;
>>> +		goto out_free;
>>> +	}
>>> +
>>>  	ret = zone_idx;
>>
>> nr_zones is unsigned, so it's never < 0.
>>
>> The API we're providing doesn't require zone_idx equal the namespace's
>> nr_zones at the end, though. A subset of the total number of zones can
>> be requested here.
>>

I did see nr_zones coming with -1; guess it is my compiler.

>
>Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>reported zone descriptor in the current report range requested by the user,
>which is not necessarily for the entire drive (i.e., provided nr zones is less
>than the total number of zones of the disk and/or start sector is > 0). So
>zone_idx indicates the actual number of zones reported, it is not the total

I see. I believed that when nr_zones comes in undefined we could assume
that zone_idx is absolute, but I may be wrong.

Does it make sense to support this check with an additional counter and
an explicit nr_zones initialization when undefined, or would you
prefer to just remove it as Matias suggested?

Javier
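
The -1 observation and the unsigned comparison can be reconciled: a -1
passed as an unsigned int becomes UINT_MAX, so a test against "< 0" can
never fire. The sketch below shows one possible shape for the check that
only compares counts when the request covers the whole device. It is an
assumption-laden illustration: ALL_ZONES is a local name (the kernel has,
as far as I recall, a similar BLK_ALL_ZONES definition), and the helper
and its parameters are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Local name for "report everything"; -1 converted to unsigned. */
#define ALL_ZONES	((unsigned int)-1)

static_assert(ALL_ZONES == UINT_MAX, "-1 converts to the largest unsigned value");

/*
 * Return true when the number of reported zones is plausible.
 * A partial report (non-zero start sector, or fewer zones requested
 * than the device has) cannot be compared against the device total.
 */
static bool zone_count_consistent(unsigned long long start_sector,
				  unsigned int nr_zones_requested,
				  unsigned int zones_reported,
				  unsigned int zones_total)
{
	if (start_sector != 0 || nr_zones_requested < zones_total)
		return true;
	return zones_reported == zones_total;
}
```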

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  1:14       ` Damien Le Moal
@ 2020-06-26  6:18         ` Javier González
  2020-06-26  9:11         ` hch
  1 sibling, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26  6:18 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Matias Bjørling, linux-nvme, linux-block, hch, kbusch, sagi,
	axboe, SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 26.06.2020 01:14, Damien Le Moal wrote:
>On 2020/06/26 4:48, Javier González wrote:
>> On 25.06.2020 16:12, Matias Bjørling wrote:
>>> On 25/06/2020 14.21, Javier González wrote:
>>>> From: Javier González <javier.gonz@samsung.com>
>>>>
>>>> Add support for offline transition on the zoned block device using the
>>>> new zone management IOCTL
>>>>
>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>> ---
>>>>  block/blk-core.c              | 2 ++
>>>>  block/blk-zoned.c             | 3 +++
>>>>  drivers/nvme/host/core.c      | 3 +++
>>>>  include/linux/blk_types.h     | 3 +++
>>>>  include/linux/blkdev.h        | 1 -
>>>>  include/uapi/linux/blkzoned.h | 1 +
>>>>  6 files changed, 12 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 03252af8c82c..589cbdacc5ec 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>>>>  	REQ_OP_NAME(ZONE_CLOSE),
>>>>  	REQ_OP_NAME(ZONE_FINISH),
>>>>  	REQ_OP_NAME(ZONE_APPEND),
>>>> +	REQ_OP_NAME(ZONE_OFFLINE),
>>>>  	REQ_OP_NAME(WRITE_SAME),
>>>>  	REQ_OP_NAME(WRITE_ZEROES),
>>>>  	REQ_OP_NAME(SCSI_IN),
>>>> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>>>>  	case REQ_OP_ZONE_OPEN:
>>>>  	case REQ_OP_ZONE_CLOSE:
>>>>  	case REQ_OP_ZONE_FINISH:
>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>  		if (!blk_queue_is_zoned(q))
>>>>  			goto not_supported;
>>>>  		break;
>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>> index 29194388a1bb..704fc15813d1 100644
>>>> --- a/block/blk-zoned.c
>>>> +++ b/block/blk-zoned.c
>>>> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  	case BLK_ZONE_MGMT_RESET:
>>>>  		op = REQ_OP_ZONE_RESET;
>>>>  		break;
>>>> +	case BLK_ZONE_MGMT_OFFLINE:
>>>> +		op = REQ_OP_ZONE_OFFLINE;
>>>> +		break;
>>>>  	default:
>>>>  		return -ENOTTY;
>>>>  	}
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index f1215523792b..5b95c81d2a2d 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>>>  	case REQ_OP_ZONE_FINISH:
>>>>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>>>  		break;
>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
>>>> +		break;
>>>>  	case REQ_OP_WRITE_ZEROES:
>>>>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>>>>  		break;
>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>> index 16b57fb2b99c..b3921263c3dd 100644
>>>> --- a/include/linux/blk_types.h
>>>> +++ b/include/linux/blk_types.h
>>>> @@ -316,6 +316,8 @@ enum req_opf {
>>>>  	REQ_OP_ZONE_FINISH	= 12,
>>>>  	/* write data at the current zone write pointer */
>>>>  	REQ_OP_ZONE_APPEND	= 13,
>>>> +	/* Transition a zone to offline */
>>>> +	REQ_OP_ZONE_OFFLINE	= 14,
>>>>  	/* SCSI passthrough using struct scsi_request */
>>>>  	REQ_OP_SCSI_IN		= 32,
>>>> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>>>>  	case REQ_OP_ZONE_OPEN:
>>>>  	case REQ_OP_ZONE_CLOSE:
>>>>  	case REQ_OP_ZONE_FINISH:
>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>  		return true;
>>>>  	default:
>>>>  		return false;
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bd8521f94dc4..8308d8a3720b 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				  unsigned int cmd, unsigned long arg);
>>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				  unsigned int cmd, unsigned long arg);
>>>> -
>>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>> index a8c89fe58f97..d0978ee10fc7 100644
>>>> --- a/include/uapi/linux/blkzoned.h
>>>> +++ b/include/uapi/linux/blkzoned.h
>>>> @@ -155,6 +155,7 @@ enum blk_zone_action {
>>>>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>>> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>>>>  };
>>>>  /**
>>>
>>> I am not sure this makes sense to expose through the kernel zone api.
>>> One of the goals of the kernel zone API is to be a layer that provides
>>> an unified zone model across SMR HDDs and ZNS SSDs. The offline zone
>>> operation, as defined in the ZNS specification, does not have an
>>> equivalent in SMR HDDs (ZAC/ZBC).
>>>
>>> This is different from the Zone Capacity change, where the zone
>>> capacity simply was zone size for SMR HDDs. Making it easy to support.
>>> That is not the same for ZAC/ZBC, that does not offer the offline
>>> operation to transition zones in read only state to offline state.
>>
>> I agree that a unified interface is desirable. However, the truth is
>> that ZAC/ZBC are different, and will differ more and more as time goes
>> by. We can deal with the differences at the driver level or with checks
>> at the API level, but limiting ZNS with ZAC/ZBC is a hard constraint.
>
>As long as you keep ZNS namespace report itself as being "host-managed" like
>ZBC/ZAC disks, we need the consistency and common interface. If you break that,
>the meaning of the zoned model queue attribute disappears and an application or
>in-kernel user cannot rely on this model anymore to know how the drive will behave.

I agree. The API should be clean and common, but that should not prevent
extensions to ZAC/ZBC or ZNS specifics. The suggestions you propose in
the other patches make sense to do this properly.

>
>> Note too that I chose to only support this particular transition on the
>> new management IOCTL to avoid confusion for existing ZAC/ZBC users.
>>
>> It would be good to clarify what is the plan for kernel APIs moving
>> forward, as I believe there is a general desire to support new ZNS
>> features, which will not necessarily be replicated in SMR drives.
>
>What the drive is supposed to support and its behavior is determined by the
>zoned model. ZNS standard was written so that most things have an equivalent
>with ZBC/ZAC, e.g. the zone state machine is nearly identical. Differences are
>either emulated (e.g. zone append scsi emulation), or not supported (e.g. zone
>capacity change) so that the kernel follows the same pattern and maintains a
>coherent behavior between device protocols for the host-managed model.

Yes.

>
>Think of a file system, or any other in-kernel user. If they have to change
>their code based on the device type (NVMe vs SCSI), then the zoned block device
>interface is broken. Right now, that is not the case, everything works equally
>well on ZNS and SCSI, modulo the need by a user for conventional zones that ZNS
>do not define. But that is still consistent with the host-managed model since
>conventional zones are optional.

I think this is a very nice goal, but I do believe we will not be able
to keep 100% consistent behavior. We will have new features on either
of the specs that do not make sense on the other and we will have to
deal with them. We can deal with this as generic optional features, but
at the end of the day, applications will need to check whether the
feature is selected or not.

This said, I agree that we need a good way to communicate this, and the
suggestions you made with sysfs parameters and flags make sense to me.

>
>For this particular patch, there is currently no in-kernel user, and it is not
>clear how this will be useful to applications. At least please clarify this. And
>most likely, similarly to discard etc operations that are optional, having a
>sysfs attribute and in-kernel API indicating if the drive supports offlining
>zones will be needed. Otherwise, the caller will have to play with error codes
>to understand if the drive does not support the command or if it is supported
>but the command failed. Not nice. Better to know before issuing the command.

Makes sense. See the reply on the patch itself.

Javier
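
For the "know before issuing the command" point, an application-side
check against a sysfs attribute could be as small as the sketch below.
Everything device-specific here is an assumption: the attribute name
max_offline_zone_sectors is only the one floated in the review and does
not exist, and the helper takes the text already read from such a file
rather than opening a path itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Interpret the contents of a hypothetical queue attribute such as
 * "max_offline_zone_sectors": a value of 0 (or anything unparsable)
 * is treated as "offline transition not supported".
 */
static bool offline_supported(const char *attr_text)
{
	char *end;
	unsigned long long v;

	assert(attr_text != NULL);

	errno = 0;
	v = strtoull(attr_text, &end, 10);
	if (errno != 0 || end == attr_text)
		return false;	/* overflow or no digits at all */
	return v > 0;
}
```

This mirrors how optional operations such as discard are usually probed:
read the limit first, and only issue the command when it is non-zero.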

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-26  1:38   ` Damien Le Moal
@ 2020-06-26  6:22     ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26  6:22 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 01:38, Damien Le Moal wrote:
>On 2020/06/25 21:22, Javier González wrote:
>> From: Javier González <javier.gonz@samsung.com>
>>
>> With the addition of ZNS, a new set of properties have been added to the
>> zoned block device. This patch introduces a new IOCTL to expose these
>> properties to user space.
>>
>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>>  block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
>>  block/ioctl.c                 |  2 ++
>>  drivers/nvme/host/core.c      |  2 ++
>>  drivers/nvme/host/nvme.h      | 11 +++++++
>>  drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
>>  include/linux/blkdev.h        |  9 ++++++
>>  include/uapi/linux/blkzoned.h | 13 ++++++++
>>  7 files changed, 144 insertions(+)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 704fc15813d1..39ec72af9537 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>  }
>>  EXPORT_SYMBOL_GPL(blkdev_report_zones);
>>
>> +static int blkdev_report_zonedev_prop(struct block_device *bdev,
>> +				      struct blk_zone_dev *zprop)
>> +{
>> +	struct gendisk *disk = bdev->bd_disk;
>> +
>> +	if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
>> +		return -EOPNOTSUPP;
>> +
>> +	return disk->fops->report_zone_p(disk, zprop);
>> +}
>> +
>>  static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
>>  						sector_t sector,
>>  						sector_t nr_sectors)
>> @@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  				GFP_KERNEL);
>>  }
>>
>> +int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>> +			unsigned int cmd, unsigned long arg)
>> +{
>> +	void __user *argp = (void __user *)arg;
>> +	struct request_queue *q;
>> +	struct blk_zone_dev zprop;
>> +	int ret;
>> +
>> +	if (!argp)
>> +		return -EINVAL;
>> +
>> +	q = bdev_get_queue(bdev);
>> +	if (!q)
>> +		return -ENXIO;
>> +
>> +	if (!blk_queue_is_zoned(q))
>> +		return -ENOTTY;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EACCES;
>> +
>> +	if (!(mode & FMODE_WRITE))
>> +		return -EBADF;
>> +
>> +	ret = blkdev_report_zonedev_prop(bdev, &zprop);
>> +	if (ret)
>> +		goto out;
>> +
>> +	if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
>> +		return -EFAULT;
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>  						   unsigned int nr_zones)
>>  {
>> diff --git a/block/ioctl.c b/block/ioctl.c
>> index 0ea29754e7dd..f7b4e0f2dd4c 100644
>> --- a/block/ioctl.c
>> +++ b/block/ioctl.c
>> @@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>  		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>  	case BLKMGMTZONE:
>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>> +	case BLKZONEDEVPROP:
>> +		return blkdev_zonedev_prop(bdev, mode, cmd, arg);
>>  	case BLKGETZONESZ:
>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>>  	case BLKGETNRZONES:
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 5b95c81d2a2d..a32c909a915f 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -2254,6 +2254,7 @@ static const struct block_device_operations nvme_fops = {
>>  	.getgeo		= nvme_getgeo,
>>  	.revalidate_disk= nvme_revalidate_disk,
>>  	.report_zones	= nvme_report_zones,
>> +	.report_zone_p	= nvme_report_zone_prop,
>>  	.pr_ops		= &nvme_pr_ops,
>>  };
>>
>> @@ -2280,6 +2281,7 @@ const struct block_device_operations nvme_ns_head_ops = {
>>  	.compat_ioctl	= nvme_compat_ioctl,
>>  	.getgeo		= nvme_getgeo,
>>  	.report_zones	= nvme_report_zones,
>> +	.report_zone_p	= nvme_report_zone_prop,
>>  	.pr_ops		= &nvme_pr_ops,
>>  };
>>  #endif /* CONFIG_NVME_MULTIPATH */
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index ecf443efdf91..172e0531f37f 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -407,6 +407,14 @@ struct nvme_ns {
>>  	u8 pi_type;
>>  #ifdef CONFIG_BLK_DEV_ZONED
>>  	u64 zsze;
>> +
>> +	u32 nr_zones;
>> +	u32 mar;
>> +	u32 mor;
>> +	u32 rrl;
>> +	u32 frl;
>> +	u16 zoc;
>> +	u16 ozcs;
>>  #endif
>>  	unsigned long features;
>>  	unsigned long flags;
>> @@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>  int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>  		      unsigned int nr_zones, report_zones_cb cb, void *data);
>>
>> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop);
>> +
>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  				       struct nvme_command *cmnd,
>>  				       enum nvme_zone_mgmt_action action);
>>  #else
>>  #define nvme_report_zones NULL
>> +#define nvme_report_zone_prop NULL
>>
>>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>>  		struct request *req, struct nvme_command *cmnd,
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index 2e6512ac6f01..258d03610cc0 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -32,6 +32,28 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>>  	return 0;
>>  }
>>
>> +static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
>> +{
>> +	struct nvme_command c = { };
>> +	struct nvme_zone_report report;
>> +	int buflen = sizeof(struct nvme_zone_report);
>> +	int ret;
>> +
>> +	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
>> +	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
>> +	c.zmr.slba = cpu_to_le64(0);
>> +	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
>> +	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>> +	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>> +	c.zmr.pr = 0;
>> +
>> +	ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return le64_to_cpu(report.nr_zones);
>> +}
>> +
>>  int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>  			  unsigned lbaf)
>>  {
>> @@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>  		goto free_data;
>>  	}
>>
>> +	ns->nr_zones = nvme_zns_nr_zones(ns);
>> +	ns->mar = le32_to_cpu(id->mar);
>> +	ns->mor = le32_to_cpu(id->mor);
>> +	ns->rrl = le32_to_cpu(id->rrl);
>> +	ns->frl = le32_to_cpu(id->frl);
>> +	ns->zoc = le16_to_cpu(id->zoc);
>> +
>>  	q->limits.zoned = BLK_ZONED_HM;
>>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>>  free_data:
>> @@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>  	return ret;
>>  }
>>
>> +static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct blk_zone_dev *zprop)
>> +{
>> +	zprop->nr_zones = ns->nr_zones;
>> +	zprop->zoc = ns->zoc;
>> +	zprop->ozcs = ns->ozcs;
>> +	zprop->mar = ns->mar;
>> +	zprop->mor = ns->mor;
>> +	zprop->rrl = ns->rrl;
>> +	zprop->frl = ns->frl;
>> +
>> +	return 0;
>> +}
>> +
>> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop)
>> +{
>> +	struct nvme_ns_head *head = NULL;
>> +	struct nvme_ns *ns;
>> +	int srcu_idx, ret;
>> +
>> +	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>> +	if (unlikely(!ns))
>> +		return -EWOULDBLOCK;
>> +
>> +	if (ns->head->ids.csi == NVME_CSI_ZNS)
>> +		ret = nvme_ns_report_zone_prop(ns, zprop);
>> +	else
>> +		ret = -EINVAL;
>> +	nvme_put_ns_from_disk(head, srcu_idx);
>> +
>> +	return ret;
>> +}
>> +
>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>  {
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 8308d8a3720b..0c0faa58b7f4 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>  				  unsigned int cmd, unsigned long arg);
>> +extern int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>> +			unsigned int cmd, unsigned long arg);
>>  #else /* CONFIG_BLK_DEV_ZONED */
>>
>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>> @@ -400,6 +402,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>  	return -ENOTTY;
>>  }
>>
>> +static inline int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>> +				      unsigned int cmd, unsigned long arg)
>> +{
>> +	return -ENOTTY;
>> +}
>> +
>>  #endif /* CONFIG_BLK_DEV_ZONED */
>>
>>  struct request_queue {
>> @@ -1770,6 +1778,7 @@ struct block_device_operations {
>>  	int (*report_zones)(struct gendisk *, sector_t sector,
>>  			unsigned int nr_zones, report_zones_cb cb, void *data);
>>  	char *(*devnode)(struct gendisk *disk, umode_t *mode);
>> +	int (*report_zone_p)(struct gendisk *disk, struct blk_zone_dev *zprop);
>>  	struct module *owner;
>>  	const struct pr_ops *pr_ops;
>>  };
>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>> index d0978ee10fc7..0c49a4b2ce5d 100644
>> --- a/include/uapi/linux/blkzoned.h
>> +++ b/include/uapi/linux/blkzoned.h
>> @@ -142,6 +142,18 @@ struct blk_zone_range {
>>  	__u64		nr_sectors;
>>  };
>>
>> +struct blk_zone_dev {
>> +	__u32	nr_zones;
>> +	__u32	mar;
>> +	__u32	mor;
>> +	__u32	rrl;
>> +	__u32	frl;
>> +	__u16	zoc;
>> +	__u16	ozcs;
>> +	__u32	rsv31[2];
>> +	__u64	rsv63[4];
>> +};
>> +
>>  /**
>>   * enum blk_zone_action - Zone state transitions managed from user-space
>>   *
>> @@ -209,5 +221,6 @@ struct blk_zone_mgmt {
>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>>  #define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>> +#define BLKZONEDEVPROP	_IOR(0x12, 138, struct blk_zone_dev)
>>
>>  #endif /* _UAPI_BLKZONED_H */
>>
>
>As commented already, NVMe passthrough or sysfs device attributes would be much
>better. See scsi: there is no IOCTL defined to obtain every single log page or
>mode page defined. Passthrough is the interface to do that. For frequently used
>log pages giving device information, sysfs is used as a cache. See all the
>"vpd_pgXX" entries under /sys/block/sdX/device. All of this is done by the
>driver. Not the block layer.

Ok. Let me look into that. I was hesitant to do this because (i) there
are a lot of parameters and (ii) the naming is a bit inconsistent at the
moment. I guess we could group all the new parameters under a zone_* prefix.
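
As a rough illustration of what an application would do once the values
live in sysfs, here is a minimal user-space sketch that reads one numeric
attribute from a text file. The helper name read_u32_attr() and any
attribute name such as "zone_mar" are assumptions made for this example
only; the thread does not settle the final names or their location.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one unsigned 32-bit decimal value from a sysfs-style text file.
 * Hypothetical helper: the attribute path is supplied by the caller.
 * Returns 0 on success or a negative errno value on failure.
 */
static int read_u32_attr(const char *path, uint32_t *val)
{
	char buf[32];
	char *end;
	unsigned long long v;
	FILE *f;

	assert(path && val);

	f = fopen(path, "r");
	if (!f)
		return -errno;

	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	/* Reject empty input, overflow and values wider than 32 bits. */
	errno = 0;
	v = strtoull(buf, &end, 10);
	if (errno || end == buf || v > UINT32_MAX)
		return -EINVAL;

	*val = (uint32_t)v;
	return 0;
}
```

A caller would pass the full attribute path, for instance a hypothetical
zone_mar file under the namespace's sysfs directory, and check the return
value before using the result.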

Javier

* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 19:58       ` Matias Bjørling
@ 2020-06-26  6:24         ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26  6:24 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 25.06.2020 21:58, Matias Bjørling wrote:
>On 25/06/2020 21.42, Javier González wrote:
>>On 25.06.2020 15:10, Matias Bjørling wrote:
>>>On 25/06/2020 14.21, Javier González wrote:
>>>>From: Javier González <javier.gonz@samsung.com>
>>>>
>>>>With the addition of ZNS, a new set of properties have been 
>>>>added to the
>>>>zoned block device. This patch introduces a new IOCTL to expose these
>>>>properties to user space.
>>>>
>>>>Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>>Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>>Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>>---
>>>> block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
>>>> block/ioctl.c                 |  2 ++
>>>> drivers/nvme/host/core.c      |  2 ++
>>>> drivers/nvme/host/nvme.h      | 11 +++++++
>>>> drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
>>>> include/linux/blkdev.h        |  9 ++++++
>>>> include/uapi/linux/blkzoned.h | 13 ++++++++
>>>> 7 files changed, 144 insertions(+)
>>>>
>>>>diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>>index 704fc15813d1..39ec72af9537 100644
>>>>--- a/block/blk-zoned.c
>>>>+++ b/block/blk-zoned.c
>>>>@@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device 
>>>>*bdev, sector_t sector,
>>>> }
>>>> EXPORT_SYMBOL_GPL(blkdev_report_zones);
>>>>+static int blkdev_report_zonedev_prop(struct block_device *bdev,
>>>>+                      struct blk_zone_dev *zprop)
>>>>+{
>>>>+    struct gendisk *disk = bdev->bd_disk;
>>>>+
>>>>+    if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
>>>>+        return -EOPNOTSUPP;
>>>>+
>>>>+    return disk->fops->report_zone_p(disk, zprop);
>>>>+}
>>>>+
>>>> static inline bool blkdev_allow_reset_all_zones(struct 
>>>>block_device *bdev,
>>>>                         sector_t sector,
>>>>                         sector_t nr_sectors)
>>>>@@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct 
>>>>block_device *bdev, fmode_t mode,
>>>>                 GFP_KERNEL);
>>>> }
>>>>+int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>>>+            unsigned int cmd, unsigned long arg)
>>>>+{
>>>>+    void __user *argp = (void __user *)arg;
>>>>+    struct request_queue *q;
>>>>+    struct blk_zone_dev zprop;
>>>>+    int ret;
>>>>+
>>>>+    if (!argp)
>>>>+        return -EINVAL;
>>>>+
>>>>+    q = bdev_get_queue(bdev);
>>>>+    if (!q)
>>>>+        return -ENXIO;
>>>>+
>>>>+    if (!blk_queue_is_zoned(q))
>>>>+        return -ENOTTY;
>>>>+
>>>>+    if (!capable(CAP_SYS_ADMIN))
>>>>+        return -EACCES;
>>>>+
>>>>+    if (!(mode & FMODE_WRITE))
>>>>+        return -EBADF;
>>>>+
>>>>+    ret = blkdev_report_zonedev_prop(bdev, &zprop);
>>>>+    if (ret)
>>>>+        goto out;
>>>>+
>>>>+    if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
>>>>+        return -EFAULT;
>>>>+
>>>>+out:
>>>>+    return ret;
>>>>+}
>>>>+
>>>> static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>>>                            unsigned int nr_zones)
>>>> {
>>>>diff --git a/block/ioctl.c b/block/ioctl.c
>>>>index 0ea29754e7dd..f7b4e0f2dd4c 100644
>>>>--- a/block/ioctl.c
>>>>+++ b/block/ioctl.c
>>>>@@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct 
>>>>block_device *bdev, fmode_t mode,
>>>>         return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>>>     case BLKMGMTZONE:
>>>>         return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>>>+    case BLKZONEDEVPROP:
>>>>+        return blkdev_zonedev_prop(bdev, mode, cmd, arg);
>>>>     case BLKGETZONESZ:
>>>>         return put_uint(argp, bdev_zone_sectors(bdev));
>>>>     case BLKGETNRZONES:
>>>>diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>index 5b95c81d2a2d..a32c909a915f 100644
>>>>--- a/drivers/nvme/host/core.c
>>>>+++ b/drivers/nvme/host/core.c
>>>>@@ -2254,6 +2254,7 @@ static const struct 
>>>>block_device_operations nvme_fops = {
>>>>     .getgeo        = nvme_getgeo,
>>>>     .revalidate_disk= nvme_revalidate_disk,
>>>>     .report_zones    = nvme_report_zones,
>>>>+    .report_zone_p    = nvme_report_zone_prop,
>>>>     .pr_ops        = &nvme_pr_ops,
>>>> };
>>>>@@ -2280,6 +2281,7 @@ const struct block_device_operations 
>>>>nvme_ns_head_ops = {
>>>>     .compat_ioctl    = nvme_compat_ioctl,
>>>>     .getgeo        = nvme_getgeo,
>>>>     .report_zones    = nvme_report_zones,
>>>>+    .report_zone_p    = nvme_report_zone_prop,
>>>>     .pr_ops        = &nvme_pr_ops,
>>>> };
>>>> #endif /* CONFIG_NVME_MULTIPATH */
>>>>diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>>>index ecf443efdf91..172e0531f37f 100644
>>>>--- a/drivers/nvme/host/nvme.h
>>>>+++ b/drivers/nvme/host/nvme.h
>>>>@@ -407,6 +407,14 @@ struct nvme_ns {
>>>>     u8 pi_type;
>>>> #ifdef CONFIG_BLK_DEV_ZONED
>>>>     u64 zsze;
>>>>+
>>>>+    u32 nr_zones;
>>>>+    u32 mar;
>>>>+    u32 mor;
>>>>+    u32 rrl;
>>>>+    u32 frl;
>>>>+    u16 zoc;
>>>>+    u16 ozcs;
>>>> #endif
>>>>     unsigned long features;
>>>>     unsigned long flags;
>>>>@@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk 
>>>>*disk, struct nvme_ns *ns,
>>>> int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>>               unsigned int nr_zones, report_zones_cb cb, void *data);
>>>>+int nvme_report_zone_prop(struct gendisk *disk, struct 
>>>>blk_zone_dev *zprop);
>>>>+
>>>> blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, 
>>>>struct request *req,
>>>>                        struct nvme_command *cmnd,
>>>>                        enum nvme_zone_mgmt_action action);
>>>> #else
>>>> #define nvme_report_zones NULL
>>>>+#define nvme_report_zone_prop NULL
>>>> static inline blk_status_t nvme_setup_zone_mgmt_send(struct 
>>>>nvme_ns *ns,
>>>>         struct request *req, struct nvme_command *cmnd,
>>>>diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>index 2e6512ac6f01..258d03610cc0 100644
>>>>--- a/drivers/nvme/host/zns.c
>>>>+++ b/drivers/nvme/host/zns.c
>>>>@@ -32,6 +32,28 @@ static int nvme_set_max_append(struct 
>>>>nvme_ctrl *ctrl)
>>>>     return 0;
>>>> }
>>>>+static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
>>>>+{
>>>>+    struct nvme_command c = { };
>>>>+    struct nvme_zone_report report;
>>>>+    int buflen = sizeof(struct nvme_zone_report);
>>>>+    int ret;
>>>>+
>>>>+    c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
>>>>+    c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
>>>>+    c.zmr.slba = cpu_to_le64(0);
>>>>+    c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
>>>>+    c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>>>>+    c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>>>+    c.zmr.pr = 0;
>>>>+
>>>>+    ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
>>>>+    if (ret)
>>>>+        return ret;
>>>>+
>>>>+    return le64_to_cpu(report.nr_zones);
>>>>+}
>>>>+
>>>> int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>>               unsigned lbaf)
>>>> {
>>>>@@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk 
>>>>*disk, struct nvme_ns *ns,
>>>>         goto free_data;
>>>>     }
>>>>+    ns->nr_zones = nvme_zns_nr_zones(ns);
>>>>+    ns->mar = le32_to_cpu(id->mar);
>>>>+    ns->mor = le32_to_cpu(id->mor);
>>>>+    ns->rrl = le32_to_cpu(id->rrl);
>>>>+    ns->frl = le32_to_cpu(id->frl);
>>>>+    ns->zoc = le16_to_cpu(id->zoc);
>>>>+
>>>>     q->limits.zoned = BLK_ZONED_HM;
>>>>     blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>>>> free_data:
>>>>@@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, 
>>>>sector_t sector,
>>>>     return ret;
>>>> }
>>>>+static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct 
>>>>blk_zone_dev *zprop)
>>>>+{
>>>>+    zprop->nr_zones = ns->nr_zones;
>>>>+    zprop->zoc = ns->zoc;
>>>>+    zprop->ozcs = ns->ozcs;
>>>>+    zprop->mar = ns->mar;
>>>>+    zprop->mor = ns->mor;
>>>>+    zprop->rrl = ns->rrl;
>>>>+    zprop->frl = ns->frl;
>>>>+
>>>>+    return 0;
>>>>+}
>>>>+
>>>>+int nvme_report_zone_prop(struct gendisk *disk, struct 
>>>>blk_zone_dev *zprop)
>>>>+{
>>>>+    struct nvme_ns_head *head = NULL;
>>>>+    struct nvme_ns *ns;
>>>>+    int srcu_idx, ret;
>>>>+
>>>>+    ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>>>>+    if (unlikely(!ns))
>>>>+        return -EWOULDBLOCK;
>>>>+
>>>>+    if (ns->head->ids.csi == NVME_CSI_ZNS)
>>>>+        ret = nvme_ns_report_zone_prop(ns, zprop);
>>>>+    else
>>>>+        ret = -EINVAL;
>>>>+    nvme_put_ns_from_disk(head, srcu_idx);
>>>>+
>>>>+    return ret;
>>>>+}
>>>>+
>>>> blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, 
>>>>struct request *req,
>>>>         struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>>> {
>>>>diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>index 8308d8a3720b..0c0faa58b7f4 100644
>>>>--- a/include/linux/blkdev.h
>>>>+++ b/include/linux/blkdev.h
>>>>@@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct 
>>>>block_device *bdev, fmode_t mode,
>>>>                   unsigned int cmd, unsigned long arg);
>>>> extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, 
>>>>fmode_t mode,
>>>>                   unsigned int cmd, unsigned long arg);
>>>>+extern int blkdev_zonedev_prop(struct block_device *bdev, 
>>>>fmode_t mode,
>>>>+            unsigned int cmd, unsigned long arg);
>>>> #else /* CONFIG_BLK_DEV_ZONED */
>>>> static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>>>@@ -400,6 +402,12 @@ static inline int 
>>>>blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>>>     return -ENOTTY;
>>>> }
>>>>+static inline int blkdev_zonedev_prop(struct block_device 
>>>>*bdev, fmode_t mode,
>>>>+                      unsigned int cmd, unsigned long arg)
>>>>+{
>>>>+    return -ENOTTY;
>>>>+}
>>>>+
>>>> #endif /* CONFIG_BLK_DEV_ZONED */
>>>> struct request_queue {
>>>>@@ -1770,6 +1778,7 @@ struct block_device_operations {
>>>>     int (*report_zones)(struct gendisk *, sector_t sector,
>>>>             unsigned int nr_zones, report_zones_cb cb, void *data);
>>>>     char *(*devnode)(struct gendisk *disk, umode_t *mode);
>>>>+    int (*report_zone_p)(struct gendisk *disk, struct 
>>>>blk_zone_dev *zprop);
>>>>     struct module *owner;
>>>>     const struct pr_ops *pr_ops;
>>>> };
>>>>diff --git a/include/uapi/linux/blkzoned.h 
>>>>b/include/uapi/linux/blkzoned.h
>>>>index d0978ee10fc7..0c49a4b2ce5d 100644
>>>>--- a/include/uapi/linux/blkzoned.h
>>>>+++ b/include/uapi/linux/blkzoned.h
>>>>@@ -142,6 +142,18 @@ struct blk_zone_range {
>>>>     __u64        nr_sectors;
>>>> };
>>>>+struct blk_zone_dev {
>>>>+    __u32    nr_zones;
>>>>+    __u32    mar;
>>>>+    __u32    mor;
>>>>+    __u32    rrl;
>>>>+    __u32    frl;
>>>>+    __u16    zoc;
>>>>+    __u16    ozcs;
>>>>+    __u32    rsv31[2];
>>>>+    __u64    rsv63[4];
>>>>+};
>>>>+
>>>> /**
>>>>  * enum blk_zone_action - Zone state transitions managed from 
>>>>user-space
>>>>  *
>>>>@@ -209,5 +221,6 @@ struct blk_zone_mgmt {
>>>> #define BLKCLOSEZONE    _IOW(0x12, 135, struct blk_zone_range)
>>>> #define BLKFINISHZONE    _IOW(0x12, 136, struct blk_zone_range)
>>>> #define BLKMGMTZONE    _IOR(0x12, 137, struct blk_zone_mgmt)
>>>>+#define BLKZONEDEVPROP    _IOR(0x12, 138, struct blk_zone_dev)
>>>> #endif /* _UAPI_BLKZONED_H */
>>>
>>>Nak. These properties can already be retrieved using the nvme 
>>>ioctl passthru command and support has also been added to 
>>>nvme-cli.
>>>
>>
>>These properties are intended to be consumed by an application, so
>>nvme-cli is not of much use. I would also like to avoid sysfs variables.
>>
>I can recommend libnvme https://github.com/linux-nvme/libnvme
>
>It provides an easy way to retrieve the options.

Thanks for the advice. We use libnvme already, it is really nice. As I
commented to Damien, I was hesitant to put so many new parameters in
sysfs. I will move them there in a V2.

Javier

* Re: [PATCH 0/6] ZNS: Extra features for current patches
  2020-06-25 19:53       ` Matias Bjørling
@ 2020-06-26  6:26         ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26  6:26 UTC (permalink / raw)
  To: Matias Bjørling; +Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe

On 25.06.2020 21:53, Matias Bjørling wrote:
>On 25/06/2020 21.39, Javier González wrote:
>>On 25.06.2020 16:48, Matias Bjørling wrote:
>>>On 25/06/2020 15.04, Matias Bjørling wrote:
>>>>On 25/06/2020 14.21, Javier González wrote:
>>>>>From: Javier González <javier.gonz@samsung.com>
>>>>>
>>>>>This patchset extends zoned device functionality on top of the 
>>>>>existing
>>>>>v3 ZNS patchset that Keith sent last week.
>>>>>
>>>>>Patches 1-5 are zoned block interface and IOCTL additions to 
>>>>>expose ZNS
>>>>>values to user-space. One major change is the addition of a new zone
>>>>>management IOCTL that allows to extend zone management commands with
>>>>>flags. I recall a conversation in the mailing list from early 
>>>>>this year
>>>>>where a similar approach was proposed by Matias, but never made it
>>>>>upstream. We extended the IOCTL here to align with the 
>>>>>comments in that
>>>>>thread. Here, we are happy to get sign-offs by anyone that contributed
>>>>>to the thread - just comment here or on the patch.
>>>>
>>>>The original patchset is available here: 
>>>>https://lkml.org/lkml/2019/6/21/419
>>>>
>>>>We wanted to wait posting our updated patches until the base 
>>>>patches were upstream. I guess the cat is out of the bag. :)
>>>>
>>>>For the open/finish/reset patch, you'll want to take a look at 
>>>>the original patchset, and apply the feedback from that thread 
>>>>to your patch. Please also consider the users of these 
>>>>operations, e.g., dm, scsi, null_blk, etc. The original patchset 
>>>>has patches for that.
>>>>
>>>Please disregard the above - I forgot that the original patchset 
>>>actually went upstream.
>>>
>>>You're right that we discussed (I at least discussed it internally 
>>>with Damien, but I can't find the mail) having one mgmt issuing 
>>>the commands. We didn't go ahead and add it at that point due to 
>>>ZNS still being in a fluffy state.
>>>
>>
>>Does the proposed IOCTL align with the use cases you have in mind? I'm
>>happy to take it in a different series if you want to add patches to it
>>for other drivers (scsi, null_blk, etc.).
>
>I think the ioctl makes sense. I wanted to have it like that 
>originally. I'm still thinking through if it covers the short-term 
>cases for the upcoming TPs.

Yes. You can see that some of this is intended to support at least one
of the TPs that are in the TWG. It is also suitable for a couple TPs we
are working on internally and expect to bring to the group.

But please, do make sure it covers TPs that you know will be shared in
the TWG.

Javier

* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-26  0:57       ` Damien Le Moal
@ 2020-06-26  6:27         ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26  6:27 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Matias Bjørling, linux-nvme, linux-block, hch, kbusch, sagi,
	axboe, SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 26.06.2020 00:57, Damien Le Moal wrote:
>On 2020/06/26 4:42, Javier González wrote:
>> On 25.06.2020 15:10, Matias Bjørling wrote:
>>> On 25/06/2020 14.21, Javier González wrote:
>>>> From: Javier González <javier.gonz@samsung.com>
>>>>
>>>> With the addition of ZNS, a new set of properties have been added to the
>>>> zoned block device. This patch introduces a new IOCTL to expose these
>>>> properties to user space.
>>>>
>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>> ---
>>>>  block/blk-zoned.c             | 46 ++++++++++++++++++++++++++
>>>>  block/ioctl.c                 |  2 ++
>>>>  drivers/nvme/host/core.c      |  2 ++
>>>>  drivers/nvme/host/nvme.h      | 11 +++++++
>>>>  drivers/nvme/host/zns.c       | 61 +++++++++++++++++++++++++++++++++++
>>>>  include/linux/blkdev.h        |  9 ++++++
>>>>  include/uapi/linux/blkzoned.h | 13 ++++++++
>>>>  7 files changed, 144 insertions(+)
>>>>
>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>> index 704fc15813d1..39ec72af9537 100644
>>>> --- a/block/blk-zoned.c
>>>> +++ b/block/blk-zoned.c
>>>> @@ -169,6 +169,17 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(blkdev_report_zones);
>>>> +static int blkdev_report_zonedev_prop(struct block_device *bdev,
>>>> +				      struct blk_zone_dev *zprop)
>>>> +{
>>>> +	struct gendisk *disk = bdev->bd_disk;
>>>> +
>>>> +	if (WARN_ON_ONCE(!bdev->bd_disk->fops->report_zone_p))
>>>> +		return -EOPNOTSUPP;
>>>> +
>>>> +	return disk->fops->report_zone_p(disk, zprop);
>>>> +}
>>>> +
>>>>  static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
>>>>  						sector_t sector,
>>>>  						sector_t nr_sectors)
>>>> @@ -430,6 +441,41 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				GFP_KERNEL);
>>>>  }
>>>> +int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>>> +			unsigned int cmd, unsigned long arg)
>>>> +{
>>>> +	void __user *argp = (void __user *)arg;
>>>> +	struct request_queue *q;
>>>> +	struct blk_zone_dev zprop;
>>>> +	int ret;
>>>> +
>>>> +	if (!argp)
>>>> +		return -EINVAL;
>>>> +
>>>> +	q = bdev_get_queue(bdev);
>>>> +	if (!q)
>>>> +		return -ENXIO;
>>>> +
>>>> +	if (!blk_queue_is_zoned(q))
>>>> +		return -ENOTTY;
>>>> +
>>>> +	if (!capable(CAP_SYS_ADMIN))
>>>> +		return -EACCES;
>>>> +
>>>> +	if (!(mode & FMODE_WRITE))
>>>> +		return -EBADF;
>>>> +
>>>> +	ret = blkdev_report_zonedev_prop(bdev, &zprop);
>>>> +	if (ret)
>>>> +		goto out;
>>>> +
>>>> +	if (copy_to_user(argp, &zprop, sizeof(struct blk_zone_dev)))
>>>> +		return -EFAULT;
>>>> +
>>>> +out:
>>>> +	return ret;
>>>> +}
>>>> +
>>>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>>>  						   unsigned int nr_zones)
>>>>  {
>>>> diff --git a/block/ioctl.c b/block/ioctl.c
>>>> index 0ea29754e7dd..f7b4e0f2dd4c 100644
>>>> --- a/block/ioctl.c
>>>> +++ b/block/ioctl.c
>>>> @@ -517,6 +517,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>>>  	case BLKMGMTZONE:
>>>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>>> +	case BLKZONEDEVPROP:
>>>> +		return blkdev_zonedev_prop(bdev, mode, cmd, arg);
>>>>  	case BLKGETZONESZ:
>>>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>>>>  	case BLKGETNRZONES:
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 5b95c81d2a2d..a32c909a915f 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -2254,6 +2254,7 @@ static const struct block_device_operations nvme_fops = {
>>>>  	.getgeo		= nvme_getgeo,
>>>>  	.revalidate_disk= nvme_revalidate_disk,
>>>>  	.report_zones	= nvme_report_zones,
>>>> +	.report_zone_p	= nvme_report_zone_prop,
>>>>  	.pr_ops		= &nvme_pr_ops,
>>>>  };
>>>> @@ -2280,6 +2281,7 @@ const struct block_device_operations nvme_ns_head_ops = {
>>>>  	.compat_ioctl	= nvme_compat_ioctl,
>>>>  	.getgeo		= nvme_getgeo,
>>>>  	.report_zones	= nvme_report_zones,
>>>> +	.report_zone_p	= nvme_report_zone_prop,
>>>>  	.pr_ops		= &nvme_pr_ops,
>>>>  };
>>>>  #endif /* CONFIG_NVME_MULTIPATH */
>>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>>> index ecf443efdf91..172e0531f37f 100644
>>>> --- a/drivers/nvme/host/nvme.h
>>>> +++ b/drivers/nvme/host/nvme.h
>>>> @@ -407,6 +407,14 @@ struct nvme_ns {
>>>>  	u8 pi_type;
>>>>  #ifdef CONFIG_BLK_DEV_ZONED
>>>>  	u64 zsze;
>>>> +
>>>> +	u32 nr_zones;
>>>> +	u32 mar;
>>>> +	u32 mor;
>>>> +	u32 rrl;
>>>> +	u32 frl;
>>>> +	u16 zoc;
>>>> +	u16 ozcs;
>>>>  #endif
>>>>  	unsigned long features;
>>>>  	unsigned long flags;
>>>> @@ -704,11 +712,14 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>>  int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>>  		      unsigned int nr_zones, report_zones_cb cb, void *data);
>>>> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop);
>>>> +
>>>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>>>  				       struct nvme_command *cmnd,
>>>>  				       enum nvme_zone_mgmt_action action);
>>>>  #else
>>>>  #define nvme_report_zones NULL
>>>> +#define nvme_report_zone_prop NULL
>>>>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>>>>  		struct request *req, struct nvme_command *cmnd,
>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>> index 2e6512ac6f01..258d03610cc0 100644
>>>> --- a/drivers/nvme/host/zns.c
>>>> +++ b/drivers/nvme/host/zns.c
>>>> @@ -32,6 +32,28 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>>>>  	return 0;
>>>>  }
>>>> +static u64 nvme_zns_nr_zones(struct nvme_ns *ns)
>>>> +{
>>>> +	struct nvme_command c = { };
>>>> +	struct nvme_zone_report report;
>>>> +	int buflen = sizeof(struct nvme_zone_report);
>>>> +	int ret;
>>>> +
>>>> +	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
>>>> +	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
>>>> +	c.zmr.slba = cpu_to_le64(0);
>>>> +	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
>>>> +	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>>>> +	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>>> +	c.zmr.pr = 0;
>>>> +
>>>> +	ret = nvme_submit_sync_cmd(ns->queue, &c, &report, buflen);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	return le64_to_cpu(report.nr_zones);
>>>> +}
>>>> +
>>>>  int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>>  			  unsigned lbaf)
>>>>  {
>>>> @@ -87,6 +109,13 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>>>>  		goto free_data;
>>>>  	}
>>>> +	ns->nr_zones = nvme_zns_nr_zones(ns);
>>>> +	ns->mar = le32_to_cpu(id->mar);
>>>> +	ns->mor = le32_to_cpu(id->mor);
>>>> +	ns->rrl = le32_to_cpu(id->rrl);
>>>> +	ns->frl = le32_to_cpu(id->frl);
>>>> +	ns->zoc = le16_to_cpu(id->zoc);
>>>> +
>>>>  	q->limits.zoned = BLK_ZONED_HM;
>>>>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>>>>  free_data:
>>>> @@ -230,6 +259,38 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
>>>>  	return ret;
>>>>  }
>>>> +static int nvme_ns_report_zone_prop(struct nvme_ns *ns, struct blk_zone_dev *zprop)
>>>> +{
>>>> +	zprop->nr_zones = ns->nr_zones;
>>>> +	zprop->zoc = ns->zoc;
>>>> +	zprop->ozcs = ns->ozcs;
>>>> +	zprop->mar = ns->mar;
>>>> +	zprop->mor = ns->mor;
>>>> +	zprop->rrl = ns->rrl;
>>>> +	zprop->frl = ns->frl;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +int nvme_report_zone_prop(struct gendisk *disk, struct blk_zone_dev *zprop)
>>>> +{
>>>> +	struct nvme_ns_head *head = NULL;
>>>> +	struct nvme_ns *ns;
>>>> +	int srcu_idx, ret;
>>>> +
>>>> +	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>>>> +	if (unlikely(!ns))
>>>> +		return -EWOULDBLOCK;
>>>> +
>>>> +	if (ns->head->ids.csi == NVME_CSI_ZNS)
>>>> +		ret = nvme_ns_report_zone_prop(ns, zprop);
>>>> +	else
>>>> +		ret = -EINVAL;
>>>> +	nvme_put_ns_from_disk(head, srcu_idx);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>>>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>>>  {
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index 8308d8a3720b..0c0faa58b7f4 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -372,6 +372,8 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				  unsigned int cmd, unsigned long arg);
>>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				  unsigned int cmd, unsigned long arg);
>>>> +extern int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>>> +			unsigned int cmd, unsigned long arg);
>>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>>> @@ -400,6 +402,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>>>  	return -ENOTTY;
>>>>  }
>>>> +static inline int blkdev_zonedev_prop(struct block_device *bdev, fmode_t mode,
>>>> +				      unsigned int cmd, unsigned long arg)
>>>> +{
>>>> +	return -ENOTTY;
>>>> +}
>>>> +
>>>>  #endif /* CONFIG_BLK_DEV_ZONED */
>>>>  struct request_queue {
>>>> @@ -1770,6 +1778,7 @@ struct block_device_operations {
>>>>  	int (*report_zones)(struct gendisk *, sector_t sector,
>>>>  			unsigned int nr_zones, report_zones_cb cb, void *data);
>>>>  	char *(*devnode)(struct gendisk *disk, umode_t *mode);
>>>> +	int (*report_zone_p)(struct gendisk *disk, struct blk_zone_dev *zprop);
>>>>  	struct module *owner;
>>>>  	const struct pr_ops *pr_ops;
>>>>  };
>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>> index d0978ee10fc7..0c49a4b2ce5d 100644
>>>> --- a/include/uapi/linux/blkzoned.h
>>>> +++ b/include/uapi/linux/blkzoned.h
>>>> @@ -142,6 +142,18 @@ struct blk_zone_range {
>>>>  	__u64		nr_sectors;
>>>>  };
>>>> +struct blk_zone_dev {
>>>> +	__u32	nr_zones;
>>>> +	__u32	mar;
>>>> +	__u32	mor;
>>>> +	__u32	rrl;
>>>> +	__u32	frl;
>>>> +	__u16	zoc;
>>>> +	__u16	ozcs;
>>>> +	__u32	rsv31[2];
>>>> +	__u64	rsv63[4];
>>>> +};
>>>> +
>>>>  /**
>>>>   * enum blk_zone_action - Zone state transitions managed from user-space
>>>>   *
>>>> @@ -209,5 +221,6 @@ struct blk_zone_mgmt {
>>>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>>>>  #define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>>>> +#define BLKZONEDEVPROP	_IOR(0x12, 138, struct blk_zone_dev)
>>>>  #endif /* _UAPI_BLKZONED_H */
>>>
>>> Nak. These properties can already be retrieved using the nvme ioctl
>>> passthru command and support has also been added to nvme-cli.
>>>
>>
>> These properties are intended to be consumed by an application, so
>> nvme-cli is not of much use. I would also like to avoid sysfs variables.
>
>Why not sysfs? These are device properties, they can be defined as sysfs device
>attributes. If there is an equivalent for ZBC/ZAC drives, you could even
>consider defining them as queue attributes as long as you also patch sd.c, but
>that may be pushing things too far.
>
>In any case, sysfs seems a much better approach to me as that would be limited
>to the NVMe driver rather than all this additional code in the block layer.

Ok. Will send a V2 moving it to sysfs.

Thanks!
Javier

* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-25 20:25       ` Keith Busch
@ 2020-06-26  6:28         ` Javier González
  2020-06-26 15:52           ` Keith Busch
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matias Bjørling, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 26.06.2020 05:25, Keith Busch wrote:
>On Thu, Jun 25, 2020 at 09:42:48PM +0200, Javier González wrote:
>> We can use nvme passthru, but this bypasses the zoned block abstraction.
>> Why not represent ZNS features in the standard zoned block API?
>
>This looks too nvme zns specific to want the block layer in the middle.
>Just use the driver's passthrough interface.

Ok. Is it OK with you to expose them in sysfs as Damien suggested?
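
For the passthrough route, the application gets back the raw Identify
Namespace data for the zoned command set and decodes the fields itself.
The sketch below shows only that decoding step. The byte offsets (zoc at
0, ozcs at 2, mar at 4, mor at 8, rrl at 12, frl at 16) are assumed to
match the field order of the structure the patch reads through id->mar
and friends; struct zns_props and zns_decode_id_ns() are made-up names
for this example, and fetching the buffer (admin passthrough ioctl or
libnvme) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the properties discussed in this thread. */
struct zns_props {
	uint16_t zoc;
	uint16_t ozcs;
	uint32_t mar;
	uint32_t mor;
	uint32_t rrl;
	uint32_t frl;
};

/* Little-endian loads that work regardless of host byte order. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Decode the leading fields of a ZNS Identify Namespace buffer.
 * Offsets are an assumption stated in the text above.
 * Returns 0 on success, -1 if the buffer is too short.
 */
static int zns_decode_id_ns(const uint8_t *buf, size_t len,
			    struct zns_props *out)
{
	assert(buf && out);

	if (len < 20)
		return -1;

	out->zoc  = get_le16(buf + 0);
	out->ozcs = get_le16(buf + 2);
	out->mar  = get_le32(buf + 4);
	out->mor  = get_le32(buf + 8);
	out->rrl  = get_le32(buf + 12);
	out->frl  = get_le32(buf + 16);
	return 0;
}
```

Decoding byte by byte avoids depending on host endianness or on struct
packing, which is the main thing an application has to get right when it
bypasses the block layer abstraction.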

Javier

* Re: [PATCH 2/6] block: add support for selecting all zones
  2020-06-26  5:58     ` Javier González
@ 2020-06-26  6:35       ` Damien Le Moal
  2020-06-26  6:52         ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  6:35 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 14:58, Javier González wrote:
> On 26.06.2020 01:27, Damien Le Moal wrote:
>> On 2020/06/25 21:22, Javier González wrote:
>>> From: Javier González <javier.gonz@samsung.com>
>>>
>>> Add flag to allow selecting all zones on a single zone management
>>> operation
>>>
>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>>  block/blk-zoned.c             | 3 +++
>>>  include/linux/blk_types.h     | 3 ++-
>>>  include/uapi/linux/blkzoned.h | 9 +++++++++
>>>  3 files changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>> index e87c60004dc5..29194388a1bb 100644
>>> --- a/block/blk-zoned.c
>>> +++ b/block/blk-zoned.c
>>> @@ -420,6 +420,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  		return -ENOTTY;
>>>  	}
>>>
>>> +	if (zmgmt.flags & BLK_ZONE_SELECT_ALL)
>>> +		op |= REQ_ZONE_ALL;
>>> +
>>>  	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>>>  				GFP_KERNEL);
>>>  }
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index ccb895f911b1..16b57fb2b99c 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -351,6 +351,7 @@ enum req_flag_bits {
>>>  	 * work item to avoid such priority inversions.
>>>  	 */
>>>  	__REQ_CGROUP_PUNT,
>>> +	__REQ_ZONE_ALL,		/* apply zone operation to all zones */
>>>
>>>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>>>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
>>> @@ -378,7 +379,7 @@ enum req_flag_bits {
>>>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>>>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
>>>  #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
>>> -
>>> +#define REQ_ZONE_ALL		(1ULL << __REQ_ZONE_ALL)
>>>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>>>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
>>>
>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>> index 07b5fde21d9f..a8c89fe58f97 100644
>>> --- a/include/uapi/linux/blkzoned.h
>>> +++ b/include/uapi/linux/blkzoned.h
>>> @@ -157,6 +157,15 @@ enum blk_zone_action {
>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>>  };
>>>
>>> +/**
>>> + * enum blk_zone_mgmt_flags - Flags for blk_zone_mgmt
>>> + *
>>> + * BLK_ZONE_SELECT_ALL: Select all zones for current zone action
>>> + */
>>> +enum blk_zone_mgmt_flags {
>>> +	BLK_ZONE_SELECT_ALL	= 1 << 0,
>>> +};
>>> +
>>>  /**
>>>   * struct blk_zone_mgmt - Extended zoned management
>>>   *
>>>
>>
>> NACK.
>>
>> Details:
>> 1) REQ_OP_ZONE_RESET together with REQ_ZONE_ALL is the same as
>> REQ_OP_ZONE_RESET_ALL, isn't it ? You are duplicating a functionality that
>> already exists.
>> 2) The patch introduces REQ_ZONE_ALL at the block layer only without defining
>> how it ties into SCSI and NVMe driver use of it. Is REQ_ZONE_ALL indicating that
>> the zone management commands are to be executed with the ALL bit set ? If yes,
>> that will break device-mapper. See the special code for handling
>> REQ_OP_ZONE_RESET_ALL. That code is in place for a reason: the target block
>> device may not be an entire physical device. In that case, applying a zone
>> management command to all zones of the physical drive is wrong.
>> 3) REQ_ZONE_ALL seems completely equivalent to specifying a sector range of [0
>> .. drive capacity]. So what is the point ? The current interface handles that.
>> That is how we chose between REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL right now.
>> 4) Without any in-kernel user, I do not see the point. And for applications, I
>> do not see any good use case for doing open all, close all, offline all or
>> finish all. If you have any such good use case, please elaborate.
>>
> 
> The main use is reset all, but without having to look through all zones,
> as it imposes an overhead when we have a large number of zones. Having
> the possibility to offload it to HW is more efficient.
> 
> I had not thought about the device mapper use case. Would it be an
> option to translate this into REQ_OP_ZONE_RESET_ALL when we have a
> device mapper (or any other case where this might break) and then let
> the bit go to the driver if it applies to the whole device?

But REQ_OP_ZONE_RESET_ALL is already implemented and supported and will reset
all zones of a drive using a single command if the ioctl is called for the
entire sector range of a physical drive. For device mapper with a partial
mapping, the per zone reset loop will be used. If you have no other use case for
the REQ_ZONE_ALL flag, what is the point here ? Reset is already optimized for
the all zones case
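
As a rough illustration of the selection described above (a simplified
model, not the kernel code itself): the single-command path is taken only
when the requested range covers the whole physical drive, and anything
else falls back to the per-zone loop. The names toy_bdev, pick_reset_op
and the enum values are made up for this sketch; the real check lives in
the block layer and may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two reset paths discussed in the thread. */
enum zone_reset_op {
	ZONE_RESET_PER_ZONE,	/* one REQ_OP_ZONE_RESET per zone in the range */
	ZONE_RESET_ALL,		/* a single REQ_OP_ZONE_RESET_ALL */
};

/* Minimal model of a block device; this is NOT struct block_device. */
struct toy_bdev {
	uint64_t capacity;	/* size in 512-byte sectors */
	bool is_whole_device;	/* false for a partial device-mapper mapping */
};

/*
 * Pick the reset path for the range [sector, sector + nr_sectors).
 * Reset-all is only safe when the range is the entire physical drive.
 */
static enum zone_reset_op pick_reset_op(const struct toy_bdev *bdev,
					uint64_t sector, uint64_t nr_sectors)
{
	assert(bdev != NULL);

	if (bdev->is_whole_device && sector == 0 &&
	    nr_sectors == bdev->capacity)
		return ZONE_RESET_ALL;

	return ZONE_RESET_PER_ZONE;
}
```

With this model, a partial mapping never reaches the reset-all path, even
when the caller asks for its full sector range.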

> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 1/6] block: introduce IOCTL for zone mgmt
  2020-06-26  6:01     ` Javier González
@ 2020-06-26  6:37       ` Damien Le Moal
  2020-06-26  6:51         ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  6:37 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 15:01, Javier González wrote:
> On 26.06.2020 01:17, Damien Le Moal wrote:
>> On 2020/06/25 21:22, Javier González wrote:
>>> From: Javier González <javier.gonz@samsung.com>
>>>
>>> The current IOCTL interface for zone management is limited by struct
>>> blk_zone_range, which is unfortunately not extensible. In particular, the
>>> lack of flags is problematic for ZNS zoned devices.
>>>
>>> This new IOCTL is designed to be a superset of the current one, with
>>> support for flags and bits for extensibility.
>>>
>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>>  block/blk-zoned.c             | 56 ++++++++++++++++++++++++++++++++++-
>>>  block/ioctl.c                 |  2 ++
>>>  include/linux/blkdev.h        |  9 ++++++
>>>  include/uapi/linux/blkzoned.h | 33 +++++++++++++++++++++
>>>  4 files changed, 99 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>> index 81152a260354..e87c60004dc5 100644
>>> --- a/block/blk-zoned.c
>>> +++ b/block/blk-zoned.c
>>> @@ -322,7 +322,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>>   * BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE ioctl processing.
>>>   * Called from blkdev_ioctl.
>>>   */
>>> -int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>> +int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>  			   unsigned int cmd, unsigned long arg)
>>>  {
>>>  	void __user *argp = (void __user *)arg;
>>> @@ -370,6 +370,60 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				GFP_KERNEL);
>>>  }
>>>
>>> +/*
>>> + * Zone management ioctl processing. Extension of blkdev_zone_ops_ioctl(), with
>>> + * blk_zone_mgmt structure.
>>> + *
>>> + * Called from blkdev_ioctl.
>>> + */
>>> +int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>> +			   unsigned int cmd, unsigned long arg)
>>> +{
>>> +	void __user *argp = (void __user *)arg;
>>> +	struct request_queue *q;
>>> +	struct blk_zone_mgmt zmgmt;
>>> +	enum req_opf op;
>>> +
>>> +	if (!argp)
>>> +		return -EINVAL;
>>> +
>>> +	q = bdev_get_queue(bdev);
>>> +	if (!q)
>>> +		return -ENXIO;
>>> +
>>> +	if (!blk_queue_is_zoned(q))
>>> +		return -ENOTTY;
>>> +
>>> +	if (!capable(CAP_SYS_ADMIN))
>>> +		return -EACCES;
>>> +
>>> +	if (!(mode & FMODE_WRITE))
>>> +		return -EBADF;
>>> +
>>> +	if (copy_from_user(&zmgmt, argp, sizeof(struct blk_zone_mgmt)))
>>> +		return -EFAULT;
>>> +
>>> +	switch (zmgmt.action) {
>>> +	case BLK_ZONE_MGMT_CLOSE:
>>> +		op = REQ_OP_ZONE_CLOSE;
>>> +		break;
>>> +	case BLK_ZONE_MGMT_FINISH:
>>> +		op = REQ_OP_ZONE_FINISH;
>>> +		break;
>>> +	case BLK_ZONE_MGMT_OPEN:
>>> +		op = REQ_OP_ZONE_OPEN;
>>> +		break;
>>> +	case BLK_ZONE_MGMT_RESET:
>>> +		op = REQ_OP_ZONE_RESET;
>>> +		break;
>>> +	default:
>>> +		return -ENOTTY;
>>> +	}
>>> +
>>> +	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>>> +				GFP_KERNEL);
>>> +}
>>> +
>>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>>  						   unsigned int nr_zones)
>>>  {
>>> diff --git a/block/ioctl.c b/block/ioctl.c
>>> index bdb3bbb253d9..0ea29754e7dd 100644
>>> --- a/block/ioctl.c
>>> +++ b/block/ioctl.c
>>> @@ -514,6 +514,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>>  	case BLKOPENZONE:
>>>  	case BLKCLOSEZONE:
>>>  	case BLKFINISHZONE:
>>> +		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>> +	case BLKMGMTZONE:
>>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>>  	case BLKGETZONESZ:
>>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 8fd900998b4e..bd8521f94dc4 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -368,6 +368,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>>
>>>  extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				     unsigned int cmd, unsigned long arg);
>>> +extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>> +				  unsigned int cmd, unsigned long arg);
>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				  unsigned int cmd, unsigned long arg);
>>>
>>> @@ -385,6 +387,13 @@ static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
>>>  	return -ENOTTY;
>>>  }
>>>
>>> +
>>> +static inline int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>> +					unsigned int cmd, unsigned long arg)
>>> +{
>>> +	return -ENOTTY;
>>> +}
>>> +
>>>  static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>>  					 fmode_t mode, unsigned int cmd,
>>>  					 unsigned long arg)
>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>> index 42c3366cc25f..07b5fde21d9f 100644
>>> --- a/include/uapi/linux/blkzoned.h
>>> +++ b/include/uapi/linux/blkzoned.h
>>> @@ -142,6 +142,38 @@ struct blk_zone_range {
>>>  	__u64		nr_sectors;
>>>  };
>>>
>>> +/**
>>> + * enum blk_zone_action - Zone state transitions managed from user-space
>>> + *
>>> + * @BLK_ZONE_MGMT_CLOSE: Transition to Closed state
>>> + * @BLK_ZONE_MGMT_FINISH: Transition to Finish state
>>> + * @BLK_ZONE_MGMT_OPEN: Transition to Open state
>>> + * @BLK_ZONE_MGMT_RESET: Transition to Reset state
>>> + */
>>> +enum blk_zone_action {
>>> +	BLK_ZONE_MGMT_CLOSE	= 0x1,
>>> +	BLK_ZONE_MGMT_FINISH	= 0x2,
>>> +	BLK_ZONE_MGMT_OPEN	= 0x3,
>>> +	BLK_ZONE_MGMT_RESET	= 0x4,
>>> +};
>>> +
>>> +/**
>>> + * struct blk_zone_mgmt - Extended zoned management
>>> + *
>>> + * @action: Zone action as in described in enum blk_zone_action
>>> + * @flags: Flags for the action
>>> + * @sector: Starting sector of the first zone to operate on
>>> + * @nr_sectors: Total number of sectors of all zones to operate on
>>> + */
>>> +struct blk_zone_mgmt {
>>> +	__u8		action;
>>> +	__u8		resv3[3];
>>> +	__u32		flags;
>>> +	__u64		sector;
>>> +	__u64		nr_sectors;
>>> +	__u64		resv31;
>>> +};
>>> +
>>>  /**
>>>   * Zoned block device ioctl's:
>>>   *
>>> @@ -166,5 +198,6 @@ struct blk_zone_range {
>>>  #define BLKOPENZONE	_IOW(0x12, 134, struct blk_zone_range)
>>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>>> +#define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>>>
>>>  #endif /* _UAPI_BLKZONED_H */
>>>
>>
>> Without defining what the flags can be, it is hard to judge what will change
>> from the current distinct ioctls.
>>
> 
> The first flag is the one to select all. Down the line we have other
> modifiers that make sense, but it is true that it is public yet.

You mean *not* public ?

> 
> Would you like to wait until then or is it an option to revise the IOCTL
> now?

Yes. Wait until it is actually needed. Adding code that has no users makes it
impossible to test so not acceptable. As for the "all zones" flag, I already
commented about it.

> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-26  6:03     ` Javier González
@ 2020-06-26  6:38       ` Damien Le Moal
  2020-06-26  6:49         ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  6:38 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 15:03, Javier González wrote:
> On 26.06.2020 01:45, Damien Le Moal wrote:
>> On 2020/06/25 21:22, Javier González wrote:
>>> From: Javier González <javier.gonz@samsung.com>
>>>
>>> Add zone attributes field to the blk_zone structure. Use ZNS attributes
>>> as base for zoned block devices in general.
>>>
>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>>  drivers/nvme/host/zns.c       |  1 +
>>>  include/uapi/linux/blkzoned.h | 13 ++++++++++++-
>>>  2 files changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>> index 258d03610cc0..7d8381fe7665 100644
>>> --- a/drivers/nvme/host/zns.c
>>> +++ b/drivers/nvme/host/zns.c
>>> @@ -195,6 +195,7 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>>>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>>>  	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>>> +	zone.attr = entry->za;
>>>
>>>  	return cb(&zone, idx, data);
>>>  }
>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>> index 0c49a4b2ce5d..2e43a00e3425 100644
>>> --- a/include/uapi/linux/blkzoned.h
>>> +++ b/include/uapi/linux/blkzoned.h
>>> @@ -82,6 +82,16 @@ enum blk_zone_report_flags {
>>>  	BLK_ZONE_REP_CAPACITY	= (1 << 0),
>>>  };
>>>
>>> +/**
>>> + * Zone Attributes
>>
>> This is a user interface file. Please document the meaning of each attribute.
>>
> 
> Sure.
> 
>>> + */
>>> +enum blk_zone_attr {
>>> +	BLK_ZONE_ATTR_ZFC	= 1 << 0,
>>> +	BLK_ZONE_ATTR_FZR	= 1 << 1,
>>> +	BLK_ZONE_ATTR_RZR	= 1 << 2,
>>> +	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
>> These are ZNS specific, right ? Integrating the 2 ZBC/ZAC attributes in this
>> list would be nice, namely non_seq and reset. That will imply patching sd.c.
>>
> 
> Of course. I will look at non_seq and reset. Any other that should go
> in here?

See ZBC specs. These are the only 2 zone attributes defined.

> 
>>> +};
>>> +
>>>  /**
>>>   * struct blk_zone - Zone descriptor for BLKREPORTZONE ioctl.
>>>   *
>>> @@ -108,7 +118,8 @@ struct blk_zone {
>>>  	__u8	cond;		/* Zone condition */
>>>  	__u8	non_seq;	/* Non-sequential write resources active */
>>>  	__u8	reset;		/* Reset write pointer recommended */
>>> -	__u8	resv[4];
>>> +	__u8	attr;		/* Zone attributes */
>>> +	__u8	resv[3];
>>>  	__u64	capacity;	/* Zone capacity in number of sectors */
>>>  	__u8	reserved[24];
>>>  };
>>>
>>
>> You are missing a BLK_ZONE_REP_ATTR report flag to indicate to the user that the
>> attr field is present, used and valid.
>>
>> enum blk_zone_report_flags {
>> 	BLK_ZONE_REP_CAPACITY	= (1 << 0),
>> +	BLK_ZONE_REP_ATTR	= (1 << 1),
>> };
>>
>> is I think needed.
> 
> Good point. I will add that on a V2.
> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research
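
To make the point about the report flag concrete, here is a minimal
user-space sketch of how a consumer might guard access to the new attr
field. It assumes the BLK_ZONE_REP_ATTR flag suggested above with value
(1 << 1). All toy_* types and names are local mock-ups for illustration
and are not the definitions from <linux/blkzoned.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mock-ups of the report flags; the ATTR flag is only a proposal. */
#define TOY_ZONE_REP_CAPACITY	(1u << 0)
#define TOY_ZONE_REP_ATTR	(1u << 1)

/* Reduced zone descriptor carrying just what this sketch needs. */
struct toy_zone {
	uint64_t start;		/* zone start sector */
	uint8_t attr;		/* only meaningful if TOY_ZONE_REP_ATTR is set */
};

/* Reduced report header plus a single zone descriptor. */
struct toy_zone_report {
	uint32_t flags;
	struct toy_zone zone;
};

/*
 * Copy the zone attributes to *attr and return true only when the report
 * says the field is valid. Otherwise leave *attr untouched.
 */
static bool toy_zone_attr_get(const struct toy_zone_report *rep, uint8_t *attr)
{
	assert(rep != NULL && attr != NULL);

	if (!(rep->flags & TOY_ZONE_REP_ATTR))
		return false;

	*attr = rep->zone.attr;
	return true;
}
```

Without such a flag, an application cannot tell a zero attr byte written
by the driver from reserved padding left by a kernel that predates the
field.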


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  6:08     ` Javier González
@ 2020-06-26  6:42       ` Damien Le Moal
  2020-06-26  6:58         ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  6:42 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 15:09, Javier González wrote:
> On 26.06.2020 01:34, Damien Le Moal wrote:
>> On 2020/06/25 21:22, Javier González wrote:
>>> From: Javier González <javier.gonz@samsung.com>
>>>
>>> Add support for offline transition on the zoned block device using the
>>> new zone management IOCTL
>>>
>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>>  block/blk-core.c              | 2 ++
>>>  block/blk-zoned.c             | 3 +++
>>>  drivers/nvme/host/core.c      | 3 +++
>>>  include/linux/blk_types.h     | 3 +++
>>>  include/linux/blkdev.h        | 1 -
>>>  include/uapi/linux/blkzoned.h | 1 +
>>>  6 files changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 03252af8c82c..589cbdacc5ec 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>>>  	REQ_OP_NAME(ZONE_CLOSE),
>>>  	REQ_OP_NAME(ZONE_FINISH),
>>>  	REQ_OP_NAME(ZONE_APPEND),
>>> +	REQ_OP_NAME(ZONE_OFFLINE),
>>>  	REQ_OP_NAME(WRITE_SAME),
>>>  	REQ_OP_NAME(WRITE_ZEROES),
>>>  	REQ_OP_NAME(SCSI_IN),
>>> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>>>  	case REQ_OP_ZONE_OPEN:
>>>  	case REQ_OP_ZONE_CLOSE:
>>>  	case REQ_OP_ZONE_FINISH:
>>> +	case REQ_OP_ZONE_OFFLINE:
>>>  		if (!blk_queue_is_zoned(q))
>>>  			goto not_supported;
>>>  		break;
>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>> index 29194388a1bb..704fc15813d1 100644
>>> --- a/block/blk-zoned.c
>>> +++ b/block/blk-zoned.c
>>> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  	case BLK_ZONE_MGMT_RESET:
>>>  		op = REQ_OP_ZONE_RESET;
>>>  		break;
>>> +	case BLK_ZONE_MGMT_OFFLINE:
>>> +		op = REQ_OP_ZONE_OFFLINE;
>>> +		break;
>>>  	default:
>>>  		return -ENOTTY;
>>>  	}
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index f1215523792b..5b95c81d2a2d 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>>  	case REQ_OP_ZONE_FINISH:
>>>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>>  		break;
>>> +	case REQ_OP_ZONE_OFFLINE:
>>> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
>>> +		break;
>>>  	case REQ_OP_WRITE_ZEROES:
>>>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>>>  		break;
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index 16b57fb2b99c..b3921263c3dd 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -316,6 +316,8 @@ enum req_opf {
>>>  	REQ_OP_ZONE_FINISH	= 12,
>>>  	/* write data at the current zone write pointer */
>>>  	REQ_OP_ZONE_APPEND	= 13,
>>> +	/* Transition a zone to offline */
>>> +	REQ_OP_ZONE_OFFLINE	= 14,
>>>
>>>  	/* SCSI passthrough using struct scsi_request */
>>>  	REQ_OP_SCSI_IN		= 32,
>>> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>>>  	case REQ_OP_ZONE_OPEN:
>>>  	case REQ_OP_ZONE_CLOSE:
>>>  	case REQ_OP_ZONE_FINISH:
>>> +	case REQ_OP_ZONE_OFFLINE:
>>>  		return true;
>>>  	default:
>>>  		return false;
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index bd8521f94dc4..8308d8a3720b 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				  unsigned int cmd, unsigned long arg);
>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>  				  unsigned int cmd, unsigned long arg);
>>> -
>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>
>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>> index a8c89fe58f97..d0978ee10fc7 100644
>>> --- a/include/uapi/linux/blkzoned.h
>>> +++ b/include/uapi/linux/blkzoned.h
>>> @@ -155,6 +155,7 @@ enum blk_zone_action {
>>>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>>>  };
>>>
>>>  /**
>>>
>>
>> As mentioned in previous email, the usefulness of this is dubious. Please
>> elaborate in the commit message. Sure NVMe ZNS defines this and we can support
>> it. But without a good use case, what is the point ?
> 
> Use case is to transition zones in read-only state to offline when we
> are done moving valid data. It is easier to explicitly manage zones
> that are not usable by keeping them all in the offline state.

Then adding a simple BLKZONEOFFLINE ioctl, similar to open, close, finish and
reset, would be enough. No need for all the new zone management ioctl with flags
plumbing.

> 
>>
>> scsi SD driver will return BLK_STS_NOTSUPP if this offlining is sent to a
>> ZBC/ZAC drive. Not nice. Having a sysfs attribute "max_offline_zone_sectors" or
>> the like to indicate support by the device or not would be nicer.
> 
> We can do that.
> 
>>
>> Does offlining ALL zones make any sense ? Because this patch does not prevent the
>> use of the REQ_ZONE_ALL flags introduced in patch 2. Probably not a good idea to
>> allow offlining all zones, no ?
> 
> AFAIK the transition to offline is only valid when coming from a
> read-only state. I did think of adding a check, but I can see that other
> transitions go directly to the driver and then the device, so I decided
> to follow the same model. If you think it is better, we can add the
> check.

My point was that the REQ_ZONE_ALL flag would make no sense for offlining zones
but this patch does not have anything checking that. There is no point in
sending a command that is known to be incorrect to the drive...
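
A sketch of the kind of check being asked for, under stated assumptions:
the action and flag values are copied from the quoted patches, the struct
mirrors the proposed struct blk_zone_mgmt, and returning -EINVAL for
"offline + select all" is my own choice for illustration. None of the
toy_* names exist in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of values from the quoted patches; not a merged UAPI. */
enum toy_zone_action {
	TOY_ZONE_MGMT_CLOSE	= 0x1,
	TOY_ZONE_MGMT_FINISH	= 0x2,
	TOY_ZONE_MGMT_OPEN	= 0x3,
	TOY_ZONE_MGMT_RESET	= 0x4,
	TOY_ZONE_MGMT_OFFLINE	= 0x5,
};

#define TOY_ZONE_SELECT_ALL	(1u << 0)

/* Same field layout as the proposed struct blk_zone_mgmt. */
struct toy_zone_mgmt {
	uint8_t  action;
	uint8_t  resv3[3];
	uint32_t flags;
	uint64_t sector;
	uint64_t nr_sectors;
	uint64_t resv31;
};

/*
 * Validate a request before anything is sent to the device.
 * Returns 0 if acceptable, -ENOTTY for an unknown action (as in the
 * patch), and -EINVAL when "select all" is combined with offline.
 */
static int toy_zone_mgmt_check(const struct toy_zone_mgmt *zmgmt)
{
	assert(zmgmt != NULL);

	switch (zmgmt->action) {
	case TOY_ZONE_MGMT_CLOSE:
	case TOY_ZONE_MGMT_FINISH:
	case TOY_ZONE_MGMT_OPEN:
	case TOY_ZONE_MGMT_RESET:
		return 0;
	case TOY_ZONE_MGMT_OFFLINE:
		if (zmgmt->flags & TOY_ZONE_SELECT_ALL)
			return -EINVAL;
		return 0;
	default:
		return -ENOTTY;
	}
}
```

Rejecting the combination up front avoids issuing a command that is
already known to be wrong for the drive.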

> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-26  6:38       ` Damien Le Moal
@ 2020-06-26  6:49         ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26  6:49 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 06:38, Damien Le Moal wrote:
>On 2020/06/26 15:03, Javier González wrote:
>> On 26.06.2020 01:45, Damien Le Moal wrote:
>>> On 2020/06/25 21:22, Javier González wrote:
>>>> From: Javier González <javier.gonz@samsung.com>
>>>>
>>>> Add zone attributes field to the blk_zone structure. Use ZNS attributes
>>>> as base for zoned block devices in general.
>>>>
>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>> ---
>>>>  drivers/nvme/host/zns.c       |  1 +
>>>>  include/uapi/linux/blkzoned.h | 13 ++++++++++++-
>>>>  2 files changed, 13 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>> index 258d03610cc0..7d8381fe7665 100644
>>>> --- a/drivers/nvme/host/zns.c
>>>> +++ b/drivers/nvme/host/zns.c
>>>> @@ -195,6 +195,7 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>>>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>>>>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>>>>  	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>>>> +	zone.attr = entry->za;
>>>>
>>>>  	return cb(&zone, idx, data);
>>>>  }
>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>> index 0c49a4b2ce5d..2e43a00e3425 100644
>>>> --- a/include/uapi/linux/blkzoned.h
>>>> +++ b/include/uapi/linux/blkzoned.h
>>>> @@ -82,6 +82,16 @@ enum blk_zone_report_flags {
>>>>  	BLK_ZONE_REP_CAPACITY	= (1 << 0),
>>>>  };
>>>>
>>>> +/**
>>>> + * Zone Attributes
>>>
>>> This is a user interface file. Please document the meaning of each attribute.
>>>
>>
>> Sure.
>>
>>>> + */
>>>> +enum blk_zone_attr {
>>>> +	BLK_ZONE_ATTR_ZFC	= 1 << 0,
>>>> +	BLK_ZONE_ATTR_FZR	= 1 << 1,
>>>> +	BLK_ZONE_ATTR_RZR	= 1 << 2,
>>>> +	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
>>> These are ZNS specific, right ? Integrating the 2 ZBC/ZAC attributes in this
>>> list would be nice, namely non_seq and reset. That will imply patching sd.c.
>>>
>>
>> Of course. I will look at non_seq and reset. Any other that should go
>> in here?
>
>See ZBC specs. These are the only 2 zone attributes defined.

Good. We will add this in V2.

Javier


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-26  6:13       ` Javier González
@ 2020-06-26  6:49         ` Damien Le Moal
  2020-06-26  6:55           ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  6:49 UTC (permalink / raw)
  To: Javier González
  Cc: Keith Busch, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/26 15:13, Javier González wrote:
> On 26.06.2020 00:04, Damien Le Moal wrote:
>> On 2020/06/26 6:49, Keith Busch wrote:
>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>  1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>> index 7d8381fe7665..de806788a184 100644
>>>> --- a/drivers/nvme/host/zns.c
>>>> +++ b/drivers/nvme/host/zns.c
>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>  		sector += ns->zsze * nz;
>>>>  	}
>>>>
>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>> +				zone_idx, ns->nr_zones);
>>>> +		ret = -EINVAL;
>>>> +		goto out_free;
>>>> +	}
>>>> +
>>>>  	ret = zone_idx;
>>>
>>> nr_zones is unsigned, so it's never < 0.
>>>
>>> The API we're providing doesn't require zone_idx equal the namespace's
>>> nr_zones at the end, though. A subset of the total number of zones can
>>> be requested here.
>>>
> 
> I did see nr_zones coming with -1; guess it is my compiler.

See include/linux/blkdev.h. -1 is:

#define BLK_ALL_ZONES  ((unsigned int)-1)

Which is documented in block/blk-zoned.c:

/**
 * blkdev_report_zones - Get zones information
 * @bdev:       Target block device
 * @sector:     Sector from which to report zones
 * @nr_zones:   Maximum number of zones to report
 * @cb:         Callback function called for each reported zone
 * @data:       Private data for the callback
 *
 * Description:
 *    Get zone information starting from the zone containing @sector for at most
 *    @nr_zones, and call @cb for each zone reported by the device.
 *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
 *    constant can be passed to @nr_zones.
 *    Returns the number of zones reported by the device, or a negative errno
 *    value in case of failure.
 *
 *    Note: The caller must use memalloc_noXX_save/restore() calls to control
 *    memory allocations done within this function.
 */
int blkdev_report_zones(struct block_device *bdev, sector_t sector,
                        unsigned int nr_zones, report_zones_cb cb, void *data)
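
A small standalone check of why the "-1" seen in nr_zones is not a
negative number. The macro is the definition quoted above;
report_wants_all_zones() is a made-up helper for this sketch only.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Same definition as the one quoted from include/linux/blkdev.h. */
#define BLK_ALL_ZONES	((unsigned int)-1)

/*
 * Test for the "report everything" sentinel by equality. A signed
 * comparison such as "nr_zones < 0" can never be true for an unsigned
 * argument.
 */
static bool report_wants_all_zones(unsigned int nr_zones)
{
	/* (unsigned int)-1 wraps to UINT_MAX, a large positive value. */
	assert(BLK_ALL_ZONES == UINT_MAX);

	return nr_zones == BLK_ALL_ZONES;
}
```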

> 
>>
>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>> reported zone descriptor in the current report range requested by the user,
>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>> than the total number of zones of the disk and/or start sector is > 0). So
>> zone_idx indicates the actual number of zones reported, it is not the total
> 
> I see. I believed that when nr_zones comes undefined we could
> assume that zone_idx is absolute, but I may be wrong.

No. zone_idx is *always* the index of the zone in the current report. Whatever
that report is, regardless of the report starting point and number of zones
requested. E.g. For a single zone report (nr_zones = 1), you will always see
zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
This is used for example in blk_revalidate_disk_zones() to initialize the zone
bitmaps.
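
The zone_idx semantics can be shown with a toy model of a report loop.
This is only a sketch: toy_report_zones(), toy_record_idx() and
TOY_ALL_ZONES are invented here, the device is assumed to have equally
sized zones, and the real driver code is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ALL_ZONES	((unsigned int)-1)

typedef int (*toy_report_cb)(uint64_t zone_start, unsigned int idx,
			     void *data);

/*
 * Report up to nr_zones zones starting at the zone containing @sector.
 * zone_idx always counts from 0 within this report, whatever the start.
 * Returns the number of zones reported, or the callback's error.
 */
static int toy_report_zones(unsigned int dev_zones, uint64_t zone_sectors,
			    uint64_t sector, unsigned int nr_zones,
			    toy_report_cb cb, void *data)
{
	unsigned int first, zone_idx = 0;

	assert(zone_sectors > 0 && cb != NULL);
	first = (unsigned int)(sector / zone_sectors);

	while (zone_idx < nr_zones && first + zone_idx < dev_zones) {
		uint64_t start = (uint64_t)(first + zone_idx) * zone_sectors;
		int ret = cb(start, zone_idx, data);

		if (ret)
			return ret;
		zone_idx++;
	}

	return (int)zone_idx;
}

/* Callback that records the last index it was handed. */
static int toy_record_idx(uint64_t zone_start, unsigned int idx, void *data)
{
	(void)zone_start;
	*(unsigned int *)data = idx;
	return 0;
}
```

On an 8-zone toy device, a report of all zones starting in the fourth
zone returns 5, so comparing the final zone_idx with the device's total
zone count would flag a perfectly valid report as inconsistent.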

> Does it make sense to support this check with an additional counter and
> an explicit nr_zones initialization when undefined, or would you
> prefer to just remove it as Matias suggested?

The check is not needed at all.

If the device is buggy and reports more zones than the device capacity or any
other bugs, the driver can catch that when it processes the report.
blk_revalidate_disk_zones() also has many checks.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 1/6] block: introduce IOCTL for zone mgmt
  2020-06-26  6:37       ` Damien Le Moal
@ 2020-06-26  6:51         ` Javier González
  2020-06-26  7:03           ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:51 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 06:37, Damien Le Moal wrote:
>On 2020/06/26 15:01, Javier González wrote:
>> On 26.06.2020 01:17, Damien Le Moal wrote:
>>> On 2020/06/25 21:22, Javier González wrote:
>>>> From: Javier González <javier.gonz@samsung.com>
>>>>
>>>> The current IOCTL interface for zone management is limited by struct
>>>> blk_zone_range, which is unfortunately not extensible. In particular, the
>>>> lack of flags is problematic for ZNS zoned devices.
>>>>
>>>> This new IOCTL is designed to be a superset of the current one, with
>>>> support for flags and bits for extensibility.
>>>>
>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>> ---
>>>>  block/blk-zoned.c             | 56 ++++++++++++++++++++++++++++++++++-
>>>>  block/ioctl.c                 |  2 ++
>>>>  include/linux/blkdev.h        |  9 ++++++
>>>>  include/uapi/linux/blkzoned.h | 33 +++++++++++++++++++++
>>>>  4 files changed, 99 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>> index 81152a260354..e87c60004dc5 100644
>>>> --- a/block/blk-zoned.c
>>>> +++ b/block/blk-zoned.c
>>>> @@ -322,7 +322,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>>>   * BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE ioctl processing.
>>>>   * Called from blkdev_ioctl.
>>>>   */
>>>> -int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>> +int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  			   unsigned int cmd, unsigned long arg)
>>>>  {
>>>>  	void __user *argp = (void __user *)arg;
>>>> @@ -370,6 +370,60 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				GFP_KERNEL);
>>>>  }
>>>>
>>>> +/*
>>>> + * Zone management ioctl processing. Extension of blkdev_zone_ops_ioctl(), with
>>>> + * blk_zone_mgmt structure.
>>>> + *
>>>> + * Called from blkdev_ioctl.
>>>> + */
>>>> +int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>> +			   unsigned int cmd, unsigned long arg)
>>>> +{
>>>> +	void __user *argp = (void __user *)arg;
>>>> +	struct request_queue *q;
>>>> +	struct blk_zone_mgmt zmgmt;
>>>> +	enum req_opf op;
>>>> +
>>>> +	if (!argp)
>>>> +		return -EINVAL;
>>>> +
>>>> +	q = bdev_get_queue(bdev);
>>>> +	if (!q)
>>>> +		return -ENXIO;
>>>> +
>>>> +	if (!blk_queue_is_zoned(q))
>>>> +		return -ENOTTY;
>>>> +
>>>> +	if (!capable(CAP_SYS_ADMIN))
>>>> +		return -EACCES;
>>>> +
>>>> +	if (!(mode & FMODE_WRITE))
>>>> +		return -EBADF;
>>>> +
>>>> +	if (copy_from_user(&zmgmt, argp, sizeof(struct blk_zone_mgmt)))
>>>> +		return -EFAULT;
>>>> +
>>>> +	switch (zmgmt.action) {
>>>> +	case BLK_ZONE_MGMT_CLOSE:
>>>> +		op = REQ_OP_ZONE_CLOSE;
>>>> +		break;
>>>> +	case BLK_ZONE_MGMT_FINISH:
>>>> +		op = REQ_OP_ZONE_FINISH;
>>>> +		break;
>>>> +	case BLK_ZONE_MGMT_OPEN:
>>>> +		op = REQ_OP_ZONE_OPEN;
>>>> +		break;
>>>> +	case BLK_ZONE_MGMT_RESET:
>>>> +		op = REQ_OP_ZONE_RESET;
>>>> +		break;
>>>> +	default:
>>>> +		return -ENOTTY;
>>>> +	}
>>>> +
>>>> +	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>>>> +				GFP_KERNEL);
>>>> +}
>>>> +
>>>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>>>  						   unsigned int nr_zones)
>>>>  {
>>>> diff --git a/block/ioctl.c b/block/ioctl.c
>>>> index bdb3bbb253d9..0ea29754e7dd 100644
>>>> --- a/block/ioctl.c
>>>> +++ b/block/ioctl.c
>>>> @@ -514,6 +514,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  	case BLKOPENZONE:
>>>>  	case BLKCLOSEZONE:
>>>>  	case BLKFINISHZONE:
>>>> +		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>>> +	case BLKMGMTZONE:
>>>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>>>  	case BLKGETZONESZ:
>>>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index 8fd900998b4e..bd8521f94dc4 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -368,6 +368,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>>>
>>>>  extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				     unsigned int cmd, unsigned long arg);
>>>> +extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>> +				  unsigned int cmd, unsigned long arg);
>>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				  unsigned int cmd, unsigned long arg);
>>>>
>>>> @@ -385,6 +387,13 @@ static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
>>>>  	return -ENOTTY;
>>>>  }
>>>>
>>>> +
>>>> +static inline int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>> +					unsigned int cmd, unsigned long arg)
>>>> +{
>>>> +	return -ENOTTY;
>>>> +}
>>>> +
>>>>  static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>>>  					 fmode_t mode, unsigned int cmd,
>>>>  					 unsigned long arg)
>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>> index 42c3366cc25f..07b5fde21d9f 100644
>>>> --- a/include/uapi/linux/blkzoned.h
>>>> +++ b/include/uapi/linux/blkzoned.h
>>>> @@ -142,6 +142,38 @@ struct blk_zone_range {
>>>>  	__u64		nr_sectors;
>>>>  };
>>>>
>>>> +/**
>>>> + * enum blk_zone_action - Zone state transitions managed from user-space
>>>> + *
>>>> + * @BLK_ZONE_MGMT_CLOSE: Transition to Closed state
>>>> + * @BLK_ZONE_MGMT_FINISH: Transition to Finish state
>>>> + * @BLK_ZONE_MGMT_OPEN: Transition to Open state
>>>> + * @BLK_ZONE_MGMT_RESET: Transition to Reset state
>>>> + */
>>>> +enum blk_zone_action {
>>>> +	BLK_ZONE_MGMT_CLOSE	= 0x1,
>>>> +	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>> +	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>> +	BLK_ZONE_MGMT_RESET	= 0x4,
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct blk_zone_mgmt - Extended zoned management
>>>> + *
>>>> + * @action: Zone action as described in enum blk_zone_action
>>>> + * @flags: Flags for the action
>>>> + * @sector: Starting sector of the first zone to operate on
>>>> + * @nr_sectors: Total number of sectors of all zones to operate on
>>>> + */
>>>> +struct blk_zone_mgmt {
>>>> +	__u8		action;
>>>> +	__u8		resv3[3];
>>>> +	__u32		flags;
>>>> +	__u64		sector;
>>>> +	__u64		nr_sectors;
>>>> +	__u64		resv31;
>>>> +};
>>>> +
>>>>  /**
>>>>   * Zoned block device ioctl's:
>>>>   *
>>>> @@ -166,5 +198,6 @@ struct blk_zone_range {
>>>>  #define BLKOPENZONE	_IOW(0x12, 134, struct blk_zone_range)
>>>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>>>> +#define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>>>>
>>>>  #endif /* _UAPI_BLKZONED_H */
>>>>
>>>
>>> Without defining what the flags can be, it is hard to judge what will change
>>> from the current distinct ioctls.
>>>
>>
>> The first flag is the one to select all. Down the line we have other
>> modifiers that make sense, but it is true that it is public yet.
>
>You mean *not* public ?

Yes...

>
>>
>> Would you like to wait until then or is it an option to revise the IOCTL
>> now?
>
>Yes. Wait until it is actually needed. Adding code that has no users makes it
>impossible to test so not acceptable. As for the "all zones" flag, I already
>commented about it.

Ok. We will have this in the backlog then.

It would be great if you and Matias could comment on it if you have
ideas on how to improve it. Happy to set up a branch somewhere to keep a
patchset with this functionality.

Javier


* Re: [PATCH 2/6] block: add support for selecting all zones
  2020-06-26  6:35       ` Damien Le Moal
@ 2020-06-26  6:52         ` Javier González
  2020-06-26  7:06           ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:52 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 06:35, Damien Le Moal wrote:
>On 2020/06/26 14:58, Javier González wrote:
>> On 26.06.2020 01:27, Damien Le Moal wrote:
>>> On 2020/06/25 21:22, Javier González wrote:
>>>> From: Javier González <javier.gonz@samsung.com>
>>>>
>>>> Add flag to allow selecting all zones on a single zone management
>>>> operation
>>>>
>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>> ---
>>>>  block/blk-zoned.c             | 3 +++
>>>>  include/linux/blk_types.h     | 3 ++-
>>>>  include/uapi/linux/blkzoned.h | 9 +++++++++
>>>>  3 files changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>> index e87c60004dc5..29194388a1bb 100644
>>>> --- a/block/blk-zoned.c
>>>> +++ b/block/blk-zoned.c
>>>> @@ -420,6 +420,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  		return -ENOTTY;
>>>>  	}
>>>>
>>>> +	if (zmgmt.flags & BLK_ZONE_SELECT_ALL)
>>>> +		op |= REQ_ZONE_ALL;
>>>> +
>>>>  	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>>>>  				GFP_KERNEL);
>>>>  }
>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>> index ccb895f911b1..16b57fb2b99c 100644
>>>> --- a/include/linux/blk_types.h
>>>> +++ b/include/linux/blk_types.h
>>>> @@ -351,6 +351,7 @@ enum req_flag_bits {
>>>>  	 * work item to avoid such priority inversions.
>>>>  	 */
>>>>  	__REQ_CGROUP_PUNT,
>>>> +	__REQ_ZONE_ALL,		/* apply zone operation to all zones */
>>>>
>>>>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>>>>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
>>>> @@ -378,7 +379,7 @@ enum req_flag_bits {
>>>>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>>>>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
>>>>  #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
>>>> -
>>>> +#define REQ_ZONE_ALL		(1ULL << __REQ_ZONE_ALL)
>>>>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>>>>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
>>>>
>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>> index 07b5fde21d9f..a8c89fe58f97 100644
>>>> --- a/include/uapi/linux/blkzoned.h
>>>> +++ b/include/uapi/linux/blkzoned.h
>>>> @@ -157,6 +157,15 @@ enum blk_zone_action {
>>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>>>  };
>>>>
>>>> +/**
>>>> + * enum blk_zone_mgmt_flags - Flags for blk_zone_mgmt
>>>> + *
>>>> + * BLK_ZONE_SELECT_ALL: Select all zones for current zone action
>>>> + */
>>>> +enum blk_zone_mgmt_flags {
>>>> +	BLK_ZONE_SELECT_ALL	= 1 << 0,
>>>> +};
>>>> +
>>>>  /**
>>>>   * struct blk_zone_mgmt - Extended zoned management
>>>>   *
>>>>
>>>
>>> NACK.
>>>
>>> Details:
>>> 1) REQ_OP_ZONE_RESET together with REQ_ZONE_ALL is the same as
>>> REQ_OP_ZONE_RESET_ALL, isn't it ? You are duplicating a functionality that
>>> already exists.
>>> 2) The patch introduces REQ_ZONE_ALL at the block layer only without defining
>>> how it ties into SCSI and NVMe driver use of it. Is REQ_ZONE_ALL indicating that
>>> the zone management commands are to be executed with the ALL bit set ? If yes,
>>> that will break device-mapper. See the special code for handling
>>> REQ_OP_ZONE_RESET_ALL. That code is in place for a reason: the target block
>>> device may not be an entire physical device. In that case, applying a zone
>>> management command to all zones of the physical drive is wrong.
>>> 3) REQ_ZONE_ALL seems completely equivalent to specifying a sector range of [0
>>> .. drive capacity]. So what is the point ? The current interface handles that.
>>> That is how we chose between REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL right now.
>>> 4) Without any in-kernel user, I do not see the point. And for applications, I
>>> do not see any good use case for doing open all, close all, offline all or
>>> finish all. If you have any such good use case, please elaborate.
>>>
>>
>> The main use is reset all, but without having to look through all zones,
>> as that imposes an overhead when we have a large number of zones. Having
>> the possibility to offload it to HW is more efficient.
>>
>> I had not thought about the device mapper use case. Would it be an
>> option to translate this into REQ_OP_ZONE_RESET_ALL when we have a
>> device mapper (or any other case where this might break) and then let
>> the bit go to the driver if it applies to the whole device?
>
>But REQ_OP_ZONE_RESET_ALL is already implemented and supported and will reset
>all zones of a drive using a single command if the ioctl is called for the
>entire sector range of a physical drive. For device mapper with a partial
>mapping, the per zone reset loop will be used. If you have no other use case for
>the REQ_ZONE_ALL flag, what is the point here ? Reset is already optimized for
>the all zones case.

OK. I might have missed this. I thought we were sending several commands
instead of a single reset with the bit. I will check again. Thanks for
pointing this out.

Javier
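
The range-based choice Damien describes can be sketched in plain C. This is
only an illustration of the idea, not the kernel's code: the helper name and
its parameters are assumptions made for the example. A "reset all" command is
chosen only when the device supports it and the requested range spans the
whole block device, which is what keeps a partial device-mapper target from
resetting zones it does not own.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): decide whether a zone reset
 * request may be issued as one "reset all" command.
 *
 * sector, nr_sectors: requested range, in 512-byte sectors
 * capacity:           size of the target block device, in sectors
 * has_reset_all:      whether the device advertises a reset-all command
 */
static bool can_use_reset_all(uint64_t sector, uint64_t nr_sectors,
			      uint64_t capacity, bool has_reset_all)
{
	if (!has_reset_all)
		return false;

	/* Only a range covering the entire device qualifies. */
	return sector == 0 && nr_sectors == capacity;
}
```

Any other range falls back to one reset per zone, as in the per-zone loop
mentioned above.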


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-26  6:49         ` Damien Le Moal
@ 2020-06-26  6:55           ` Javier González
  2020-06-26  7:09             ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:55 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Keith Busch, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 26.06.2020 06:49, Damien Le Moal wrote:
>On 2020/06/26 15:13, Javier González wrote:
>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>>  1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>> index 7d8381fe7665..de806788a184 100644
>>>>> --- a/drivers/nvme/host/zns.c
>>>>> +++ b/drivers/nvme/host/zns.c
>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>  		sector += ns->zsze * nz;
>>>>>  	}
>>>>>
>>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>> +				zone_idx, ns->nr_zones);
>>>>> +		ret = -EINVAL;
>>>>> +		goto out_free;
>>>>> +	}
>>>>> +
>>>>>  	ret = zone_idx;
>>>>
>>>> nr_zones is unsigned, so it's never < 0.
>>>>
>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>> be requested here.
>>>>
>>
>> I did see nr_zones coming with -1; guess it is my compiler.
>
>See include/linux/blkdev.h. -1 is:
>
>#define BLK_ALL_ZONES  ((unsigned int)-1)
>
>Which is documented in block/blk-zoned.c:
>
>/**
> * blkdev_report_zones - Get zones information
> * @bdev:       Target block device
> * @sector:     Sector from which to report zones
> * @nr_zones:   Maximum number of zones to report
> * @cb:         Callback function called for each reported zone
> * @data:       Private data for the callback
> *
> * Description:
> *    Get zone information starting from the zone containing @sector for at most
> *    @nr_zones, and call @cb for each zone reported by the device.
> *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
> *    constant can be passed to @nr_zones.
> *    Returns the number of zones reported by the device, or a negative errno
> *    value in case of failure.
> *
> *    Note: The caller must use memalloc_noXX_save/restore() calls to control
> *    memory allocations done within this function.
> */
>int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>                        unsigned int nr_zones, report_zones_cb cb, void *data)
>
>>
>>>
>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>> reported zone descriptor in the current report range requested by the user,
>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>> than the total number of zones of the disk and/or start sector is > 0). So
>>> zone_idx indicates the actual number of zones reported, it is not the total
>>
>> I see. I believed that when nr_zones comes undefined we could assume
>> that zone_idx is absolute, but I may be wrong.
>
>No. zone_idx is *always* the index of the zone in the current report. Whatever
>that report is, regardless of the report starting point and number of zones
>requested. E.g. For a single zone report (nr_zones = 1), you will always see
>zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>This is used for example in blk_revalidate_disk_zones() to initialize the zone
>bitmaps.
>
>> Does it make sense to support this check with an additional counter and
>> an explicit nr_zones initialization when undefined, or do you
>> prefer to just remove it as Matias suggested?
>
>The check is not needed at all.
>
>If the device is buggy and reports more zones than the device capacity or any
>other bugs, the driver can catch that when it processes the report.
>blk_revalidate_disk_zones() also has many checks.

I have managed to create a QEMU ZNS device that gave me a headache with
a little bit of extra capacity that triggered an additional zone report.
This was the motivation for the patch.

I will look at the checks to cover this case too then.

Thanks,
Javier
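
The two points raised in this subthread (nr_zones is unsigned, so "-1" is
really BLK_ALL_ZONES, and zone_idx counts zones within the current report) can
be illustrated with a small userspace sketch. The macro below mirrors the
BLK_ALL_ZONES definition quoted above; the helper names are hypothetical and
exist only for this example.

```c
#include <stdbool.h>

/* Mirrors the BLK_ALL_ZONES definition quoted from include/linux/blkdev.h. */
#define ALL_ZONES ((unsigned int)-1)

/*
 * Hypothetical helper: an unsigned nr_zones can never be "< 0", so the
 * "report everything" request has to be detected by comparing with the
 * all-ones constant.
 */
static bool wants_all_zones(unsigned int nr_zones)
{
	return nr_zones == ALL_ZONES;
}

/*
 * Hypothetical helper: upper bound on the number of zones a report can
 * return, given how many zones exist from the start sector onward. The
 * count is relative to the report, not an absolute zone number.
 */
static unsigned int max_reported_zones(unsigned int nr_zones,
				       unsigned int zones_from_start)
{
	return nr_zones < zones_from_start ? nr_zones : zones_from_start;
}
```

With these definitions a single-zone report is bounded by 1 regardless of the
start sector, while an ALL_ZONES request is bounded by the zones remaining
from that sector.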


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  6:42       ` Damien Le Moal
@ 2020-06-26  6:58         ` Javier González
  2020-06-26  7:17           ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  6:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 06:42, Damien Le Moal wrote:
>On 2020/06/26 15:09, Javier González wrote:
>> On 26.06.2020 01:34, Damien Le Moal wrote:
>>> On 2020/06/25 21:22, Javier González wrote:
>>>> From: Javier González <javier.gonz@samsung.com>
>>>>
>>>> Add support for offline transition on the zoned block device using the
>>>> new zone management IOCTL
>>>>
>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>> ---
>>>>  block/blk-core.c              | 2 ++
>>>>  block/blk-zoned.c             | 3 +++
>>>>  drivers/nvme/host/core.c      | 3 +++
>>>>  include/linux/blk_types.h     | 3 +++
>>>>  include/linux/blkdev.h        | 1 -
>>>>  include/uapi/linux/blkzoned.h | 1 +
>>>>  6 files changed, 12 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 03252af8c82c..589cbdacc5ec 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>>>>  	REQ_OP_NAME(ZONE_CLOSE),
>>>>  	REQ_OP_NAME(ZONE_FINISH),
>>>>  	REQ_OP_NAME(ZONE_APPEND),
>>>> +	REQ_OP_NAME(ZONE_OFFLINE),
>>>>  	REQ_OP_NAME(WRITE_SAME),
>>>>  	REQ_OP_NAME(WRITE_ZEROES),
>>>>  	REQ_OP_NAME(SCSI_IN),
>>>> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>>>>  	case REQ_OP_ZONE_OPEN:
>>>>  	case REQ_OP_ZONE_CLOSE:
>>>>  	case REQ_OP_ZONE_FINISH:
>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>  		if (!blk_queue_is_zoned(q))
>>>>  			goto not_supported;
>>>>  		break;
>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>> index 29194388a1bb..704fc15813d1 100644
>>>> --- a/block/blk-zoned.c
>>>> +++ b/block/blk-zoned.c
>>>> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  	case BLK_ZONE_MGMT_RESET:
>>>>  		op = REQ_OP_ZONE_RESET;
>>>>  		break;
>>>> +	case BLK_ZONE_MGMT_OFFLINE:
>>>> +		op = REQ_OP_ZONE_OFFLINE;
>>>> +		break;
>>>>  	default:
>>>>  		return -ENOTTY;
>>>>  	}
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index f1215523792b..5b95c81d2a2d 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>>>  	case REQ_OP_ZONE_FINISH:
>>>>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>>>  		break;
>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
>>>> +		break;
>>>>  	case REQ_OP_WRITE_ZEROES:
>>>>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>>>>  		break;
>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>> index 16b57fb2b99c..b3921263c3dd 100644
>>>> --- a/include/linux/blk_types.h
>>>> +++ b/include/linux/blk_types.h
>>>> @@ -316,6 +316,8 @@ enum req_opf {
>>>>  	REQ_OP_ZONE_FINISH	= 12,
>>>>  	/* write data at the current zone write pointer */
>>>>  	REQ_OP_ZONE_APPEND	= 13,
>>>> +	/* Transition a zone to offline */
>>>> +	REQ_OP_ZONE_OFFLINE	= 14,
>>>>
>>>>  	/* SCSI passthrough using struct scsi_request */
>>>>  	REQ_OP_SCSI_IN		= 32,
>>>> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>>>>  	case REQ_OP_ZONE_OPEN:
>>>>  	case REQ_OP_ZONE_CLOSE:
>>>>  	case REQ_OP_ZONE_FINISH:
>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>  		return true;
>>>>  	default:
>>>>  		return false;
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bd8521f94dc4..8308d8a3720b 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				  unsigned int cmd, unsigned long arg);
>>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>  				  unsigned int cmd, unsigned long arg);
>>>> -
>>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>>
>>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>> index a8c89fe58f97..d0978ee10fc7 100644
>>>> --- a/include/uapi/linux/blkzoned.h
>>>> +++ b/include/uapi/linux/blkzoned.h
>>>> @@ -155,6 +155,7 @@ enum blk_zone_action {
>>>>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>>> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>>>>  };
>>>>
>>>>  /**
>>>>
>>>
>>> As mentioned in previous email, the usefulness of this is dubious. Please
>>> elaborate in the commit message. Sure NVMe ZNS defines this and we can support
>>> it. But without a good use case, what is the point ?
>>
>> Use case is to transition zones in read-only state to offline when we
>> are done moving valid data. It is easier to explicitly manage zones
>> that are not usable by having them all in the offline state.
>
>Then adding a simple BLKZONEOFFLINE ioctl, similar to open, close, finish and
>reset, would be enough. No need for all the new zone management ioctl with flags
>plumbing.

Ok. We can add that then.

Note that zone management is not motivated by this use case at all, but
it made sense to implement it here instead of as a new BLKZONEOFFLINE
IOCTL as ZAC/ZBC users will not be able to use it either way.

>
>>
>>>
>>> scsi SD driver will return BLK_STS_NOTSUPP if this offlining is sent to a
>>> ZBC/ZAC drive. Not nice. Having a sysfs attribute "max_offline_zone_sectors" or
>>> the like to indicate support by the device or not would be nicer.
>>
>> We can do that.
>>
>>>
>>> Does offlining ALL zones make any sense ? Because this patch does not prevent the
>>> use of the REQ_ZONE_ALL flag introduced in patch 2. Probably not a good idea to
>>> allow offlining all zones, no ?
>>
>> AFAIK the transition to offline is only valid when coming from a
>> read-only state. I did think of adding a check, but I can see that other
>> transitions go directly to the driver and then the device, so I decided
>> to follow the same model. If you think it is better, we can add the
>> check.
>
>My point was that the REQ_ZONE_ALL flag would make no sense for offlining zones
>but this patch does not have anything checking that. There is no point in
>sending a command that is known to be incorrect to the drive...

I will add some extra checks then to fail early. I assume these should
be in the NVMe driver as it is NVMe-specific, right?

Javier
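
A fail-early check of the kind discussed here could look roughly like the
sketch below. It is a userspace illustration only: the constants copy the
values proposed in patches 2 and 3 of this series (they are not an upstream
UAPI), and the function name is an assumption made for the example. Where such
a check would live (block layer or NVMe driver) is the open question above and
is not settled by this sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Values copied from the patches under review; not an upstream UAPI. */
enum {
	ZONE_MGMT_OFFLINE = 0x5,	/* BLK_ZONE_MGMT_OFFLINE in patch 3 */
	ZONE_SELECT_ALL   = 1 << 0,	/* BLK_ZONE_SELECT_ALL in patch 2 */
};

/*
 * Hypothetical validation helper: refuse to combine the offline action
 * with the select-all flag before anything is sent to the device.
 * Returns 0 when the combination is acceptable, -EINVAL otherwise.
 */
static int validate_zone_mgmt(uint8_t action, uint32_t flags)
{
	if (action == ZONE_MGMT_OFFLINE && (flags & ZONE_SELECT_ALL))
		return -EINVAL;

	return 0;
}
```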


* Re: [PATCH 1/6] block: introduce IOCTL for zone mgmt
  2020-06-26  6:51         ` Javier González
@ 2020-06-26  7:03           ` Damien Le Moal
  2020-06-26  7:08             ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  7:03 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 15:51, Javier González wrote:
> On 26.06.2020 06:37, Damien Le Moal wrote:
>> On 2020/06/26 15:01, Javier González wrote:
>>> On 26.06.2020 01:17, Damien Le Moal wrote:
>>>> On 2020/06/25 21:22, Javier González wrote:
>>>>> From: Javier González <javier.gonz@samsung.com>
>>>>>
>>>>> The current IOCTL interface for zone management is limited by struct
>>>>> blk_zone_range, which is unfortunately not extensible. In particular, the
>>>>> lack of flags is problematic for ZNS zoned devices.
>>>>>
>>>>> This new IOCTL is designed to be a superset of the current one, with
>>>>> support for flags and bits for extensibility.
>>>>>
>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>>> ---
>>>>>  block/blk-zoned.c             | 56 ++++++++++++++++++++++++++++++++++-
>>>>>  block/ioctl.c                 |  2 ++
>>>>>  include/linux/blkdev.h        |  9 ++++++
>>>>>  include/uapi/linux/blkzoned.h | 33 +++++++++++++++++++++
>>>>>  4 files changed, 99 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>>> index 81152a260354..e87c60004dc5 100644
>>>>> --- a/block/blk-zoned.c
>>>>> +++ b/block/blk-zoned.c
>>>>> @@ -322,7 +322,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>   * BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE ioctl processing.
>>>>>   * Called from blkdev_ioctl.
>>>>>   */
>>>>> -int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>> +int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  			   unsigned int cmd, unsigned long arg)
>>>>>  {
>>>>>  	void __user *argp = (void __user *)arg;
>>>>> @@ -370,6 +370,60 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  				GFP_KERNEL);
>>>>>  }
>>>>>
>>>>> +/*
>>>>> + * Zone management ioctl processing. Extension of blkdev_zone_ops_ioctl(), with
>>>>> + * blk_zone_mgmt structure.
>>>>> + *
>>>>> + * Called from blkdev_ioctl.
>>>>> + */
>>>>> +int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>> +			   unsigned int cmd, unsigned long arg)
>>>>> +{
>>>>> +	void __user *argp = (void __user *)arg;
>>>>> +	struct request_queue *q;
>>>>> +	struct blk_zone_mgmt zmgmt;
>>>>> +	enum req_opf op;
>>>>> +
>>>>> +	if (!argp)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	q = bdev_get_queue(bdev);
>>>>> +	if (!q)
>>>>> +		return -ENXIO;
>>>>> +
>>>>> +	if (!blk_queue_is_zoned(q))
>>>>> +		return -ENOTTY;
>>>>> +
>>>>> +	if (!capable(CAP_SYS_ADMIN))
>>>>> +		return -EACCES;
>>>>> +
>>>>> +	if (!(mode & FMODE_WRITE))
>>>>> +		return -EBADF;
>>>>> +
>>>>> +	if (copy_from_user(&zmgmt, argp, sizeof(struct blk_zone_mgmt)))
>>>>> +		return -EFAULT;
>>>>> +
>>>>> +	switch (zmgmt.action) {
>>>>> +	case BLK_ZONE_MGMT_CLOSE:
>>>>> +		op = REQ_OP_ZONE_CLOSE;
>>>>> +		break;
>>>>> +	case BLK_ZONE_MGMT_FINISH:
>>>>> +		op = REQ_OP_ZONE_FINISH;
>>>>> +		break;
>>>>> +	case BLK_ZONE_MGMT_OPEN:
>>>>> +		op = REQ_OP_ZONE_OPEN;
>>>>> +		break;
>>>>> +	case BLK_ZONE_MGMT_RESET:
>>>>> +		op = REQ_OP_ZONE_RESET;
>>>>> +		break;
>>>>> +	default:
>>>>> +		return -ENOTTY;
>>>>> +	}
>>>>> +
>>>>> +	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>>>>> +				GFP_KERNEL);
>>>>> +}
>>>>> +
>>>>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>>>>  						   unsigned int nr_zones)
>>>>>  {
>>>>> diff --git a/block/ioctl.c b/block/ioctl.c
>>>>> index bdb3bbb253d9..0ea29754e7dd 100644
>>>>> --- a/block/ioctl.c
>>>>> +++ b/block/ioctl.c
>>>>> @@ -514,6 +514,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  	case BLKOPENZONE:
>>>>>  	case BLKCLOSEZONE:
>>>>>  	case BLKFINISHZONE:
>>>>> +		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>>>> +	case BLKMGMTZONE:
>>>>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>>>>  	case BLKGETZONESZ:
>>>>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>> index 8fd900998b4e..bd8521f94dc4 100644
>>>>> --- a/include/linux/blkdev.h
>>>>> +++ b/include/linux/blkdev.h
>>>>> @@ -368,6 +368,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>>>>
>>>>>  extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  				     unsigned int cmd, unsigned long arg);
>>>>> +extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>> +				  unsigned int cmd, unsigned long arg);
>>>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  				  unsigned int cmd, unsigned long arg);
>>>>>
>>>>> @@ -385,6 +387,13 @@ static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
>>>>>  	return -ENOTTY;
>>>>>  }
>>>>>
>>>>> +
>>>>> +static inline int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>> +					unsigned int cmd, unsigned long arg)
>>>>> +{
>>>>> +	return -ENOTTY;
>>>>> +}
>>>>> +
>>>>>  static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>>>>  					 fmode_t mode, unsigned int cmd,
>>>>>  					 unsigned long arg)
>>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>>> index 42c3366cc25f..07b5fde21d9f 100644
>>>>> --- a/include/uapi/linux/blkzoned.h
>>>>> +++ b/include/uapi/linux/blkzoned.h
>>>>> @@ -142,6 +142,38 @@ struct blk_zone_range {
>>>>>  	__u64		nr_sectors;
>>>>>  };
>>>>>
>>>>> +/**
>>>>> + * enum blk_zone_action - Zone state transitions managed from user-space
>>>>> + *
>>>>> + * @BLK_ZONE_MGMT_CLOSE: Transition to Closed state
>>>>> + * @BLK_ZONE_MGMT_FINISH: Transition to Finish state
>>>>> + * @BLK_ZONE_MGMT_OPEN: Transition to Open state
>>>>> + * @BLK_ZONE_MGMT_RESET: Transition to Reset state
>>>>> + */
>>>>> +enum blk_zone_action {
>>>>> +	BLK_ZONE_MGMT_CLOSE	= 0x1,
>>>>> +	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>>> +	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>>> +	BLK_ZONE_MGMT_RESET	= 0x4,
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * struct blk_zone_mgmt - Extended zoned management
>>>>> + *
>>>>> + * @action: Zone action as described in enum blk_zone_action
>>>>> + * @flags: Flags for the action
>>>>> + * @sector: Starting sector of the first zone to operate on
>>>>> + * @nr_sectors: Total number of sectors of all zones to operate on
>>>>> + */
>>>>> +struct blk_zone_mgmt {
>>>>> +	__u8		action;
>>>>> +	__u8		resv3[3];
>>>>> +	__u32		flags;
>>>>> +	__u64		sector;
>>>>> +	__u64		nr_sectors;
>>>>> +	__u64		resv31;
>>>>> +};
>>>>> +
>>>>>  /**
>>>>>   * Zoned block device ioctl's:
>>>>>   *
>>>>> @@ -166,5 +198,6 @@ struct blk_zone_range {
>>>>>  #define BLKOPENZONE	_IOW(0x12, 134, struct blk_zone_range)
>>>>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>>>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>>>>> +#define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>>>>>
>>>>>  #endif /* _UAPI_BLKZONED_H */
>>>>>
>>>>
>>>> Without defining what the flags can be, it is hard to judge what will change
>>>> from the current distinct ioctls.
>>>>
>>>
>>> The first flag is the one to select all. Down the line we have other
>>> modifiers that make sense, but it is true that it is public yet.
>>
>> You mean *not* public ?
> 
> Yes...
> 
>>
>>>
>>> Would you like to wait until then or is it an option to revise the IOCTL
>>> now?
>>
>> Yes. Wait until it is actually needed. Adding code that has no users makes it
>> impossible to test so not acceptable. As for the "all zones" flag, I already
>> commented about it.
> 
> Ok. We will have this in the backlog then.
> 
> It would be great if you and Matias could comment on it if you have
> ideas on how to improve it. Happy to set up a branch somewhere to keep a
> patchset with this functionality.

I sent a much simpler version of this using a REQ_ZONE_ALL flag too, but driven
by the specified sector range. That allowed resetting, opening, closing or
finishing all zones using a single command much more simply than your patch. But
as Christoph commented, the only real use case of interest for this is reset all
(e.g. for FS format). Open, close and finish all zones have no user...

Let's see first what kind of flags may be needed in the future, if at all. We
can then cook something if needed.

> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 2/6] block: add support for selecting all zones
  2020-06-26  6:52         ` Javier González
@ 2020-06-26  7:06           ` Damien Le Moal
  0 siblings, 0 replies; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  7:06 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 15:52, Javier González wrote:
> On 26.06.2020 06:35, Damien Le Moal wrote:
>> On 2020/06/26 14:58, Javier González wrote:
>>> On 26.06.2020 01:27, Damien Le Moal wrote:
>>>> On 2020/06/25 21:22, Javier González wrote:
>>>>> From: Javier González <javier.gonz@samsung.com>
>>>>>
>>>>> Add flag to allow selecting all zones on a single zone management
>>>>> operation
>>>>>
>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>>> ---
>>>>>  block/blk-zoned.c             | 3 +++
>>>>>  include/linux/blk_types.h     | 3 ++-
>>>>>  include/uapi/linux/blkzoned.h | 9 +++++++++
>>>>>  3 files changed, 14 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>>> index e87c60004dc5..29194388a1bb 100644
>>>>> --- a/block/blk-zoned.c
>>>>> +++ b/block/blk-zoned.c
>>>>> @@ -420,6 +420,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  		return -ENOTTY;
>>>>>  	}
>>>>>
>>>>> +	if (zmgmt.flags & BLK_ZONE_SELECT_ALL)
>>>>> +		op |= REQ_ZONE_ALL;
>>>>> +
>>>>>  	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>>>>>  				GFP_KERNEL);
>>>>>  }
>>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>>> index ccb895f911b1..16b57fb2b99c 100644
>>>>> --- a/include/linux/blk_types.h
>>>>> +++ b/include/linux/blk_types.h
>>>>> @@ -351,6 +351,7 @@ enum req_flag_bits {
>>>>>  	 * work item to avoid such priority inversions.
>>>>>  	 */
>>>>>  	__REQ_CGROUP_PUNT,
>>>>> +	__REQ_ZONE_ALL,		/* apply zone operation to all zones */
>>>>>
>>>>>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>>>>>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
>>>>> @@ -378,7 +379,7 @@ enum req_flag_bits {
>>>>>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>>>>>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
>>>>>  #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
>>>>> -
>>>>> +#define REQ_ZONE_ALL		(1ULL << __REQ_ZONE_ALL)
>>>>>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>>>>>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
>>>>>
>>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>>> index 07b5fde21d9f..a8c89fe58f97 100644
>>>>> --- a/include/uapi/linux/blkzoned.h
>>>>> +++ b/include/uapi/linux/blkzoned.h
>>>>> @@ -157,6 +157,15 @@ enum blk_zone_action {
>>>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>>>>  };
>>>>>
>>>>> +/**
>>>>> + * enum blk_zone_mgmt_flags - Flags for blk_zone_mgmt
>>>>> + *
>>>>> + * BLK_ZONE_SELECT_ALL: Select all zones for current zone action
>>>>> + */
>>>>> +enum blk_zone_mgmt_flags {
>>>>> +	BLK_ZONE_SELECT_ALL	= 1 << 0,
>>>>> +};
>>>>> +
>>>>>  /**
>>>>>   * struct blk_zone_mgmt - Extended zoned management
>>>>>   *
>>>>>
>>>>
>>>> NACK.
>>>>
>>>> Details:
>>>> 1) REQ_OP_ZONE_RESET together with REQ_ZONE_ALL is the same as
>>>> REQ_OP_ZONE_RESET_ALL, isn't it ? You are duplicating a functionality that
>>>> already exists.
>>>> 2) The patch introduces REQ_ZONE_ALL at the block layer only without defining
>>>> how it ties into SCSI and NVMe driver use of it. Is REQ_ZONE_ALL indicating that
>>>> the zone management commands are to be executed with the ALL bit set ? If yes,
>>>> that will break device-mapper. See the special code for handling
>>>> REQ_OP_ZONE_RESET_ALL. That code is in place for a reason: the target block
>>>> device may not be an entire physical device. In that case, applying a zone
>>>> management command to all zones of the physical drive is wrong.
>>>> 3) REQ_ZONE_ALL seems completely equivalent to specifying a sector range of [0
>>>> .. drive capacity]. So what is the point ? The current interface handles that.
>>>> That is how we chose between REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL right now.
>>>> 4) Without any in-kernel user, I do not see the point. And for applications, I
>>>> do not see any good use case for doing open all, close all, offline all or
>>>> finish all. If you have any such good use case, please elaborate.
>>>>
>>>
>>> The main use is reset all, but without having to look through all zones,
>>> as it imposes an overhead when we have a large number of zones. Having
>>> the possibility to offload it to HW is more efficient.
>>>
>>> I had not thought about the device mapper use case. Would it be an
>>> option to translate this into REQ_OP_ZONE_RESET_ALL when we have a
>>> device mapper (or any other case where this might break) and then leave
>>> the bit go to the driver if it applies to the whole device?
>>
>> But REQ_OP_ZONE_RESET_ALL is already implemented and supported and will reset
>> all zones of a drive using a single command if the ioctl is called for the
>> entire sector range of a physical drive. For device mapper with a partial
>> mapping, the per zone reset loop will be used. If you have no other use case for
>> the REQ_ZONE_ALL flag, what is the point here ? Reset is already optimized for
>> the all zones case
> 
> OK. I might have missed this. I thought we were sending several commands
> instead of a single reset with the bit. I will check again. Thanks for
> pointing at this.

In block/blk-zoned.c, there is:

static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
                                                sector_t sector,
                                                sector_t nr_sectors)
{
        if (!blk_queue_zone_resetall(bdev_get_queue(bdev)))
                return false;

        /*
         * REQ_OP_ZONE_RESET_ALL can be executed only if the number of sectors
         * of the applicable zone range is the entire disk.
         */
        return !sector && nr_sectors == get_capacity(bdev->bd_disk);
}

And:

int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
                     sector_t sector, sector_t nr_sectors,
                     gfp_t gfp_mask)
{
	...

	while (sector < end_sector) {
                bio = blk_next_bio(bio, 0, gfp_mask);
                bio_set_dev(bio, bdev);

                /*
                 * Special case for the zone reset operation that reset all
                 * zones, this is useful for applications like mkfs.
                 */
                if (op == REQ_OP_ZONE_RESET &&
                    blkdev_allow_reset_all_zones(bdev, sector, nr_sectors)) {
                        bio->bi_opf = REQ_OP_ZONE_RESET_ALL;
                        break;
			^^^^^ This means one command only...

                }

                bio->bi_opf = op | REQ_SYNC;
                bio->bi_iter.bi_sector = sector;
                sector += zone_sectors;

                /* This may take a while, so be nice to others */
                cond_resched();
        }

        ret = submit_bio_wait(bio);
        bio_put(bio);

        return ret;
}

And see scsi/sd_zbc.c and zns.c. REQ_OP_ZONE_RESET_ALL ends up setting the ALL
bit for the reset command.


> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 1/6] block: introduce IOCTL for zone mgmt
  2020-06-26  7:03           ` Damien Le Moal
@ 2020-06-26  7:08             ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26  7:08 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 07:03, Damien Le Moal wrote:
>On 2020/06/26 15:51, Javier González wrote:
>> On 26.06.2020 06:37, Damien Le Moal wrote:
>>> On 2020/06/26 15:01, Javier González wrote:
>>>> On 26.06.2020 01:17, Damien Le Moal wrote:
>>>>> On 2020/06/25 21:22, Javier González wrote:
>>>>>> From: Javier González <javier.gonz@samsung.com>
>>>>>>
>>>>>> The current IOCTL interface for zone management is limited by struct
>>>>>> blk_zone_range, which is unfortunately not extensible. Specially, the
>>>>>> lack of flags is problematic for ZNS zoned devices.
>>>>>>
>>>>>> This new IOCTL is designed to be a superset of the current one, with
>>>>>> support for flags and bits for extensibility.
>>>>>>
>>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>>>> ---
>>>>>>  block/blk-zoned.c             | 56 ++++++++++++++++++++++++++++++++++-
>>>>>>  block/ioctl.c                 |  2 ++
>>>>>>  include/linux/blkdev.h        |  9 ++++++
>>>>>>  include/uapi/linux/blkzoned.h | 33 +++++++++++++++++++++
>>>>>>  4 files changed, 99 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>>>> index 81152a260354..e87c60004dc5 100644
>>>>>> --- a/block/blk-zoned.c
>>>>>> +++ b/block/blk-zoned.c
>>>>>> @@ -322,7 +322,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>   * BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE ioctl processing.
>>>>>>   * Called from blkdev_ioctl.
>>>>>>   */
>>>>>> -int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>> +int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>  			   unsigned int cmd, unsigned long arg)
>>>>>>  {
>>>>>>  	void __user *argp = (void __user *)arg;
>>>>>> @@ -370,6 +370,60 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>  				GFP_KERNEL);
>>>>>>  }
>>>>>>
>>>>>> +/*
>>>>>> + * Zone management ioctl processing. Extension of blkdev_zone_ops_ioctl(), with
>>>>>> + * blk_zone_mgmt structure.
>>>>>> + *
>>>>>> + * Called from blkdev_ioctl.
>>>>>> + */
>>>>>> +int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>> +			   unsigned int cmd, unsigned long arg)
>>>>>> +{
>>>>>> +	void __user *argp = (void __user *)arg;
>>>>>> +	struct request_queue *q;
>>>>>> +	struct blk_zone_mgmt zmgmt;
>>>>>> +	enum req_opf op;
>>>>>> +
>>>>>> +	if (!argp)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	q = bdev_get_queue(bdev);
>>>>>> +	if (!q)
>>>>>> +		return -ENXIO;
>>>>>> +
>>>>>> +	if (!blk_queue_is_zoned(q))
>>>>>> +		return -ENOTTY;
>>>>>> +
>>>>>> +	if (!capable(CAP_SYS_ADMIN))
>>>>>> +		return -EACCES;
>>>>>> +
>>>>>> +	if (!(mode & FMODE_WRITE))
>>>>>> +		return -EBADF;
>>>>>> +
>>>>>> +	if (copy_from_user(&zmgmt, argp, sizeof(struct blk_zone_mgmt)))
>>>>>> +		return -EFAULT;
>>>>>> +
>>>>>> +	switch (zmgmt.action) {
>>>>>> +	case BLK_ZONE_MGMT_CLOSE:
>>>>>> +		op = REQ_OP_ZONE_CLOSE;
>>>>>> +		break;
>>>>>> +	case BLK_ZONE_MGMT_FINISH:
>>>>>> +		op = REQ_OP_ZONE_FINISH;
>>>>>> +		break;
>>>>>> +	case BLK_ZONE_MGMT_OPEN:
>>>>>> +		op = REQ_OP_ZONE_OPEN;
>>>>>> +		break;
>>>>>> +	case BLK_ZONE_MGMT_RESET:
>>>>>> +		op = REQ_OP_ZONE_RESET;
>>>>>> +		break;
>>>>>> +	default:
>>>>>> +		return -ENOTTY;
>>>>>> +	}
>>>>>> +
>>>>>> +	return blkdev_zone_mgmt(bdev, op, zmgmt.sector, zmgmt.nr_sectors,
>>>>>> +				GFP_KERNEL);
>>>>>> +}
>>>>>> +
>>>>>>  static inline unsigned long *blk_alloc_zone_bitmap(int node,
>>>>>>  						   unsigned int nr_zones)
>>>>>>  {
>>>>>> diff --git a/block/ioctl.c b/block/ioctl.c
>>>>>> index bdb3bbb253d9..0ea29754e7dd 100644
>>>>>> --- a/block/ioctl.c
>>>>>> +++ b/block/ioctl.c
>>>>>> @@ -514,6 +514,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>  	case BLKOPENZONE:
>>>>>>  	case BLKCLOSEZONE:
>>>>>>  	case BLKFINISHZONE:
>>>>>> +		return blkdev_zone_ops_ioctl(bdev, mode, cmd, arg);
>>>>>> +	case BLKMGMTZONE:
>>>>>>  		return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>>>>>>  	case BLKGETZONESZ:
>>>>>>  		return put_uint(argp, bdev_zone_sectors(bdev));
>>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>>> index 8fd900998b4e..bd8521f94dc4 100644
>>>>>> --- a/include/linux/blkdev.h
>>>>>> +++ b/include/linux/blkdev.h
>>>>>> @@ -368,6 +368,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>>>>>
>>>>>>  extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>  				     unsigned int cmd, unsigned long arg);
>>>>>> +extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>> +				  unsigned int cmd, unsigned long arg);
>>>>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>  				  unsigned int cmd, unsigned long arg);
>>>>>>
>>>>>> @@ -385,6 +387,13 @@ static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
>>>>>>  	return -ENOTTY;
>>>>>>  }
>>>>>>
>>>>>> +
>>>>>> +static inline int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>> +					unsigned int cmd, unsigned long arg)
>>>>>> +{
>>>>>> +	return -ENOTTY;
>>>>>> +}
>>>>>> +
>>>>>>  static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>>>>>>  					 fmode_t mode, unsigned int cmd,
>>>>>>  					 unsigned long arg)
>>>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>>>> index 42c3366cc25f..07b5fde21d9f 100644
>>>>>> --- a/include/uapi/linux/blkzoned.h
>>>>>> +++ b/include/uapi/linux/blkzoned.h
>>>>>> @@ -142,6 +142,38 @@ struct blk_zone_range {
>>>>>>  	__u64		nr_sectors;
>>>>>>  };
>>>>>>
>>>>>> +/**
>>>>>> + * enum blk_zone_action - Zone state transitions managed from user-space
>>>>>> + *
>>>>>> + * @BLK_ZONE_MGMT_CLOSE: Transition to Closed state
>>>>>> + * @BLK_ZONE_MGMT_FINISH: Transition to Finish state
>>>>>> + * @BLK_ZONE_MGMT_OPEN: Transition to Open state
>>>>>> + * @BLK_ZONE_MGMT_RESET: Transition to Reset state
>>>>>> + */
>>>>>> +enum blk_zone_action {
>>>>>> +	BLK_ZONE_MGMT_CLOSE	= 0x1,
>>>>>> +	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>>>> +	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>>>> +	BLK_ZONE_MGMT_RESET	= 0x4,
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct blk_zone_mgmt - Extended zoned management
>>>>>> + *
>>>>>> + * @action: Zone action as described in enum blk_zone_action
>>>>>> + * @flags: Flags for the action
>>>>>> + * @sector: Starting sector of the first zone to operate on
>>>>>> + * @nr_sectors: Total number of sectors of all zones to operate on
>>>>>> + */
>>>>>> +struct blk_zone_mgmt {
>>>>>> +	__u8		action;
>>>>>> +	__u8		resv3[3];
>>>>>> +	__u32		flags;
>>>>>> +	__u64		sector;
>>>>>> +	__u64		nr_sectors;
>>>>>> +	__u64		resv31;
>>>>>> +};
>>>>>> +
>>>>>>  /**
>>>>>>   * Zoned block device ioctl's:
>>>>>>   *
>>>>>> @@ -166,5 +198,6 @@ struct blk_zone_range {
>>>>>>  #define BLKOPENZONE	_IOW(0x12, 134, struct blk_zone_range)
>>>>>>  #define BLKCLOSEZONE	_IOW(0x12, 135, struct blk_zone_range)
>>>>>>  #define BLKFINISHZONE	_IOW(0x12, 136, struct blk_zone_range)
>>>>>> +#define BLKMGMTZONE	_IOR(0x12, 137, struct blk_zone_mgmt)
>>>>>>
>>>>>>  #endif /* _UAPI_BLKZONED_H */
>>>>>>
>>>>>
>>>>> Without defining what the flags can be, it is hard to judge what will change
>>>>> from the current distinct ioctls.
>>>>>
>>>>
>>>> The first flag is the one to select all. Down the line we have other
>>>> modifiers that make sense, but it is true that it is public yet.
>>>
>>> You mean *not* public ?
>>
>> Yes...
>>
>>>
>>>>
>>>> Would you like to wait until then or is it an option to revise the IOCTL
>>>> now?
>>>
>>> Yes. Wait until it is actually needed. Adding code that has no users makes it
>>> impossible to test so not acceptable. As for the "all zones" flag, I already
>>> commented about it.
>>
>> Ok. We will have this in the backlog then.
>>
>> It would be great if you and Matias would like to comment on it if you
>> have some ideas on how to improve it. Happy to set up a branch somewhere to
>> keep a patchset with this functionality.
>
>I sent a much simpler version of this using a REQ_ZONE_ALL flag too, but driven
>by the specified sector range. That allowed resetting, opening, closing or
>finishing all zones with a single command, much more simply than your patch.
>But as Christoph commented, the only really interesting use case for this is
>reset all (e.g. for FS format). Open, close and finish all zones have no users...

Yes. The use-case here is format too, which might take some time on very
large drives.
>
>Let's see first what kind of flags may be needed in the future, if at all. We
>can then cook something if needed.

Makes sense.

Thanks Damien!

Javier


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-26  6:55           ` Javier González
@ 2020-06-26  7:09             ` Damien Le Moal
  2020-06-26  7:29               ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  7:09 UTC (permalink / raw)
  To: Javier González
  Cc: Keith Busch, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/26 15:55, Javier González wrote:
> On 26.06.2020 06:49, Damien Le Moal wrote:
>> On 2020/06/26 15:13, Javier González wrote:
>>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>>>  1 file changed, 7 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>>> index 7d8381fe7665..de806788a184 100644
>>>>>> --- a/drivers/nvme/host/zns.c
>>>>>> +++ b/drivers/nvme/host/zns.c
>>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>>  		sector += ns->zsze * nz;
>>>>>>  	}
>>>>>>
>>>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>>> +				zone_idx, ns->nr_zones);
>>>>>> +		ret = -EINVAL;
>>>>>> +		goto out_free;
>>>>>> +	}
>>>>>> +
>>>>>>  	ret = zone_idx;
>>>>>
>>>>> nr_zones is unsigned, so it's never < 0.
>>>>>
>>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>>> be requested here.
>>>>>
>>>
>>> I did see nr_zones coming with -1; guess it is my compiler.
>>
>> See include/linux/blkdev.h. -1 is:
>>
>> #define BLK_ALL_ZONES  ((unsigned int)-1)
>>
>> Which is documented in block/blk-zoned.c:
>>
>> /**
>> * blkdev_report_zones - Get zones information
>> * @bdev:       Target block device
>> * @sector:     Sector from which to report zones
>> * @nr_zones:   Maximum number of zones to report
>> * @cb:         Callback function called for each reported zone
>> * @data:       Private data for the callback
>> *
>> * Description:
>> *    Get zone information starting from the zone containing @sector for at most
>> *    @nr_zones, and call @cb for each zone reported by the device.
>> *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
>> *    constant can be passed to @nr_zones.
>> *    Returns the number of zones reported by the device, or a negative errno
>> *    value in case of failure.
>> *
>> *    Note: The caller must use memalloc_noXX_save/restore() calls to control
>> *    memory allocations done within this function.
>> */
>> int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>                        unsigned int nr_zones, report_zones_cb cb, void *data)
>>
>>>
>>>>
>>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>>> reported zone descriptor in the current report range requested by the user,
>>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>>> than the total number of zones of the disk and/or start sector is > 0). So
>>>> zone_idx indicates the actual number of zones reported, it is not the total
>>>
>>> I see. As I can see, when nr_zones comes undefined I believed we could
>>> assume that zone_idx is absolute, but I can be wrong.
>>
>> No. zone_idx is *always* the index of the zone in the current report. Whatever
>> that report is, regardless of the report starting point and number of zones
>> requested. E.g. For a single zone report (nr_zones = 1), you will always see
>> zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>> This is used for example in blk_revalidate_disk_zones() to initialize the zone
>> bitmaps.
>>
>>> Does it make sense to support this check with an additional counter and
>>> an explicit nr_zones initialization when undefined or you
>>> prefer to just remove it as Matias suggested?
>>
>> The check is not needed at all.
>>
>> If the device is buggy and reports more zones than the device capacity or any
>> other bugs, the driver can catch that when it processes the report.
>> blk_revalidate_disk_zones() also has many checks.
> 
> I have managed to create a QEMU ZNS device that gave me a headache with
> a little bit of extra capacity that triggered an additional zone report.
> This was the motivation for the patch.

The device emulation sounds buggy... If the capacity is wrong, then the report
will be too since zones are all supposed to be sequential (no holes between
zones) and up to the disk capacity only (last zone start + len = capacity)

If one or the other is wrong, this should be easy to detect. Normally,
blk_revalidate_disk_zones() should be able to catch that.
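
To make the "no holes, up to the capacity" rule concrete, here is a small
stand-alone sketch of such a check. It is not the code of
blk_revalidate_disk_zones(); struct zone_desc and zones_cover_capacity() are
names invented for this example, and sectors are counted from 0 so that the
last zone ends exactly at the capacity:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a reported zone descriptor (not struct blk_zone). */
struct zone_desc {
	uint64_t start;	/* first sector of the zone */
	uint64_t len;	/* zone length in sectors */
};

/*
 * Return true if the zones of a full report tile the device exactly:
 * the first zone starts at sector 0, each zone starts where the previous
 * one ended, and the last zone ends at the capacity.
 */
static bool zones_cover_capacity(const struct zone_desc *z, size_t nr,
				 uint64_t capacity)
{
	uint64_t expected = 0;

	for (size_t i = 0; i < nr; i++) {
		if (z[i].len == 0 || z[i].start != expected)
			return false;	/* hole, overlap or empty zone */
		expected = z[i].start + z[i].len;
		if (expected > capacity)
			return false;	/* zone runs past the capacity */
	}
	return expected == capacity;
}
```

A device that advertises a bit of extra capacity beyond its last zone, as in
the emulated device described above, fails the final comparison.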

> 
> I will look at the checks to cover this case too then.
> 
> Thanks,
> Javier
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  6:58         ` Javier González
@ 2020-06-26  7:17           ` Damien Le Moal
  2020-06-26  7:26             ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  7:17 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 2020/06/26 15:58, Javier González wrote:
> On 26.06.2020 06:42, Damien Le Moal wrote:
>> On 2020/06/26 15:09, Javier González wrote:
>>> On 26.06.2020 01:34, Damien Le Moal wrote:
>>>> On 2020/06/25 21:22, Javier González wrote:
>>>>> From: Javier González <javier.gonz@samsung.com>
>>>>>
>>>>> Add support for offline transition on the zoned block device using the
>>>>> new zone management IOCTL
>>>>>
>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>>> ---
>>>>>  block/blk-core.c              | 2 ++
>>>>>  block/blk-zoned.c             | 3 +++
>>>>>  drivers/nvme/host/core.c      | 3 +++
>>>>>  include/linux/blk_types.h     | 3 +++
>>>>>  include/linux/blkdev.h        | 1 -
>>>>>  include/uapi/linux/blkzoned.h | 1 +
>>>>>  6 files changed, 12 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index 03252af8c82c..589cbdacc5ec 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>>>>>  	REQ_OP_NAME(ZONE_CLOSE),
>>>>>  	REQ_OP_NAME(ZONE_FINISH),
>>>>>  	REQ_OP_NAME(ZONE_APPEND),
>>>>> +	REQ_OP_NAME(ZONE_OFFLINE),
>>>>>  	REQ_OP_NAME(WRITE_SAME),
>>>>>  	REQ_OP_NAME(WRITE_ZEROES),
>>>>>  	REQ_OP_NAME(SCSI_IN),
>>>>> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>>>>>  	case REQ_OP_ZONE_OPEN:
>>>>>  	case REQ_OP_ZONE_CLOSE:
>>>>>  	case REQ_OP_ZONE_FINISH:
>>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>>  		if (!blk_queue_is_zoned(q))
>>>>>  			goto not_supported;
>>>>>  		break;
>>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>>> index 29194388a1bb..704fc15813d1 100644
>>>>> --- a/block/blk-zoned.c
>>>>> +++ b/block/blk-zoned.c
>>>>> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  	case BLK_ZONE_MGMT_RESET:
>>>>>  		op = REQ_OP_ZONE_RESET;
>>>>>  		break;
>>>>> +	case BLK_ZONE_MGMT_OFFLINE:
>>>>> +		op = REQ_OP_ZONE_OFFLINE;
>>>>> +		break;
>>>>>  	default:
>>>>>  		return -ENOTTY;
>>>>>  	}
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index f1215523792b..5b95c81d2a2d 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>>>>  	case REQ_OP_ZONE_FINISH:
>>>>>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>>>>  		break;
>>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
>>>>> +		break;
>>>>>  	case REQ_OP_WRITE_ZEROES:
>>>>>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>>>>>  		break;
>>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>>> index 16b57fb2b99c..b3921263c3dd 100644
>>>>> --- a/include/linux/blk_types.h
>>>>> +++ b/include/linux/blk_types.h
>>>>> @@ -316,6 +316,8 @@ enum req_opf {
>>>>>  	REQ_OP_ZONE_FINISH	= 12,
>>>>>  	/* write data at the current zone write pointer */
>>>>>  	REQ_OP_ZONE_APPEND	= 13,
>>>>> +	/* Transition a zone to offline */
>>>>> +	REQ_OP_ZONE_OFFLINE	= 14,
>>>>>
>>>>>  	/* SCSI passthrough using struct scsi_request */
>>>>>  	REQ_OP_SCSI_IN		= 32,
>>>>> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>>>>>  	case REQ_OP_ZONE_OPEN:
>>>>>  	case REQ_OP_ZONE_CLOSE:
>>>>>  	case REQ_OP_ZONE_FINISH:
>>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>>  		return true;
>>>>>  	default:
>>>>>  		return false;
>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>> index bd8521f94dc4..8308d8a3720b 100644
>>>>> --- a/include/linux/blkdev.h
>>>>> +++ b/include/linux/blkdev.h
>>>>> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  				  unsigned int cmd, unsigned long arg);
>>>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>  				  unsigned int cmd, unsigned long arg);
>>>>> -
>>>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>>>
>>>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>>> index a8c89fe58f97..d0978ee10fc7 100644
>>>>> --- a/include/uapi/linux/blkzoned.h
>>>>> +++ b/include/uapi/linux/blkzoned.h
>>>>> @@ -155,6 +155,7 @@ enum blk_zone_action {
>>>>>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>>>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>>>> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>>>>>  };
>>>>>
>>>>>  /**
>>>>>
>>>>
>>>> As mentioned in previous email, the usefulness of this is dubious. Please
>>>> elaborate in the commit message. Sure NVMe ZNS defines this and we can support
>>>> it. But without a good use case, what is the point ?
>>>
>>> Use case is to transition zones in read-only state to offline when we
>>> are done moving valid data. It is easier to explicitly manage zones
>>> that are not usable by having them all in the offline state.
>>
>> Then adding a simple BLKZONEOFFLINE ioctl, similar to open, close, finish and
>> reset, would be enough. No need for all the new zone management ioctl with flags
>> plumbing.
> 
> Ok. We can add that then.
> 
> Note that zone management is not motivated by this use case at all, but
> it made sense to implement it here instead of as a new BLKZONEOFFLINE
> IOCTL as ZAC/ZBC users will not be able to use it either way.

Sure, that is fine. We could actually add that to sd_zbc.c since we have zone
tracking there. A read-only zone can be reported as offline to sync-up with zns.
The value of it is dubious though as most applications will treat read-only and
offline zones the same way: as unusable. That is what zonefs does.
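
For illustration, the "treat both as unusable" policy can be written against
the zone conditions already exported in include/uapi/linux/blkzoned.h. This
is only a sketch of what an application might do; zone_is_unusable() is a
made-up helper name:

```c
#include <stdbool.h>
#include <linux/blkzoned.h>  /* struct blk_zone, BLK_ZONE_COND_* */

/*
 * Hypothetical helper: true if the zone is read-only or offline, the two
 * conditions that are lumped together as "unusable" in the text above.
 */
static bool zone_is_unusable(const struct blk_zone *zone)
{
	return zone->cond == BLK_ZONE_COND_READONLY ||
	       zone->cond == BLK_ZONE_COND_OFFLINE;
}
```

With such a policy, an explicit read-only to offline transition does not
change what the application sees.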

> 
>>
>>>
>>>>
>>>> scsi SD driver will return BLK_STS_NOTSUPP if this offlining is sent to a
>>>> ZBC/ZAC drive. Not nice. Having a sysfs attribute "max_offline_zone_sectors" or
>>>> the like to indicate support by the device or not would be nicer.
>>>
>>> We can do that.
>>>
>>>>
>>>> Does offlining ALL zones make any sense ? Because this patch does not prevent the
>>>> use of the REQ_ZONE_ALL flags introduced in patch 2. Probably not a good idea to
>>>> allow offlining all zones, no ?
>>>
>>> AFAIK the transition to offline is only valid when coming from a
>>> read-only state. I did think of adding a check, but I can see that other
>>> transitions go directly to the driver and then the device, so I decided
>>> to follow the same model. If you think it is better, we can add the
>>> check.
>>
>> My point was that the REQ_ZONE_ALL flag would make no sense for offlining zones
>> but this patch does not have anything checking that. There is no point in
>> sending a command that is known to be incorrect to the drive...
> 
> I will add some extra checks then to fail early. I assume these should
> be in the NVMe driver as it is NVMe-specific, right?

If it is a simple BLKZONEOFFLINE ioctl, it can be processed exactly like open,
close and finish, using blkdev_zone_mgmt(). Calling that one for a range of
sectors of more than one zone will likely not make any sense most of the time,
but that is allowed for all other ops, so I guess you can keep it as is for
offline too. blkdev_zone_mgmt() will actually not need any change. You will only
need to wire the ioctl path and update op_is_zone_mgmt(). That's it. Simple that
way.
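
The two steps can be modelled outside the kernel as follows. This is a
user-space sketch only: every MODEL_* identifier and both function names are
invented for the example, the numeric values are not the kernel's, and no
BLKZONEOFFLINE ioctl exists upstream at the time of this thread:

```c
#include <stdbool.h>

/* Model-only operation codes; not the kernel's enum req_opf values. */
enum model_zone_op {
	MODEL_OP_INVALID = 0,
	MODEL_OP_ZONE_RESET,
	MODEL_OP_ZONE_OPEN,
	MODEL_OP_ZONE_CLOSE,
	MODEL_OP_ZONE_FINISH,
	MODEL_OP_ZONE_OFFLINE,	/* the addition under discussion */
};

/* Model-only ioctl identifiers; the offline one is hypothetical. */
enum model_zone_ioctl {
	MODEL_BLKRESETZONE,
	MODEL_BLKOPENZONE,
	MODEL_BLKCLOSEZONE,
	MODEL_BLKFINISHZONE,
	MODEL_BLKZONEOFFLINE,
};

/* Step 1: wire the new ioctl next to the existing ones. */
static enum model_zone_op model_ioctl_to_op(int cmd)
{
	switch (cmd) {
	case MODEL_BLKRESETZONE:
		return MODEL_OP_ZONE_RESET;
	case MODEL_BLKOPENZONE:
		return MODEL_OP_ZONE_OPEN;
	case MODEL_BLKCLOSEZONE:
		return MODEL_OP_ZONE_CLOSE;
	case MODEL_BLKFINISHZONE:
		return MODEL_OP_ZONE_FINISH;
	case MODEL_BLKZONEOFFLINE:
		return MODEL_OP_ZONE_OFFLINE;
	default:
		return MODEL_OP_INVALID;
	}
}

/* Step 2: make the op_is_zone_mgmt()-style predicate accept the new op. */
static bool model_op_is_zone_mgmt(enum model_zone_op op)
{
	switch (op) {
	case MODEL_OP_ZONE_RESET:
	case MODEL_OP_ZONE_OPEN:
	case MODEL_OP_ZONE_CLOSE:
	case MODEL_OP_ZONE_FINISH:
	case MODEL_OP_ZONE_OFFLINE:
		return true;
	default:
		return false;
	}
}
```

The range-walking code in between stays untouched, which is the point of
reusing the existing per-operation ioctl pattern.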

> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  7:17           ` Damien Le Moal
@ 2020-06-26  7:26             ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26  7:26 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 07:17, Damien Le Moal wrote:
>On 2020/06/26 15:58, Javier González wrote:
>> On 26.06.2020 06:42, Damien Le Moal wrote:
>>> On 2020/06/26 15:09, Javier González wrote:
>>>> On 26.06.2020 01:34, Damien Le Moal wrote:
>>>>> On 2020/06/25 21:22, Javier González wrote:
>>>>>> From: Javier González <javier.gonz@samsung.com>
>>>>>>
>>>>>> Add support for offline transition on the zoned block device using the
>>>>>> new zone management IOCTL
>>>>>>
>>>>>> Signed-off-by: Javier González <javier.gonz@samsung.com>
>>>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@samsung.com>
>>>>>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>>>> ---
>>>>>>  block/blk-core.c              | 2 ++
>>>>>>  block/blk-zoned.c             | 3 +++
>>>>>>  drivers/nvme/host/core.c      | 3 +++
>>>>>>  include/linux/blk_types.h     | 3 +++
>>>>>>  include/linux/blkdev.h        | 1 -
>>>>>>  include/uapi/linux/blkzoned.h | 1 +
>>>>>>  6 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>>> index 03252af8c82c..589cbdacc5ec 100644
>>>>>> --- a/block/blk-core.c
>>>>>> +++ b/block/blk-core.c
>>>>>> @@ -140,6 +140,7 @@ static const char *const blk_op_name[] = {
>>>>>>  	REQ_OP_NAME(ZONE_CLOSE),
>>>>>>  	REQ_OP_NAME(ZONE_FINISH),
>>>>>>  	REQ_OP_NAME(ZONE_APPEND),
>>>>>> +	REQ_OP_NAME(ZONE_OFFLINE),
>>>>>>  	REQ_OP_NAME(WRITE_SAME),
>>>>>>  	REQ_OP_NAME(WRITE_ZEROES),
>>>>>>  	REQ_OP_NAME(SCSI_IN),
>>>>>> @@ -1030,6 +1031,7 @@ generic_make_request_checks(struct bio *bio)
>>>>>>  	case REQ_OP_ZONE_OPEN:
>>>>>>  	case REQ_OP_ZONE_CLOSE:
>>>>>>  	case REQ_OP_ZONE_FINISH:
>>>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>>>  		if (!blk_queue_is_zoned(q))
>>>>>>  			goto not_supported;
>>>>>>  		break;
>>>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>>>> index 29194388a1bb..704fc15813d1 100644
>>>>>> --- a/block/blk-zoned.c
>>>>>> +++ b/block/blk-zoned.c
>>>>>> @@ -416,6 +416,9 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>  	case BLK_ZONE_MGMT_RESET:
>>>>>>  		op = REQ_OP_ZONE_RESET;
>>>>>>  		break;
>>>>>> +	case BLK_ZONE_MGMT_OFFLINE:
>>>>>> +		op = REQ_OP_ZONE_OFFLINE;
>>>>>> +		break;
>>>>>>  	default:
>>>>>>  		return -ENOTTY;
>>>>>>  	}
>>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>>> index f1215523792b..5b95c81d2a2d 100644
>>>>>> --- a/drivers/nvme/host/core.c
>>>>>> +++ b/drivers/nvme/host/core.c
>>>>>> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>>>>>>  	case REQ_OP_ZONE_FINISH:
>>>>>>  		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>>>>>>  		break;
>>>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>>> +		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OFFLINE);
>>>>>> +		break;
>>>>>>  	case REQ_OP_WRITE_ZEROES:
>>>>>>  		ret = nvme_setup_write_zeroes(ns, req, cmd);
>>>>>>  		break;
>>>>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>>>>> index 16b57fb2b99c..b3921263c3dd 100644
>>>>>> --- a/include/linux/blk_types.h
>>>>>> +++ b/include/linux/blk_types.h
>>>>>> @@ -316,6 +316,8 @@ enum req_opf {
>>>>>>  	REQ_OP_ZONE_FINISH	= 12,
>>>>>>  	/* write data at the current zone write pointer */
>>>>>>  	REQ_OP_ZONE_APPEND	= 13,
>>>>>> +	/* Transition a zone to offline */
>>>>>> +	REQ_OP_ZONE_OFFLINE	= 14,
>>>>>>
>>>>>>  	/* SCSI passthrough using struct scsi_request */
>>>>>>  	REQ_OP_SCSI_IN		= 32,
>>>>>> @@ -456,6 +458,7 @@ static inline bool op_is_zone_mgmt(enum req_opf op)
>>>>>>  	case REQ_OP_ZONE_OPEN:
>>>>>>  	case REQ_OP_ZONE_CLOSE:
>>>>>>  	case REQ_OP_ZONE_FINISH:
>>>>>> +	case REQ_OP_ZONE_OFFLINE:
>>>>>>  		return true;
>>>>>>  	default:
>>>>>>  		return false;
>>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>>> index bd8521f94dc4..8308d8a3720b 100644
>>>>>> --- a/include/linux/blkdev.h
>>>>>> +++ b/include/linux/blkdev.h
>>>>>> @@ -372,7 +372,6 @@ extern int blkdev_zone_ops_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>  				  unsigned int cmd, unsigned long arg);
>>>>>>  extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>>>>>>  				  unsigned int cmd, unsigned long arg);
>>>>>> -
>>>>>>  #else /* CONFIG_BLK_DEV_ZONED */
>>>>>>
>>>>>>  static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>>>>>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>>>>>> index a8c89fe58f97..d0978ee10fc7 100644
>>>>>> --- a/include/uapi/linux/blkzoned.h
>>>>>> +++ b/include/uapi/linux/blkzoned.h
>>>>>> @@ -155,6 +155,7 @@ enum blk_zone_action {
>>>>>>  	BLK_ZONE_MGMT_FINISH	= 0x2,
>>>>>>  	BLK_ZONE_MGMT_OPEN	= 0x3,
>>>>>>  	BLK_ZONE_MGMT_RESET	= 0x4,
>>>>>> +	BLK_ZONE_MGMT_OFFLINE	= 0x5,
>>>>>>  };
>>>>>>
>>>>>>  /**
>>>>>>
>>>>>
>>>>> As mentioned in previous email, the usefulness of this is dubious. Please
>>>>> elaborate in the commit message. Sure NVMe ZNS defines this and we can support
>>>>> it. But without a good use case, what is the point ?
>>>>
>>>> Use case is to transition zones in read-only state to offline when we
>>>> are done moving valid data. It is easier to explicitly manage zones
>>>> that are not usable by having them all under the offline state.
>>>
>>> Then adding a simple BLKZONEOFFLINE ioctl, similar to open, close, finish and
>>> reset, would be enough. No need for all the new zone management ioctl with flags
>>> plumbing.
>>
>> Ok. We can add that then.
>>
>> Note that zone management is not motivated by this use case at all, but
>> it made sense to implement it here instead of as a new BLKZONEOFFLINE
>> IOCTL as ZAC/ZBC users will not be able to use it either way.
>
>Sure, that is fine. We could actually add that to sd_zbc.c since we have zone
>tracking there. A read-only zone can be reported as offline to sync-up with zns.
>The value of it is dubious though as most applications will treat read-only and
>offline zones the same way: as unusable. That is what zonefs does.

Ok.

>
>>
>>>
>>>>
>>>>>
>>>>> scsi SD driver will return BLK_STS_NOTSUPP if this offlining is sent to a
>>>>> ZBC/ZAC drive. Not nice. Having a sysfs attribute "max_offline_zone_sectors" or
>>>>> the like to indicate support by the device or not would be nicer.
>>>>
>>>> We can do that.
>>>>
>>>>>
>>>>> Does offlining ALL zones make any sense ? Because this patch does not prevent the
>>>>> use of the REQ_ZONE_ALL flag introduced in patch 2. Probably not a good idea to
>>>>> allow offlining all zones, no ?
>>>>
>>>> AFAIK the transition to offline is only valid when coming from a
>>>> read-only state. I did think of adding a check, but I can see that other
>>>> transitions go directly to the driver and then the device, so I decided
>>>> to follow the same model. If you think it is better, we can add the
>>>> check.
>>>
>>> My point was that the REQ_ZONE_ALL flag would make no sense for offlining zones
>>> but this patch does not have anything checking that. There is no point in
>>> sending a command that is known to be incorrect to the drive...
>>
>> I will add some extra checks then to fail early. I assume these should
>> be in the NVMe driver as it is NVMe-specific, right?
>
>If it is a simple BLKZONEOFFLINE ioctl, it can be processed exactly like open,
>close and finish, using blkdev_zone_mgmt(). Calling that one for a range of
>sectors of more than one zone will likely not make any sense most of the time,
>but that is allowed for all other ops, so I guess you can keep it as is for
>offline too. blkdev_zone_mgmt() will actually not need any change. You will only
>need to wire the ioctl path and update op_is_zone_mgmt(). That's it. Simple that
>way.

Sounds good.

Javier
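
For illustration, the "fail early" check raised above could look roughly like the following. This is only a sketch under assumptions: zone_offline_check() and its all_zones parameter are hypothetical names, not part of the patch, and the thread ends up preferring to leave blkdev_zone_mgmt() unchanged. The zone condition values mirror those in include/uapi/linux/blkzoned.h.

```c
#include <errno.h>
#include <stdbool.h>

/* Zone condition values, mirroring include/uapi/linux/blkzoned.h. */
enum zone_cond {
	ZONE_COND_READONLY = 0xD,
	ZONE_COND_FULL     = 0xE,
	ZONE_COND_OFFLINE  = 0xF,
};

/*
 * Hypothetical early check: return 0 if an offline request is worth
 * sending to the device, or -EINVAL if it is known to be invalid.
 */
static int zone_offline_check(enum zone_cond cond, bool all_zones)
{
	/* Offlining every zone at once makes no sense. */
	if (all_zones)
		return -EINVAL;
	/* Only a read-only zone may transition to offline. */
	if (cond != ZONE_COND_READONLY)
		return -EINVAL;
	return 0;
}
```

A caller would run this before building the request, so that a command known to be incorrect never reaches the drive.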


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-26  7:09             ` Damien Le Moal
@ 2020-06-26  7:29               ` Javier González
  2020-06-26  7:42                 ` Damien Le Moal
  0 siblings, 1 reply; 70+ messages in thread
From: Javier González @ 2020-06-26  7:29 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Keith Busch, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 26.06.2020 07:09, Damien Le Moal wrote:
>On 2020/06/26 15:55, Javier González wrote:
>> On 26.06.2020 06:49, Damien Le Moal wrote:
>>> On 2020/06/26 15:13, Javier González wrote:
>>>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>>>>  1 file changed, 7 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>>>> index 7d8381fe7665..de806788a184 100644
>>>>>>> --- a/drivers/nvme/host/zns.c
>>>>>>> +++ b/drivers/nvme/host/zns.c
>>>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>>>  		sector += ns->zsze * nz;
>>>>>>>  	}
>>>>>>>
>>>>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>>>> +				zone_idx, ns->nr_zones);
>>>>>>> +		ret = -EINVAL;
>>>>>>> +		goto out_free;
>>>>>>> +	}
>>>>>>> +
>>>>>>>  	ret = zone_idx;
>>>>>>
>>>>>> nr_zones is unsigned, so it's never < 0.
>>>>>>
>>>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>>>> be requested here.
>>>>>>
>>>>
>>>> I did see nr_zones coming with -1; guess it is my compiler.
>>>
>>> See include/linux/blkdev.h. -1 is:
>>>
>>> #define BLK_ALL_ZONES  ((unsigned int)-1)
>>>
>>> Which is documented in block/blk-zoned.c:
>>>
>>> /**
>>> * blkdev_report_zones - Get zones information
>>> * @bdev:       Target block device
>>> * @sector:     Sector from which to report zones
>>> * @nr_zones:   Maximum number of zones to report
>>> * @cb:         Callback function called for each reported zone
>>> * @data:       Private data for the callback
>>> *
>>> * Description:
>>> *    Get zone information starting from the zone containing @sector for at most
>>> *    @nr_zones, and call @cb for each zone reported by the device.
>>> *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
>>> *    constant can be passed to @nr_zones.
>>> *    Returns the number of zones reported by the device, or a negative errno
>>> *    value in case of failure.
>>> *
>>> *    Note: The caller must use memalloc_noXX_save/restore() calls to control
>>> *    memory allocations done within this function.
>>> */
>>> int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>>                        unsigned int nr_zones, report_zones_cb cb, void *data)
>>>
>>>>
>>>>>
>>>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>>>> reported zone descriptor in the current report range requested by the user,
>>>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>>>> than the total number of zones of the disk and/or start sector is > 0). So
>>>>> zone_idx indicates the actual number of zones reported, it is not the total
>>>>
>>>> I see. As I can see, when nr_zones comes undefined I believed we could
>>>> assume that zone_idx is absolute, but I can be wrong.
>>>
>>> No. zone_idx is *always* the index of the zone in the current report. Whatever
>>> that report is, regardless of the report starting point and number of zones
>>> requested. E.g. For a single zone report (nr_zones = 1), you will always see
>>> zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>>> This is used for example in blk_revalidate_disk_zones() to initialize the zone
>>> bitmaps.
>>>
>>>> Does it make sense to support this check with an additional counter and
>>>> a explicit nr_zones initialization when undefined or you
>>>> prefer to just remove it as Matias suggested?
>>>
>>> The check is not needed at all.
>>>
>>> If the device is buggy and reports more zones than the device capacity or any
>>> other bugs, the driver can catch that when it processes the report.
>>> blk_revalidate_disk_zones() also has many checks.
>>
>> I have managed to create a QEMU ZNS device that gave me a headache with
>> a little bit of extra capacity that triggered an additional zone report.
>> This was the motivation for the patch.
>
>The device emulation sound buggy... If the capacity is wrong, then the report
>will be too since zones are all supposed to be sequential (no holes between
>zones) and up to the disk capacity only (last zone start + len = capacity + 1)
>
>If one or the other is wrong, this should be easy to detect. Normally,
>blk_revalidate_disk_zones() should be able to catch that.

We have the capability to select the reported device capacity manually
for a number of reasons. One of the different test configurations in our
CI did go through.

But it is OK, I will remove the check on V2.

Javier
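
A small sketch of the point made about nr_zones: because the parameter is unsigned, a caller passing -1 arrives as the largest unsigned value, so a "< 0" test can never be true and the all-zones case has to be detected by comparing against BLK_ALL_ZONES. The helper name wants_all_zones() is hypothetical and used only for illustration.

```c
#include <stdbool.h>

/* Same definition as the one quoted from include/linux/blkdev.h. */
#define BLK_ALL_ZONES	((unsigned int)-1)

/*
 * nr_zones is unsigned, so "nr_zones < 0" is always false.
 * The all-zones request is recognised by equality instead.
 */
static bool wants_all_zones(unsigned int nr_zones)
{
	return nr_zones == BLK_ALL_ZONES;
}
```

Note that even with this comparison, a full report that starts at a non-zero sector returns fewer zones than the namespace total, so the equality check in the patch would still be wrong.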


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-26  7:29               ` Javier González
@ 2020-06-26  7:42                 ` Damien Le Moal
  0 siblings, 0 replies; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  7:42 UTC (permalink / raw)
  To: Javier González
  Cc: Keith Busch, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 2020/06/26 16:29, Javier González wrote:
> On 26.06.2020 07:09, Damien Le Moal wrote:
>> On 2020/06/26 15:55, Javier González wrote:
>>> On 26.06.2020 06:49, Damien Le Moal wrote:
>>>> On 2020/06/26 15:13, Javier González wrote:
>>>>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>>>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>>>>  drivers/nvme/host/zns.c | 7 +++++++
>>>>>>>>  1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>>>>> index 7d8381fe7665..de806788a184 100644
>>>>>>>> --- a/drivers/nvme/host/zns.c
>>>>>>>> +++ b/drivers/nvme/host/zns.c
>>>>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>>>>  		sector += ns->zsze * nz;
>>>>>>>>  	}
>>>>>>>>
>>>>>>>> +	if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>>>>> +		dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>>>>> +				zone_idx, ns->nr_zones);
>>>>>>>> +		ret = -EINVAL;
>>>>>>>> +		goto out_free;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>>  	ret = zone_idx;
>>>>>>>
>>>>>>> nr_zones is unsigned, so it's never < 0.
>>>>>>>
>>>>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>>>>> be requested here.
>>>>>>>
>>>>>
>>>>> I did see nr_zones coming with -1; guess it is my compiler.
>>>>
>>>> See include/linux/blkdev.h. -1 is:
>>>>
>>>> #define BLK_ALL_ZONES  ((unsigned int)-1)
>>>>
>>>> Which is documented in block/blk-zoned.c:
>>>>
>>>> /**
>>>> * blkdev_report_zones - Get zones information
>>>> * @bdev:       Target block device
>>>> * @sector:     Sector from which to report zones
>>>> * @nr_zones:   Maximum number of zones to report
>>>> * @cb:         Callback function called for each reported zone
>>>> * @data:       Private data for the callback
>>>> *
>>>> * Description:
>>>> *    Get zone information starting from the zone containing @sector for at most
>>>> *    @nr_zones, and call @cb for each zone reported by the device.
>>>> *    To report all zones in a device starting from @sector, the BLK_ALL_ZONES
>>>> *    constant can be passed to @nr_zones.
>>>> *    Returns the number of zones reported by the device, or a negative errno
>>>> *    value in case of failure.
>>>> *
>>>> *    Note: The caller must use memalloc_noXX_save/restore() calls to control
>>>> *    memory allocations done within this function.
>>>> */
>>>> int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>>>                        unsigned int nr_zones, report_zones_cb cb, void *data)
>>>>
>>>>>
>>>>>>
>>>>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>>>>> reported zone descriptor in the current report range requested by the user,
>>>>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>>>>> than the total number of zones of the disk and/or start sector is > 0). So
>>>>>> zone_idx indicates the actual number of zones reported, it is not the total
>>>>>
>>>>> I see. As I can see, when nr_zones comes undefined I believed we could
>>>>> assume that zone_idx is absolute, but I can be wrong.
>>>>
>>>> No. zone_idx is *always* the index of the zone in the current report. Whatever
>>>> that report is, regardless of the report starting point and number of zones
>>>> requested. E.g. For a single zone report (nr_zones = 1), you will always see
>>>> zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>>>> This is used for example in blk_revalidate_disk_zones() to initialize the zone
>>>> bitmaps.
>>>>
>>>>> Does it make sense to support this check with an additional counter and
>>>>> a explicit nr_zones initialization when undefined or you
>>>>> prefer to just remove it as Matias suggested?
>>>>
>>>> The check is not needed at all.
>>>>
>>>> If the device is buggy and reports more zones than the device capacity or any
>>>> other bugs, the driver can catch that when it processes the report.
>>>> blk_revalidate_disk_zones() also has many checks.
>>>
>>> I have managed to create a QEMU ZNS device that gave me a headache with
>>> a little bit of extra capacity that triggered an additional zone report.
>>> This was the motivation for the patch.
>>
>> The device emulation sound buggy... If the capacity is wrong, then the report
>> will be too since zones are all supposed to be sequential (no holes between
>> zones) and up to the disk capacity only (last zone start + len = capacity + 1)
>>
>> If one or the other is wrong, this should be easy to detect. Normally,
>> blk_revalidate_disk_zones() should be able to catch that.
> 
> We have the capability to select the reported device capacity manually
> for a number of reasons. One of the different test configurations in our
> CI did go through.

If you change the drive capacity on the fly (e.g. with a low level format?),
you must revalidate the disk/drive to get the changed capacity. A lot of things
will break otherwise. It is not just report zones that will be incorrect.

> 
> But it is OK, I will remove the check on V2.
> 
> Javier
> 


-- 
Damien Le Moal
Western Digital Research
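
The zone_idx semantics described above can be modelled with a toy report loop. This is a sketch, not the driver code: toy_report_zones(), record_last() and struct last_seen are hypothetical names, and zones are identified by number rather than by sector to keep the example short.

```c
#define BLK_ALL_ZONES	((unsigned int)-1)

/* Callback receives the absolute zone number and the index in the report. */
typedef int (*report_cb)(unsigned int zone_no, unsigned int idx, void *data);

struct last_seen {
	unsigned int zone_no;
	unsigned int idx;
};

/* Example callback: remember the last zone that was reported. */
static int record_last(unsigned int zone_no, unsigned int idx, void *data)
{
	struct last_seen *last = data;

	last->zone_no = zone_no;
	last->idx = idx;
	return 0;
}

/*
 * Report at most nr_zones zones starting at zone "first" on a device
 * with "total" zones. idx always counts from 0 within this report.
 * Returns the number of zones reported, or the callback's error.
 */
static int toy_report_zones(unsigned int first, unsigned int nr_zones,
			    unsigned int total, report_cb cb, void *data)
{
	unsigned int idx = 0;

	while (idx < nr_zones && first + idx < total) {
		int ret = cb(first + idx, idx, data);

		if (ret)
			return ret;
		idx++;
	}
	return (int)idx;
}
```

With 8 zones, a single-zone report starting at zone 5 ends with idx 0, and a BLK_ALL_ZONES report starting at zone 6 returns 2 rather than 8. Only a full report from zone 0 makes the index match the zone number.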


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-25 14:12   ` Matias Bjørling
  2020-06-25 19:48     ` Javier González
@ 2020-06-26  9:07     ` Christoph Hellwig
  1 sibling, 0 replies; 70+ messages in thread
From: Christoph Hellwig @ 2020-06-26  9:07 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Javier González, linux-nvme, linux-block, hch, kbusch, sagi,
	axboe, Javier González, SelvaKumar S, Kanchan Joshi,
	Nitesh Shetty

On Thu, Jun 25, 2020 at 04:12:21PM +0200, Matias Bjørling wrote:
> I am not sure this makes sense to expose through the kernel zone api. One 
> of the goals of the kernel zone API is to be a layer that provides an 
> unified zone model across SMR HDDs and ZNS SSDs. The offline zone 
> operation, as defined in the ZNS specification, does not have an equivalent 
> in SMR HDDs (ZAC/ZBC).
>
> This is different from the Zone Capacity change, where the zone capacity 
> simply was zone size for SMR HDDs. Making it easy to support. That is not 
> the same for ZAC/ZBC, that does not offer the offline operation to 
> transition zones in read only state to offline state.

Bullshit.  It is exactly the same case of careful additions to the model,
which totally make sense.

The only major issue with the patch is that we need a flag to indicate
if a given device supports offlining zones before wiring it up.


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  1:14       ` Damien Le Moal
  2020-06-26  6:18         ` Javier González
@ 2020-06-26  9:11         ` hch
  2020-06-26  9:15           ` Damien Le Moal
  1 sibling, 1 reply; 70+ messages in thread
From: hch @ 2020-06-26  9:11 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Javier González, Matias Bjørling, linux-nvme,
	linux-block, hch, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On Fri, Jun 26, 2020 at 01:14:30AM +0000, Damien Le Moal wrote:
> As long as you keep ZNS namespace report itself as being "host-managed" like
> ZBC/ZAC disks, we need the consistency and common interface. If you break that,
> the meaning of the zoned model queue attribute disappears and an application or
> in-kernel user cannot rely on this model anymore to know how the drive will behave.

We just need a way to expose to applications that new features are
supported.  Just like we did with zone capacity support.  That is why
we added the feature flags to the uapi zone structure.

> Think of a file system, or any other in-kernel user. If they have to change
> their code based on the device type (NVMe vs SCSI), then the zoned block device
> interface is broken. Right now, that is not the case, everything works equally
> well on ZNS and SCSI, modulo the need by a user for conventional zones that ZNS
> do not define. But that is still consistent with the host-managed model since
> conventional zones are optional.

That is why we have the feature flag.  That user should not know the
underlying transport or spec.  But it can reliably discover "this thing
supports zone capacity" or "this thing supports offline zones" or even
nasty things like "this zone can time out when open" which are much
harder to deal with.

> For this particular patch, there is currently no in-kernel user, and it is not
> clear how this will be useful to applications. At least please clarify this. And

The main user is the ioctl.  And if you think about how offline zones are
(supposed to be) used, driving this from management tools in userspace
actually totally makes sense.  Unlike for example open/close all which
just don't make sense as primitives to start with.
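
As a rough sketch of the feature-flag idea, a user could test a capability bit without knowing whether the device is NVMe or SCSI. The flag names and zone_dev_supports() below are hypothetical and chosen for illustration; they are not the names used in the kernel uapi.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits; not the names used in the kernel uapi. */
enum zone_feat {
	ZONE_FEAT_CAPACITY = 1u << 0,	/* device reports zone capacity */
	ZONE_FEAT_OFFLINE  = 1u << 1,	/* device supports offlining zones */
};

/* Return true if the device advertises the given feature. */
static bool zone_dev_supports(uint32_t flags, enum zone_feat feat)
{
	return (flags & (uint32_t)feat) != 0;
}
```

An application would issue the offline operation only when zone_dev_supports(flags, ZONE_FEAT_OFFLINE) is true, and otherwise treat read-only zones as unusable.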


* Re: [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-25 12:21 ` [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct Javier González
  2020-06-25 15:13   ` Matias Bjørling
  2020-06-26  1:45   ` Damien Le Moal
@ 2020-06-26  9:14   ` Christoph Hellwig
  2020-06-26 10:01     ` Javier González
  2 siblings, 1 reply; 70+ messages in thread
From: Christoph Hellwig @ 2020-06-26  9:14 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe,
	Javier González, SelvaKumar S, Kanchan Joshi, Nitesh Shetty

> + * Zone Attributes
> + */
> +enum blk_zone_attr {
> +	BLK_ZONE_ATTR_ZFC	= 1 << 0,
> +	BLK_ZONE_ATTR_FZR	= 1 << 1,
> +	BLK_ZONE_ATTR_RZR	= 1 << 2,
> +	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
> +};

I have no ^$^$#$%#% idea what this is supposed to be.  If you add
userspace ABIs you need to document them in detail.  Until I've seen
that documentation I can't even comment if they make sense.
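
For context, a documented form of the enum might look like the sketch below. The bit positions are taken from the quoted patch. The expansions of the abbreviations are an assumption based on the zone attribute names in the NVMe ZNS specification and should be checked against it; zone_attr_test() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/**
 * enum blk_zone_attr - Zone attributes (bit values from the quoted patch)
 * @BLK_ZONE_ATTR_ZFC:  Zone Finished by Controller (assumed expansion)
 * @BLK_ZONE_ATTR_FZR:  Finish Zone Recommended (assumed expansion)
 * @BLK_ZONE_ATTR_RZR:  Reset Zone Recommended (assumed expansion)
 * @BLK_ZONE_ATTR_ZDEV: Zone Descriptor Extension Valid (assumed expansion)
 */
enum blk_zone_attr {
	BLK_ZONE_ATTR_ZFC	= 1 << 0,
	BLK_ZONE_ATTR_FZR	= 1 << 1,
	BLK_ZONE_ATTR_RZR	= 1 << 2,
	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
};

/* Return true if the attribute byte has the given attribute set. */
static bool zone_attr_test(uint8_t attrs, enum blk_zone_attr attr)
{
	return (attrs & (uint8_t)attr) != 0;
}
```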


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  9:11         ` hch
@ 2020-06-26  9:15           ` Damien Le Moal
  2020-06-26  9:17             ` hch
  0 siblings, 1 reply; 70+ messages in thread
From: Damien Le Moal @ 2020-06-26  9:15 UTC (permalink / raw)
  To: hch
  Cc: Javier González, Matias Bjørling, linux-nvme,
	linux-block, kbusch, sagi, axboe, SelvaKumar S, Kanchan Joshi,
	Nitesh Shetty

On 2020/06/26 18:11, hch@lst.de wrote:
> On Fri, Jun 26, 2020 at 01:14:30AM +0000, Damien Le Moal wrote:
>> As long as you keep ZNS namespace report itself as being "host-managed" like
>> ZBC/ZAC disks, we need the consistency and common interface. If you break that,
>> the meaning of the zoned model queue attribute disappears and an application or
>> in-kernel user cannot rely on this model anymore to know how the drive will behave.
> 
> We just need a way to expose to applications that new feature are
> supported.  Just like we did with zone capacity support.  That is why
> we added the feature flags to uapi zone structure.
> 
>> Think of a file system, or any other in-kernel user. If they have to change
>> their code based on the device type (NVMe vs SCSI), then the zoned block device
>> interface is broken. Right now, that is not the case, everything works equally
>> well on ZNS and SCSI, modulo the need by a user for conventional zones that ZNS
>> do not define. But that is still consistent with the host-managed model since
>> conventional zones are optional.
> 
> That is why we have the feature flag.  That user should not know the
> underlying transport or spec.  But it can reliably discover "this thing
> support zone capacity" or "this thing supports offline zones" or even
> nasty thing like "this zone can time out when open" which are much
> harder to deal with.
> 
>> For this particular patch, there is currently no in-kernel user, and it is not
>> clear how this will be useful to applications. At least please clarify this. And
> 
> The main user is the ioctl.  And if you think about how offline zones are
> (suppose to) be used driving this from management tools in userspace
> actually totally make sense.  Unlike for example open/close all which
> just don't make sense as primitives to start with.

OK. Adding a new BLKZONEOFFLINE ioctl is easy though and fits into the current
zone management plumbing well, I think. So the patch can be significantly
simplified (no need for the new zone management op function with flags).

> 


-- 
Damien Le Moal
Western Digital Research
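
To make the "reuse the existing plumbing" suggestion concrete, here is a toy model of how an ioctl command could be mapped to a zone management operation next to the existing ones. It is a sketch only: the TOY_ prefixed names are stand-ins defined locally, and BLKZONEOFFLINE is a proposal in this thread, not an existing ioctl.

```c
#include <errno.h>

/* Local stand-ins for the ioctl commands; values are arbitrary. */
enum toy_ioctl {
	TOY_BLKRESETZONE,
	TOY_BLKOPENZONE,
	TOY_BLKCLOSEZONE,
	TOY_BLKFINISHZONE,
	TOY_BLKZONEOFFLINE,	/* proposed in this thread */
};

/* Local stand-ins for the request operations. */
enum toy_op {
	TOY_OP_ZONE_RESET,
	TOY_OP_ZONE_OPEN,
	TOY_OP_ZONE_CLOSE,
	TOY_OP_ZONE_FINISH,
	TOY_OP_ZONE_OFFLINE,
};

/* Map an ioctl command to an operation; -ENOTTY for unknown commands. */
static int toy_ioctl_to_op(unsigned int cmd, enum toy_op *op)
{
	switch (cmd) {
	case TOY_BLKRESETZONE:
		*op = TOY_OP_ZONE_RESET;
		break;
	case TOY_BLKOPENZONE:
		*op = TOY_OP_ZONE_OPEN;
		break;
	case TOY_BLKCLOSEZONE:
		*op = TOY_OP_ZONE_CLOSE;
		break;
	case TOY_BLKFINISHZONE:
		*op = TOY_OP_ZONE_FINISH;
		break;
	case TOY_BLKZONEOFFLINE:
		*op = TOY_OP_ZONE_OFFLINE;
		break;
	default:
		return -ENOTTY;
	}
	return 0;
}
```

The offline case is one more arm in the same switch, which is why no separate management function with flags is needed for it.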


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-25 12:21 ` [PATCH 6/6] nvme: Add consistency check for zone count Javier González
  2020-06-25 13:16   ` Matias Bjørling
  2020-06-25 21:49   ` Keith Busch
@ 2020-06-26  9:16   ` Christoph Hellwig
  2020-06-26 10:03     ` Javier González
  2 siblings, 1 reply; 70+ messages in thread
From: Christoph Hellwig @ 2020-06-26  9:16 UTC (permalink / raw)
  To: Javier González
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe,
	Javier González, SelvaKumar S, Kanchan Joshi, Nitesh Shetty

As a bunch of folks noticed, I don't think zone_idx does what you think it
does.  That being said I'm very happy about any kind of validation
(except maybe in a super hot path).  So if we want to validate the
zone number we can do that, just not as in this patch.


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  9:15           ` Damien Le Moal
@ 2020-06-26  9:17             ` hch
  2020-06-26 10:02               ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: hch @ 2020-06-26  9:17 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: hch, Javier González, Matias Bjørling, linux-nvme,
	linux-block, kbusch, sagi, axboe, SelvaKumar S, Kanchan Joshi,
	Nitesh Shetty

On Fri, Jun 26, 2020 at 09:15:14AM +0000, Damien Le Moal wrote:
> On 2020/06/26 18:11, hch@lst.de wrote:
> > On Fri, Jun 26, 2020 at 01:14:30AM +0000, Damien Le Moal wrote:
> >> As long as you keep ZNS namespace report itself as being "host-managed" like
> >> ZBC/ZAC disks, we need the consistency and common interface. If you break that,
> >> the meaning of the zoned model queue attribute disappears and an application or
> >> in-kernel user cannot rely on this model anymore to know how the drive will behave.
> > 
> > We just need a way to expose to applications that new feature are
> > supported.  Just like we did with zone capacity support.  That is why
> > we added the feature flags to uapi zone structure.
> > 
> >> Think of a file system, or any other in-kernel user. If they have to change
> >> their code based on the device type (NVMe vs SCSI), then the zoned block device
> >> interface is broken. Right now, that is not the case, everything works equally
> >> well on ZNS and SCSI, modulo the need by a user for conventional zones that ZNS
> >> do not define. But that is still consistent with the host-managed model since
> >> conventional zones are optional.
> > 
> > That is why we have the feature flag.  That user should not know the
> > underlying transport or spec.  But it can reliably discover "this thing
> > support zone capacity" or "this thing supports offline zones" or even
> > nasty thing like "this zone can time out when open" which are much
> > harder to deal with.
> > 
> >> For this particular patch, there is currently no in-kernel user, and it is not
> >> clear how this will be useful to applications. At least please clarify this. And
> > 
> > The main user is the ioctl.  And if you think about how offline zones are
> > (suppose to) be used driving this from management tools in userspace
> > actually totally make sense.  Unlike for example open/close all which
> > just don't make sense as primitives to start with.
> 
> OK. Adding a new BLKZONEOFFLINE ioctl is easy though and fits into the current
> zone management plumbing well, I think. So the patch can be significantly
> simplified (no need for the new zone management op function with flags).

Yes, I'm all for reusing the existing plumbing and style as much as
possible.


* Re: [PATCH 5/6] block: add zone attr. to zone mgmt IOCTL struct
  2020-06-26  9:14   ` Christoph Hellwig
@ 2020-06-26 10:01     ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26 10:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-block, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 11:14, Christoph Hellwig wrote:
>> + * Zone Attributes
>> + */
>> +enum blk_zone_attr {
>> +	BLK_ZONE_ATTR_ZFC	= 1 << 0,
>> +	BLK_ZONE_ATTR_FZR	= 1 << 1,
>> +	BLK_ZONE_ATTR_RZR	= 1 << 2,
>> +	BLK_ZONE_ATTR_ZDEV	= 1 << 7,
>> +};
>
>I have no ^$^$#$%#% idea what this is supposed to be.  If you add
>userspace ABIs you need to document them in detail.  Until I've seen
>that documentation I can't even comment if they make sense.

Understood. I'll add the ZAC/ZBC attributes and document this properly
on V2.


* Re: [PATCH 3/6] block: add support for zone offline transition
  2020-06-26  9:17             ` hch
@ 2020-06-26 10:02               ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26 10:02 UTC (permalink / raw)
  To: hch
  Cc: Damien Le Moal, Matias Bjørling, linux-nvme, linux-block,
	kbusch, sagi, axboe, SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On 26.06.2020 11:17, hch@lst.de wrote:
>On Fri, Jun 26, 2020 at 09:15:14AM +0000, Damien Le Moal wrote:
>> On 2020/06/26 18:11, hch@lst.de wrote:
>> > On Fri, Jun 26, 2020 at 01:14:30AM +0000, Damien Le Moal wrote:
>> >> As long as you keep ZNS namespace report itself as being "host-managed" like
>> >> ZBC/ZAC disks, we need the consistency and common interface. If you break that,
>> >> the meaning of the zoned model queue attribute disappears and an application or
>> >> in-kernel user cannot rely on this model anymore to know how the drive will behave.
>> >
>> > We just need a way to expose to applications that new feature are
>> > supported.  Just like we did with zone capacity support.  That is why
>> > we added the feature flags to uapi zone structure.
>> >
>> >> Think of a file system, or any other in-kernel user. If they have to change
>> >> their code based on the device type (NVMe vs SCSI), then the zoned block device
>> >> interface is broken. Right now, that is not the case, everything works equally
>> >> well on ZNS and SCSI, modulo the need by a user for conventional zones that ZNS
>> >> do not define. But that is still consistent with the host-managed model since
>> >> conventional zones are optional.
>> >
>> > That is why we have the feature flag.  That user should not know the
>> > underlying transport or spec.  But it can reliably discover "this thing
>> > support zone capacity" or "this thing supports offline zones" or even
>> > nasty thing like "this zone can time out when open" which are much
>> > harder to deal with.
>> >
>> >> For this particular patch, there is currently no in-kernel user, and it is not
>> >> clear how this will be useful to applications. At least please clarify this. And
>> >
>> > The main user is the ioctl.  And if you think about how offline zones are
>> > (suppose to) be used driving this from management tools in userspace
>> > actually totally make sense.  Unlike for example open/close all which
>> > just don't make sense as primitives to start with.
>>
>> OK. Adding a new BLKZONEOFFLINE ioctl is easy though and fits into the current
>> zone management plumbing well, I think. So the patch can be significantly
>> simplified (no need for the new zone management op function with flags).
>
>Yes, I'm all for reusing the existing plumbing and style as much as
>possible.

OK. Will use the current path on V2.

Thanks!
Javier


* Re: [PATCH 6/6] nvme: Add consistency check for zone count
  2020-06-26  9:16   ` Christoph Hellwig
@ 2020-06-26 10:03     ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26 10:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-block, kbusch, sagi, axboe, SelvaKumar S,
	Kanchan Joshi, Nitesh Shetty

On 26.06.2020 11:16, Christoph Hellwig wrote:
>As a bunch of folks noticed I don't zone_idx does what you think it
>does.  That being said I'm very happy about any kind of validation
>(except maybe in a supper hot path).  So if we want to validate the
>zone number we can do, just not as in this patch.

Ok. I will send a proper implementation of it and then we see if it
fits.



* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-26  6:28         ` Javier González
@ 2020-06-26 15:52           ` Keith Busch
  2020-06-26 16:25             ` Javier González
  0 siblings, 1 reply; 70+ messages in thread
From: Keith Busch @ 2020-06-26 15:52 UTC (permalink / raw)
  To: Javier González
  Cc: Matias Bjørling, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty

On Fri, Jun 26, 2020 at 08:28:28AM +0200, Javier González wrote:
> On 26.06.2020 05:25, Keith Busch wrote:
> > On Thu, Jun 25, 2020 at 09:42:48PM +0200, Javier González wrote:
> > > We can use nvme passthru, but this bypasses the zoned block abstraction.
> > > > Why not represent ZNS features in the standard zoned block API?
> > 
> > This looks too nvme zns specific to want the block layer in the middle.
> > Just use the driver's passthrough interface.
> 
> Ok. Is it OK with you to expose them in sysfs as Damien suggested?

sysfs sounds fine to me for some attributes (ex: mar/mor), but I don't
think every zns property warrants exposing through this interface. I
just don't want to corner driver maintenance into chasing every spec
field. If applications really want every last detail, they can get it
directly from the device without any kernel dependencies.

* Re: [PATCH 4/6] block: introduce IOCTL to report dev properties
  2020-06-26 15:52           ` Keith Busch
@ 2020-06-26 16:25             ` Javier González
  0 siblings, 0 replies; 70+ messages in thread
From: Javier González @ 2020-06-26 16:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matias Bjørling, linux-nvme, linux-block, hch, sagi, axboe,
	SelvaKumar S, Kanchan Joshi, Nitesh Shetty


> On 26 Jun 2020, at 17.52, Keith Busch <kbusch@kernel.org> wrote:
> 
> On Fri, Jun 26, 2020 at 08:28:28AM +0200, Javier González wrote:
>>> On 26.06.2020 05:25, Keith Busch wrote:
>>> On Thu, Jun 25, 2020 at 09:42:48PM +0200, Javier González wrote:
>>>> We can use nvme passthru, but this bypasses the zoned block abstraction.
>>>> Why not represent ZNS features in the standard zoned block API?
>>> 
>>> This looks too nvme zns specific to want the block layer in the middle.
>>> Just use the driver's passthrough interface.
>> 
>> Ok. Is it OK with you to expose them in sysfs as Damien suggested?
> 
> sysfs sounds fine to me for some attributes (ex: mar/mor), but I don't
> think every zns property warrants exposing through this interface. I
> just don't want to corner driver maintenance into chasing every spec
> field. If applications really want every last detail, they can get it
> directly from the device without any kernel dependencies.

Ok. I think Niklas already sent a patch with mar/mor. I will check which ones we depend on in user space.
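
For reference, a minimal user-space sketch of the sysfs route discussed
above. It assumes the limits show up as single decimal values under
/sys/block/<disk>/queue/, as other queue attributes do. The attribute
names max_open_zones and max_active_zones (for mor/mar) are an assumption
here, since this thread does not fix them, and the helper names
parse_queue_attr() and read_queue_attr() are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse one unsigned decimal value in the form sysfs queue attributes
 * are printed ("14\n"). Returns 0 on success, -1 on malformed input.
 */
static int parse_queue_attr(const char *text, unsigned long *value)
{
	char *end;
	unsigned long v;

	assert(text && value);

	/* strtoul() would accept "-1" and leading blanks; reject both. */
	if (!isdigit((unsigned char)text[0]))
		return -1;

	errno = 0;
	v = strtoul(text, &end, 10);
	if (errno)
		return -1;
	/* Only a trailing newline (or nothing) may follow the digits. */
	if (*end != '\0' && *end != '\n')
		return -1;

	*value = v;
	return 0;
}

/*
 * Read <sysfs_root>/block/<disk>/queue/<attr>.
 * sysfs_root is normally "/sys"; it is a parameter so the helper can be
 * pointed at a test directory. Returns 0 on success, -1 otherwise.
 */
static int read_queue_attr(const char *sysfs_root, const char *disk,
			   const char *attr, unsigned long *value)
{
	char path[256], buf[64];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/block/%s/queue/%s",
		     sysfs_root, disk, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;	/* attribute not exposed by this kernel/device */

	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	return parse_queue_attr(buf, value);
}
```

A caller would do something like
read_queue_attr("/sys", "nvme0n1", "max_open_zones", &v) and treat a
failure as "not exposed". Anything that is not exposed this way would
still have to come from the driver's passthrough interface, as Keith
suggests.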

