Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* [PATCHv3 0/5] nvme support for zoned namespace command set
@ 2020-06-22 16:25 Keith Busch
  2020-06-22 16:25 ` [PATCHv3 1/5] block: add capacity field to zone descriptors Keith Busch
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-22 16:25 UTC (permalink / raw)
  To: linux-nvme, hch, sagi, linux-block, axboe; +Cc: Keith Busch

v2->v3:

  Added warnings for unsupported ZNS drives (Klaus)

  Fixed double newline

  Added reviews

Background:

The NVM Express workgroup has ratified technical proposals enabling new
command sets. The specifications may be viewed from the following link:

  https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip

This series implements support for the Zoned Namespace (ZNS) Command Set
defined in TP4053, and the Namespace Types base support it depends on
from TP4056. As this series depends on the block layer's append support
infrastructure, append-capable ZNS devices are required for this patch
sets enabling.

The block layer is updated to include the new zone writeable capacity
feature from ZNS, and existing zone block device drivers are updated to
incorporate this feature.

Aravind Ramesh (1):
  null_blk: introduce zone capacity for zoned device

Keith Busch (2):
  nvme: support for multi-command set effects
  nvme: support for zoned namespaces

Matias Bjørling (1):
  block: add capacity field to zone descriptors

Niklas Cassel (1):
  nvme: implement I/O Command Sets Command Set support

 block/Kconfig                  |   5 +-
 block/blk-zoned.c              |   1 +
 drivers/block/null_blk.h       |   1 +
 drivers/block/null_blk_main.c  |  10 +-
 drivers/block/null_blk_zoned.c |  16 ++-
 drivers/nvme/host/Makefile     |   1 +
 drivers/nvme/host/core.c       | 218 +++++++++++++++++++++++------
 drivers/nvme/host/hwmon.c      |   2 +-
 drivers/nvme/host/lightnvm.c   |   4 +-
 drivers/nvme/host/multipath.c  |   2 +-
 drivers/nvme/host/nvme.h       |  50 ++++++-
 drivers/nvme/host/zns.c        | 245 +++++++++++++++++++++++++++++++++
 drivers/scsi/sd_zbc.c          |   1 +
 include/linux/nvme.h           | 137 +++++++++++++++++-
 include/uapi/linux/blkzoned.h  |  15 +-
 15 files changed, 654 insertions(+), 54 deletions(-)
 create mode 100644 drivers/nvme/host/zns.c

-- 
2.24.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCHv3 1/5] block: add capacity field to zone descriptors
  2020-06-22 16:25 [PATCHv3 0/5] nvme support for zoned namespace command set Keith Busch
@ 2020-06-22 16:25 ` Keith Busch
  2020-06-23  6:15   ` Hannes Reinecke
                     ` (2 more replies)
  2020-06-22 16:25 ` [PATCHv3 2/5] null_blk: introduce zone capacity for zoned device Keith Busch
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-22 16:25 UTC (permalink / raw)
  To: linux-nvme, hch, sagi, linux-block, axboe
  Cc: Matias Bjørling, Chaitanya Kulkarni, Javier González,
	Daniel Wagner, Johannes Thumshirn, Martin K . Petersen

From: Matias Bjørling <matias.bjorling@wdc.com>

In the zoned storage model, the sectors within a zone are typically all
writeable. With the introduction of the Zoned Namespace (ZNS) Command
Set in the NVM Express organization, the model was extended to have a
specific writeable capacity.

Extend the zone descriptor data structure with a zone capacity field to
indicate to the user how many sectors in a zone are writeable.

Introduce backward compatibility in the zone report ioctl by extending
the zone report header data structure with a flags field to indicate if
the capacity field is available.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
---
 block/blk-zoned.c              |  1 +
 drivers/block/null_blk_zoned.c |  2 ++
 drivers/scsi/sd_zbc.c          |  1 +
 include/uapi/linux/blkzoned.h  | 15 +++++++++++++--
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 23831fa8701d..81152a260354 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -312,6 +312,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
 		return ret;
 
 	rep.nr_zones = ret;
+	rep.flags = BLK_ZONE_REP_CAPACITY;
 	if (copy_to_user(argp, &rep, sizeof(struct blk_zone_report)))
 		return -EFAULT;
 	return 0;
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index cc47606d8ffe..624aac09b005 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -47,6 +47,7 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 
 		zone->start = sector;
 		zone->len = dev->zone_size_sects;
+		zone->capacity = zone->len;
 		zone->wp = zone->start + zone->len;
 		zone->type = BLK_ZONE_TYPE_CONVENTIONAL;
 		zone->cond = BLK_ZONE_COND_NOT_WP;
@@ -59,6 +60,7 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 
 		zone->start = zone->wp = sector;
 		zone->len = dev->zone_size_sects;
+		zone->capacity = zone->len;
 		zone->type = BLK_ZONE_TYPE_SEQWRITE_REQ;
 		zone->cond = BLK_ZONE_COND_EMPTY;
 
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 6f7eba66687e..183a20720da9 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -59,6 +59,7 @@ static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 		zone.non_seq = 1;
 
 	zone.len = logical_to_sectors(sdp, get_unaligned_be64(&buf[8]));
+	zone.capacity = zone.len;
 	zone.start = logical_to_sectors(sdp, get_unaligned_be64(&buf[16]));
 	zone.wp = logical_to_sectors(sdp, get_unaligned_be64(&buf[24]));
 	if (zone.type != ZBC_ZONE_TYPE_CONV &&
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index 0cdef67135f0..42c3366cc25f 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -73,6 +73,15 @@ enum blk_zone_cond {
 	BLK_ZONE_COND_OFFLINE	= 0xF,
 };
 
+/**
+ * enum blk_zone_report_flags - Feature flags of reported zone descriptors.
+ *
+ * @BLK_ZONE_REP_CAPACITY: Zone descriptor has capacity field.
+ */
+enum blk_zone_report_flags {
+	BLK_ZONE_REP_CAPACITY	= (1 << 0),
+};
+
 /**
  * struct blk_zone - Zone descriptor for BLKREPORTZONE ioctl.
  *
@@ -99,7 +108,9 @@ struct blk_zone {
 	__u8	cond;		/* Zone condition */
 	__u8	non_seq;	/* Non-sequential write resources active */
 	__u8	reset;		/* Reset write pointer recommended */
-	__u8	reserved[36];
+	__u8	resv[4];
+	__u64	capacity;	/* Zone capacity in number of sectors */
+	__u8	reserved[24];
 };
 
 /**
@@ -115,7 +126,7 @@ struct blk_zone {
 struct blk_zone_report {
 	__u64		sector;
 	__u32		nr_zones;
-	__u8		reserved[4];
+	__u32		flags;
 	struct blk_zone zones[0];
 };
 
-- 
2.24.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCHv3 2/5] null_blk: introduce zone capacity for zoned device
  2020-06-22 16:25 [PATCHv3 0/5] nvme support for zoned namespace command set Keith Busch
  2020-06-22 16:25 ` [PATCHv3 1/5] block: add capacity field to zone descriptors Keith Busch
@ 2020-06-22 16:25 ` Keith Busch
  2020-06-23  6:16   ` Hannes Reinecke
  2020-06-23  8:45   ` Sagi Grimberg
  2020-06-22 16:25 ` [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support Keith Busch
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-22 16:25 UTC (permalink / raw)
  To: linux-nvme, hch, sagi, linux-block, axboe
  Cc: Aravind Ramesh, Chaitanya Kulkarni, Matias Bjørling,
	Daniel Wagner, Martin K . Petersen, Johannes Thumshirn

From: Aravind Ramesh <aravind.ramesh@wdc.com>

Allow emulation of a zoned device with a per zone capacity smaller than
the zone size as as defined in the Zoned Namespace (ZNS) Command Set
specification. The zone capacity defaults to the zone size if not
specified and must be smaller than the zone size otherwise.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
---
 drivers/block/null_blk.h       |  1 +
 drivers/block/null_blk_main.c  | 10 +++++++++-
 drivers/block/null_blk_zoned.c | 16 ++++++++++++++--
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/block/null_blk.h b/drivers/block/null_blk.h
index 81b311c9d781..daed4a9c3436 100644
--- a/drivers/block/null_blk.h
+++ b/drivers/block/null_blk.h
@@ -49,6 +49,7 @@ struct nullb_device {
 	unsigned long completion_nsec; /* time in ns to complete a request */
 	unsigned long cache_size; /* disk cache size in MB */
 	unsigned long zone_size; /* zone size in MB if device is zoned */
+	unsigned long zone_capacity; /* zone capacity in MB if device is zoned */
 	unsigned int zone_nr_conv; /* number of conventional zones */
 	unsigned int submit_queues; /* number of submission queues */
 	unsigned int home_node; /* home node for the device */
diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 87b31f9ca362..a2a0e199215b 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -200,6 +200,10 @@ static unsigned long g_zone_size = 256;
 module_param_named(zone_size, g_zone_size, ulong, S_IRUGO);
 MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Must be power-of-two: Default: 256");
 
+static unsigned long g_zone_capacity;
+module_param_named(zone_capacity, g_zone_capacity, ulong, 0444);
+MODULE_PARM_DESC(zone_capacity, "Zone capacity in MB when block device is zoned. Can be less than or equal to zone size. Default: Zone size");
+
 static unsigned int g_zone_nr_conv;
 module_param_named(zone_nr_conv, g_zone_nr_conv, uint, 0444);
 MODULE_PARM_DESC(zone_nr_conv, "Number of conventional zones when block device is zoned. Default: 0");
@@ -341,6 +345,7 @@ NULLB_DEVICE_ATTR(mbps, uint, NULL);
 NULLB_DEVICE_ATTR(cache_size, ulong, NULL);
 NULLB_DEVICE_ATTR(zoned, bool, NULL);
 NULLB_DEVICE_ATTR(zone_size, ulong, NULL);
+NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
 NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
 
 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
@@ -457,6 +462,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_badblocks,
 	&nullb_device_attr_zoned,
 	&nullb_device_attr_zone_size,
+	&nullb_device_attr_zone_capacity,
 	&nullb_device_attr_zone_nr_conv,
 	NULL,
 };
@@ -510,7 +516,8 @@ nullb_group_drop_item(struct config_group *group, struct config_item *item)
 
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
-	return snprintf(page, PAGE_SIZE, "memory_backed,discard,bandwidth,cache,badblocks,zoned,zone_size,zone_nr_conv\n");
+	return snprintf(page, PAGE_SIZE,
+			"memory_backed,discard,bandwidth,cache,badblocks,zoned,zone_size,zone_capacity,zone_nr_conv\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -571,6 +578,7 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->use_per_node_hctx = g_use_per_node_hctx;
 	dev->zoned = g_zoned;
 	dev->zone_size = g_zone_size;
+	dev->zone_capacity = g_zone_capacity;
 	dev->zone_nr_conv = g_zone_nr_conv;
 	return dev;
 }
diff --git a/drivers/block/null_blk_zoned.c b/drivers/block/null_blk_zoned.c
index 624aac09b005..3d25c9ad2383 100644
--- a/drivers/block/null_blk_zoned.c
+++ b/drivers/block/null_blk_zoned.c
@@ -28,6 +28,15 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 		return -EINVAL;
 	}
 
+	if (!dev->zone_capacity)
+		dev->zone_capacity = dev->zone_size;
+
+	if (dev->zone_capacity > dev->zone_size) {
+		pr_err("null_blk: zone capacity (%lu MB) larger than zone size (%lu MB)\n",
+					dev->zone_capacity, dev->zone_size);
+		return -EINVAL;
+	}
+
 	dev->zone_size_sects = dev->zone_size << ZONE_SIZE_SHIFT;
 	dev->nr_zones = dev_size >>
 				(SECTOR_SHIFT + ilog2(dev->zone_size_sects));
@@ -60,7 +69,7 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 
 		zone->start = zone->wp = sector;
 		zone->len = dev->zone_size_sects;
-		zone->capacity = zone->len;
+		zone->capacity = dev->zone_capacity << ZONE_SIZE_SHIFT;
 		zone->type = BLK_ZONE_TYPE_SEQWRITE_REQ;
 		zone->cond = BLK_ZONE_COND_EMPTY;
 
@@ -187,6 +196,9 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 			return BLK_STS_IOERR;
 		}
 
+		if (zone->wp + nr_sectors > zone->start + zone->capacity)
+			return BLK_STS_IOERR;
+
 		if (zone->cond != BLK_ZONE_COND_EXP_OPEN)
 			zone->cond = BLK_ZONE_COND_IMP_OPEN;
 
@@ -195,7 +207,7 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 			return ret;
 
 		zone->wp += nr_sectors;
-		if (zone->wp == zone->start + zone->len)
+		if (zone->wp == zone->start + zone->capacity)
 			zone->cond = BLK_ZONE_COND_FULL;
 		return BLK_STS_OK;
 	default:
-- 
2.24.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-22 16:25 [PATCHv3 0/5] nvme support for zoned namespace command set Keith Busch
  2020-06-22 16:25 ` [PATCHv3 1/5] block: add capacity field to zone descriptors Keith Busch
  2020-06-22 16:25 ` [PATCHv3 2/5] null_blk: introduce zone capacity for zoned device Keith Busch
@ 2020-06-22 16:25 ` Keith Busch
  2020-06-23  6:20   ` Hannes Reinecke
  2020-06-23  8:53   ` Sagi Grimberg
  2020-06-22 16:25 ` [PATCHv3 4/5] nvme: support for multi-command set effects Keith Busch
  2020-06-22 16:25 ` [PATCHv3 5/5] nvme: support for zoned namespaces Keith Busch
  4 siblings, 2 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-22 16:25 UTC (permalink / raw)
  To: linux-nvme, hch, sagi, linux-block, axboe
  Cc: Niklas Cassel, Javier González, Martin K . Petersen,
	Johannes Thumshirn, Matias Bjørling, Daniel Wagner

From: Niklas Cassel <niklas.cassel@wdc.com>

Implements support for the I/O Command Sets command set. The command set
introduces a method to enumerate multiple command sets per namespace. If
the command set is exposed, this method for enumeration will be used
instead of the traditional method that uses the CC.CSS register command
set register for command set identification.

For namespaces where the Command Set Identifier is not supported or
recognized, the specific namespace will not be created.

Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/nvme/host/core.c | 48 +++++++++++++++++++++++++++++++++-------
 drivers/nvme/host/nvme.h |  1 +
 include/linux/nvme.h     | 19 ++++++++++++++--
 3 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9491dbcfe81a..45a3cb5a35bd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1056,8 +1056,13 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
 	return error;
 }
 
+static bool nvme_multi_css(struct nvme_ctrl *ctrl)
+{
+	return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
+}
+
 static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
-		struct nvme_ns_id_desc *cur)
+		struct nvme_ns_id_desc *cur, bool *csi_seen)
 {
 	const char *warn_str = "ctrl returned bogus length:";
 	void *data = cur;
@@ -1087,6 +1092,15 @@ static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
 		}
 		uuid_copy(&ids->uuid, data + sizeof(*cur));
 		return NVME_NIDT_UUID_LEN;
+	case NVME_NIDT_CSI:
+		if (cur->nidl != NVME_NIDT_CSI_LEN) {
+			dev_warn(ctrl->device, "%s %d for NVME_NIDT_CSI\n",
+				 warn_str, cur->nidl);
+			return -1;
+		}
+		memcpy(&ids->csi, data + sizeof(*cur), NVME_NIDT_CSI_LEN);
+		*csi_seen = true;
+		return NVME_NIDT_CSI_LEN;
 	default:
 		/* Skip unknown types */
 		return cur->nidl;
@@ -1097,10 +1111,9 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 		struct nvme_ns_ids *ids)
 {
 	struct nvme_command c = { };
-	int status;
+	bool csi_seen = false;
+	int status, pos, len;
 	void *data;
-	int pos;
-	int len;
 
 	c.identify.opcode = nvme_admin_identify;
 	c.identify.nsid = cpu_to_le32(nsid);
@@ -1130,13 +1143,19 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
 		if (cur->nidl == 0)
 			break;
 
-		len = nvme_process_ns_desc(ctrl, ids, cur);
+		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
 		if (len < 0)
 			goto free_data;
 
 		len += sizeof(*cur);
 	}
 free_data:
+	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
+		dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
+			 nsid);
+		status = -EINVAL;
+	}
+
 	kfree(data);
 	return status;
 }
@@ -1792,7 +1811,7 @@ static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
 		memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
 	if (ctrl->vs >= NVME_VS(1, 2, 0))
 		memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
-	if (ctrl->vs >= NVME_VS(1, 3, 0))
+	if (ctrl->vs >= NVME_VS(1, 3, 0) || nvme_multi_css(ctrl))
 		return nvme_identify_ns_descs(ctrl, nsid, ids);
 	return 0;
 }
@@ -1808,7 +1827,8 @@ static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
 		memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
-		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
+		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0 &&
+		a->csi == b->csi;
 }
 
 static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
@@ -1930,6 +1950,15 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ns->lba_shift == 0)
 		ns->lba_shift = 9;
 
+	switch (ns->head->ids.csi) {
+	case NVME_CSI_NVM:
+		break;
+	default:
+		dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
+			ns->head->ids.csi, ns->head->ns_id);
+		return -ENODEV;
+	}
+
 	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) &&
 	    is_power_of_2(ctrl->max_hw_sectors))
 		iob = ctrl->max_hw_sectors;
@@ -2264,7 +2293,10 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl)
 
 	ctrl->page_size = 1 << page_shift;
 
-	ctrl->ctrl_config = NVME_CC_CSS_NVM;
+	if (NVME_CAP_CSS(ctrl->cap) & NVME_CAP_CSS_CSI)
+		ctrl->ctrl_config = NVME_CC_CSS_CSI;
+	else
+		ctrl->ctrl_config = NVME_CC_CSS_NVM;
 	ctrl->ctrl_config |= (page_shift - 12) << NVME_CC_MPS_SHIFT;
 	ctrl->ctrl_config |= NVME_CC_AMS_RR | NVME_CC_SHN_NONE;
 	ctrl->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c0f4226d3299..a84f71459caa 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -339,6 +339,7 @@ struct nvme_ns_ids {
 	u8	eui64[8];
 	u8	nguid[16];
 	uuid_t	uuid;
+	u8	csi;
 };
 
 /*
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 5ce51ab4c50e..81ffe5247505 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -132,6 +132,7 @@ enum {
 #define NVME_CAP_TIMEOUT(cap)	(((cap) >> 24) & 0xff)
 #define NVME_CAP_STRIDE(cap)	(((cap) >> 32) & 0xf)
 #define NVME_CAP_NSSRC(cap)	(((cap) >> 36) & 0x1)
+#define NVME_CAP_CSS(cap)	(((cap) >> 37) & 0xff)
 #define NVME_CAP_MPSMIN(cap)	(((cap) >> 48) & 0xf)
 #define NVME_CAP_MPSMAX(cap)	(((cap) >> 52) & 0xf)
 
@@ -162,7 +163,6 @@ enum {
 
 enum {
 	NVME_CC_ENABLE		= 1 << 0,
-	NVME_CC_CSS_NVM		= 0 << 4,
 	NVME_CC_EN_SHIFT	= 0,
 	NVME_CC_CSS_SHIFT	= 4,
 	NVME_CC_MPS_SHIFT	= 7,
@@ -170,6 +170,9 @@ enum {
 	NVME_CC_SHN_SHIFT	= 14,
 	NVME_CC_IOSQES_SHIFT	= 16,
 	NVME_CC_IOCQES_SHIFT	= 20,
+	NVME_CC_CSS_NVM		= 0 << NVME_CC_CSS_SHIFT,
+	NVME_CC_CSS_CSI		= 6 << NVME_CC_CSS_SHIFT,
+	NVME_CC_CSS_MASK	= 7 << NVME_CC_CSS_SHIFT,
 	NVME_CC_AMS_RR		= 0 << NVME_CC_AMS_SHIFT,
 	NVME_CC_AMS_WRRU	= 1 << NVME_CC_AMS_SHIFT,
 	NVME_CC_AMS_VS		= 7 << NVME_CC_AMS_SHIFT,
@@ -179,6 +182,8 @@ enum {
 	NVME_CC_SHN_MASK	= 3 << NVME_CC_SHN_SHIFT,
 	NVME_CC_IOSQES		= NVME_NVM_IOSQES << NVME_CC_IOSQES_SHIFT,
 	NVME_CC_IOCQES		= NVME_NVM_IOCQES << NVME_CC_IOCQES_SHIFT,
+	NVME_CAP_CSS_NVM	= 1 << 0,
+	NVME_CAP_CSS_CSI	= 1 << 6,
 	NVME_CSTS_RDY		= 1 << 0,
 	NVME_CSTS_CFS		= 1 << 1,
 	NVME_CSTS_NSSRO		= 1 << 4,
@@ -374,6 +379,8 @@ enum {
 	NVME_ID_CNS_CTRL		= 0x01,
 	NVME_ID_CNS_NS_ACTIVE_LIST	= 0x02,
 	NVME_ID_CNS_NS_DESC_LIST	= 0x03,
+	NVME_ID_CNS_CS_NS		= 0x05,
+	NVME_ID_CNS_CS_CTRL		= 0x06,
 	NVME_ID_CNS_NS_PRESENT_LIST	= 0x10,
 	NVME_ID_CNS_NS_PRESENT		= 0x11,
 	NVME_ID_CNS_CTRL_NS_LIST	= 0x12,
@@ -383,6 +390,10 @@ enum {
 	NVME_ID_CNS_UUID_LIST		= 0x17,
 };
 
+enum {
+	NVME_CSI_NVM			= 0,
+};
+
 enum {
 	NVME_DIR_IDENTIFY		= 0x00,
 	NVME_DIR_STREAMS		= 0x01,
@@ -435,11 +446,13 @@ struct nvme_ns_id_desc {
 #define NVME_NIDT_EUI64_LEN	8
 #define NVME_NIDT_NGUID_LEN	16
 #define NVME_NIDT_UUID_LEN	16
+#define NVME_NIDT_CSI_LEN	1
 
 enum {
 	NVME_NIDT_EUI64		= 0x01,
 	NVME_NIDT_NGUID		= 0x02,
 	NVME_NIDT_UUID		= 0x03,
+	NVME_NIDT_CSI		= 0x04,
 };
 
 struct nvme_smart_log {
@@ -972,7 +985,9 @@ struct nvme_identify {
 	__u8			cns;
 	__u8			rsvd3;
 	__le16			ctrlid;
-	__u32			rsvd11[5];
+	__u8			rsvd11[3];
+	__u8			csi;
+	__u32			rsvd12[4];
 };
 
 #define NVME_IDENTIFY_DATA_SIZE 4096
-- 
2.24.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCHv3 4/5] nvme: support for multi-command set effects
  2020-06-22 16:25 [PATCHv3 0/5] nvme support for zoned namespace command set Keith Busch
                   ` (2 preceding siblings ...)
  2020-06-22 16:25 ` [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support Keith Busch
@ 2020-06-22 16:25 ` Keith Busch
  2020-06-23  6:21   ` Hannes Reinecke
  2020-06-23 17:43   ` Sagi Grimberg
  2020-06-22 16:25 ` [PATCHv3 5/5] nvme: support for zoned namespaces Keith Busch
  4 siblings, 2 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-22 16:25 UTC (permalink / raw)
  To: linux-nvme, hch, sagi, linux-block, axboe
  Cc: Keith Busch, Javier González, Martin K . Petersen,
	Johannes Thumshirn, Matias Bjørling, Daniel Wagner

From: Keith Busch <keith.busch@wdc.com>

The Commands Supported and Effects log page was extended with a CSI
field that enables the host to query the log page for each command set
supported. Retrieve this log page for each command set that an attached
namespace supports, and save a pointer to that log in the namespace head.

Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <keith.busch@wdc.com>
---
 drivers/nvme/host/core.c      | 79 ++++++++++++++++++++++++++---------
 drivers/nvme/host/hwmon.c     |  2 +-
 drivers/nvme/host/lightnvm.c  |  4 +-
 drivers/nvme/host/multipath.c |  2 +-
 drivers/nvme/host/nvme.h      | 10 ++++-
 include/linux/nvme.h          |  4 +-
 6 files changed, 76 insertions(+), 25 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 45a3cb5a35bd..64b7b9fc2817 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1363,8 +1363,8 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	u32 effects = 0;
 
 	if (ns) {
-		if (ctrl->effects)
-			effects = le32_to_cpu(ctrl->effects->iocs[opcode]);
+		if (ns->head->effects)
+			effects = le32_to_cpu(ns->head->effects->iocs[opcode]);
 		if (effects & ~(NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC))
 			dev_warn(ctrl->device,
 				 "IO command:%02x has unhandled effects:%08x\n",
@@ -2844,7 +2844,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	return ret;
 }
 
-int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
+int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi,
 		void *log, size_t size, u64 offset)
 {
 	struct nvme_command c = { };
@@ -2858,27 +2858,55 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
 	c.get_log_page.numdu = cpu_to_le16(dwlen >> 16);
 	c.get_log_page.lpol = cpu_to_le32(lower_32_bits(offset));
 	c.get_log_page.lpou = cpu_to_le32(upper_32_bits(offset));
+	c.get_log_page.csi = csi;
 
 	return nvme_submit_sync_cmd(ctrl->admin_q, &c, log, size);
 }
 
-static int nvme_get_effects_log(struct nvme_ctrl *ctrl)
+struct nvme_cel *nvme_find_cel(struct nvme_ctrl *ctrl, u8 csi)
 {
+	struct nvme_cel *cel, *ret = NULL;
+
+	spin_lock(&ctrl->lock);
+	list_for_each_entry(cel, &ctrl->cels, entry) {
+		if (cel->csi == csi) {
+			ret = cel;
+			break;
+		}
+	}
+	spin_unlock(&ctrl->lock);
+
+	return ret;
+}
+
+static int nvme_get_effects_log(struct nvme_ctrl *ctrl, u8 csi,
+				struct nvme_effects_log **log)
+{
+	struct nvme_cel *cel = nvme_find_cel(ctrl, csi);
 	int ret;
 
-	if (!ctrl->effects)
-		ctrl->effects = kzalloc(sizeof(*ctrl->effects), GFP_KERNEL);
+	if (cel)
+		goto out;
 
-	if (!ctrl->effects)
-		return 0;
+	cel = kzalloc(sizeof(*cel), GFP_KERNEL);
+	if (!cel)
+		return -ENOMEM;
 
-	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_CMD_EFFECTS, 0,
-			ctrl->effects, sizeof(*ctrl->effects), 0);
+	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_CMD_EFFECTS, 0, csi,
+			&cel->log, sizeof(cel->log), 0);
 	if (ret) {
-		kfree(ctrl->effects);
-		ctrl->effects = NULL;
+		kfree(cel);
+		return ret;
 	}
-	return ret;
+
+	cel->csi = csi;
+
+	spin_lock(&ctrl->lock);
+	list_add_tail(&cel->entry, &ctrl->cels);
+	spin_unlock(&ctrl->lock);
+out:
+	*log = &cel->log;
+	return 0;
 }
 
 /*
@@ -2911,7 +2939,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	}
 
 	if (id->lpa & NVME_CTRL_LPA_CMD_EFFECTS_LOG) {
-		ret = nvme_get_effects_log(ctrl);
+		ret = nvme_get_effects_log(ctrl, NVME_CSI_NVM, &ctrl->effects);
 		if (ret < 0)
 			goto out_free;
 	}
@@ -3544,6 +3572,13 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 		goto out_cleanup_srcu;
 	}
 
+	if (head->ids.csi) {
+		ret = nvme_get_effects_log(ctrl, head->ids.csi, &head->effects);
+		if (ret)
+			goto out_cleanup_srcu;
+	} else
+		head->effects = ctrl->effects;
+
 	ret = nvme_mpath_alloc_disk(ctrl, head);
 	if (ret)
 		goto out_cleanup_srcu;
@@ -3884,8 +3919,8 @@ static void nvme_clear_changed_ns_log(struct nvme_ctrl *ctrl)
 	 * raced with us in reading the log page, which could cause us to miss
 	 * updates.
 	 */
-	error = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_CHANGED_NS, 0, log,
-			log_size, 0);
+	error = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_CHANGED_NS, 0,
+			NVME_CSI_NVM, log, log_size, 0);
 	if (error)
 		dev_warn(ctrl->device,
 			"reading changed ns log failed: %d\n", error);
@@ -4029,8 +4064,8 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
 	if (!log)
 		return;
 
-	if (nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_FW_SLOT, 0, log,
-			sizeof(*log), 0))
+	if (nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_FW_SLOT, 0, NVME_CSI_NVM,
+			log, sizeof(*log), 0))
 		dev_warn(ctrl->device, "Get FW SLOT INFO log error\n");
 	kfree(log);
 }
@@ -4167,11 +4202,16 @@ static void nvme_free_ctrl(struct device *dev)
 	struct nvme_ctrl *ctrl =
 		container_of(dev, struct nvme_ctrl, ctrl_device);
 	struct nvme_subsystem *subsys = ctrl->subsys;
+	struct nvme_cel *cel, *next;
 
 	if (subsys && ctrl->instance != subsys->instance)
 		ida_simple_remove(&nvme_instance_ida, ctrl->instance);
 
-	kfree(ctrl->effects);
+	list_for_each_entry_safe(cel, next, &ctrl->cels, entry) {
+		list_del(&cel->entry);
+		kfree(cel);
+	}
+
 	nvme_mpath_uninit(ctrl);
 	__free_page(ctrl->discard_page);
 
@@ -4202,6 +4242,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	spin_lock_init(&ctrl->lock);
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
+	INIT_LIST_HEAD(&ctrl->cels);
 	init_rwsem(&ctrl->namespaces_rwsem);
 	ctrl->dev = dev;
 	ctrl->ops = ops;
diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c
index 2e6477ed420f..23ba8bf678ae 100644
--- a/drivers/nvme/host/hwmon.c
+++ b/drivers/nvme/host/hwmon.c
@@ -62,7 +62,7 @@ static int nvme_hwmon_get_smart_log(struct nvme_hwmon_data *data)
 	int ret;
 
 	ret = nvme_get_log(data->ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
-			   &data->log, sizeof(data->log), 0);
+			   NVME_CSI_NVM, &data->log, sizeof(data->log), 0);
 
 	return ret <= 0 ? ret : -EIO;
 }
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 69608755d415..8e562d0f2c30 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -593,8 +593,8 @@ static int nvme_nvm_get_chk_meta(struct nvm_dev *ndev,
 		dev_meta_off = dev_meta;
 
 		ret = nvme_get_log(ctrl, ns->head->ns_id,
-				NVME_NVM_LOG_REPORT_CHUNK, 0, dev_meta, len,
-				offset);
+				NVME_NVM_LOG_REPORT_CHUNK, 0, NVME_CSI_NVM,
+				dev_meta, len, offset);
 		if (ret) {
 			dev_err(ctrl->device, "Get REPORT CHUNK log error\n");
 			break;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index da78e499947a..e4bbdf2acc09 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -531,7 +531,7 @@ static int nvme_read_ana_log(struct nvme_ctrl *ctrl)
 	int error;
 
 	mutex_lock(&ctrl->ana_lock);
-	error = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_ANA, 0,
+	error = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_ANA, 0, NVME_CSI_NVM,
 			ctrl->ana_log_buf, ctrl->ana_log_size, 0);
 	if (error) {
 		dev_warn(ctrl->device, "Failed to get ANA log: %d\n", error);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a84f71459caa..4a1982133e9a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -191,6 +191,12 @@ struct nvme_fault_inject {
 #endif
 };
 
+struct nvme_cel {
+	struct list_head	entry;
+	struct nvme_effects_log	log;
+	u8			csi;
+};
+
 struct nvme_ctrl {
 	bool comp_seen;
 	enum nvme_ctrl_state state;
@@ -257,6 +263,7 @@ struct nvme_ctrl {
 	unsigned long quirks;
 	struct nvme_id_power_state psd[32];
 	struct nvme_effects_log *effects;
+	struct list_head cels;
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
@@ -359,6 +366,7 @@ struct nvme_ns_head {
 	struct kref		ref;
 	bool			shared;
 	int			instance;
+	struct nvme_effects_log *effects;
 #ifdef CONFIG_NVME_MULTIPATH
 	struct gendisk		*disk;
 	struct bio_list		requeue_list;
@@ -557,7 +565,7 @@ int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
 int nvme_try_sched_reset(struct nvme_ctrl *ctrl);
 int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
 
-int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
+int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi,
 		void *log, size_t size, u64 offset);
 
 extern const struct attribute_group *nvme_ns_id_attr_groups[];
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 81ffe5247505..95cd03e240a1 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -1101,7 +1101,9 @@ struct nvme_get_log_page_command {
 		};
 		__le64 lpo;
 	};
-	__u32			rsvd14[2];
+	__u8			rsvd14[3];
+	__u8			csi;
+	__u32			rsvd15;
 };
 
 struct nvme_directive_cmd {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCHv3 5/5] nvme: support for zoned namespaces
  2020-06-22 16:25 [PATCHv3 0/5] nvme support for zoned namespace command set Keith Busch
                   ` (3 preceding siblings ...)
  2020-06-22 16:25 ` [PATCHv3 4/5] nvme: support for multi-command set effects Keith Busch
@ 2020-06-22 16:25 ` Keith Busch
  2020-06-22 16:48   ` Johannes Thumshirn
                     ` (4 more replies)
  4 siblings, 5 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-22 16:25 UTC (permalink / raw)
  To: linux-nvme, hch, sagi, linux-block, axboe
  Cc: Keith Busch, Martin K . Petersen, Hans Holmberg, Dmitry Fomichev,
	Ajay Joshi, Aravind Ramesh, Niklas Cassel, Matias Bjørling,
	Damien Le Moal

From: Keith Busch <keith.busch@wdc.com>

Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined
in NVM Express TP4053. Zoned namespaces are discovered based on their
Command Set Identifier reported in the namespaces Namespace
Identification Descriptor list. A successfully discovered Zoned
Namespace will be registered with the block layer as a host managed
zoned block device with Zone Append command support. A namespace that
does not support append is not supported by the driver.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Keith Busch <keith.busch@wdc.com>
---
 block/Kconfig              |   5 +-
 drivers/nvme/host/Makefile |   1 +
 drivers/nvme/host/core.c   |  91 ++++++++++++--
 drivers/nvme/host/nvme.h   |  39 ++++++
 drivers/nvme/host/zns.c    | 245 +++++++++++++++++++++++++++++++++++++
 include/linux/nvme.h       | 114 ++++++++++++++++-
 6 files changed, 480 insertions(+), 15 deletions(-)
 create mode 100644 drivers/nvme/host/zns.c

diff --git a/block/Kconfig b/block/Kconfig
index 9357d7302398..bbad5e8bbffe 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -86,9 +86,10 @@ config BLK_DEV_ZONED
 	select MQ_IOSCHED_DEADLINE
 	help
 	Block layer zoned block device support. This option enables
-	support for ZAC/ZBC host-managed and host-aware zoned block devices.
+	support for ZAC/ZBC/ZNS host-managed and host-aware zoned block
+	devices.
 
-	Say yes here if you have a ZAC or ZBC storage device.
+	Say yes here if you have a ZAC, ZBC, or ZNS storage device.
 
 config BLK_DEV_THROTTLING
 	bool "Block layer bio throttling support"
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index fc7b26be692d..d7f6a87687b8 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -13,6 +13,7 @@ nvme-core-y				:= core.o
 nvme-core-$(CONFIG_TRACING)		+= trace.o
 nvme-core-$(CONFIG_NVME_MULTIPATH)	+= multipath.o
 nvme-core-$(CONFIG_NVM)			+= lightnvm.o
+nvme-core-$(CONFIG_BLK_DEV_ZONED)	+= zns.o
 nvme-core-$(CONFIG_FAULT_INJECTION_DEBUG_FS)	+= fault_inject.o
 nvme-core-$(CONFIG_NVME_HWMON)		+= hwmon.o
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 64b7b9fc2817..885db561de17 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -89,7 +89,7 @@ static dev_t nvme_chr_devt;
 static struct class *nvme_class;
 static struct class *nvme_subsys_class;
 
-static int nvme_revalidate_disk(struct gendisk *disk);
+static int _nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 					   unsigned nsid);
@@ -287,6 +287,10 @@ void nvme_complete_rq(struct request *req)
 			nvme_retry_req(req);
 			return;
 		}
+	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
+		   req_op(req) == REQ_OP_ZONE_APPEND) {
+		req->__sector = nvme_lba_to_sect(req->q->queuedata,
+			le64_to_cpu(nvme_req(req)->result.u64));
 	}
 
 	nvme_trace_bio_complete(req, status);
@@ -673,7 +677,8 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 }
 
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
-		struct request *req, struct nvme_command *cmnd)
+		struct request *req, struct nvme_command *cmnd,
+		enum nvme_opcode op)
 {
 	struct nvme_ctrl *ctrl = ns->ctrl;
 	u16 control = 0;
@@ -687,7 +692,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
-	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
+	cmnd->rw.opcode = op;
 	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
 	cmnd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
@@ -716,6 +721,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		case NVME_NS_DPS_PI_TYPE2:
 			control |= NVME_RW_PRINFO_PRCHK_GUARD |
 					NVME_RW_PRINFO_PRCHK_REF;
+			if (op == nvme_cmd_zone_append)
+				control |= NVME_RW_APPEND_PIREMAP;
 			cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
 			break;
 		}
@@ -756,6 +763,19 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 	case REQ_OP_FLUSH:
 		nvme_setup_flush(ns, cmd);
 		break;
+	case REQ_OP_ZONE_RESET_ALL:
+	case REQ_OP_ZONE_RESET:
+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_RESET);
+		break;
+	case REQ_OP_ZONE_OPEN:
+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OPEN);
+		break;
+	case REQ_OP_ZONE_CLOSE:
+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_CLOSE);
+		break;
+	case REQ_OP_ZONE_FINISH:
+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
+		break;
 	case REQ_OP_WRITE_ZEROES:
 		ret = nvme_setup_write_zeroes(ns, req, cmd);
 		break;
@@ -763,8 +783,13 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 		ret = nvme_setup_discard(ns, req, cmd);
 		break;
 	case REQ_OP_READ:
+		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
+		break;
 	case REQ_OP_WRITE:
-		ret = nvme_setup_rw(ns, req, cmd);
+		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+		break;
+	case REQ_OP_ZONE_APPEND:
+		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -1391,14 +1416,23 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	return effects;
 }
 
-static void nvme_update_formats(struct nvme_ctrl *ctrl)
+static void nvme_update_formats(struct nvme_ctrl *ctrl, u32 *effects)
 {
 	struct nvme_ns *ns;
 
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		if (ns->disk && nvme_revalidate_disk(ns->disk))
+		if (ns->disk && _nvme_revalidate_disk(ns->disk))
 			nvme_set_queue_dying(ns);
+		else if (blk_queue_is_zoned(ns->disk->queue)) {
+			/*
+			 * IO commands are required to fully revalidate a zoned
+			 * device. Force the command effects to trigger rescan
+			 * work so report zones can run in a context with
+			 * unfrozen IO queues.
+			 */
+			*effects |= NVME_CMD_EFFECTS_NCC;
+		}
 	up_read(&ctrl->namespaces_rwsem);
 }
 
@@ -1410,7 +1444,7 @@ static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
 	 * this command.
 	 */
 	if (effects & NVME_CMD_EFFECTS_LBCC)
-		nvme_update_formats(ctrl);
+		nvme_update_formats(ctrl, &effects);
 	if (effects & (NVME_CMD_EFFECTS_LBCC | NVME_CMD_EFFECTS_CSE_MASK)) {
 		nvme_unfreeze(ctrl);
 		nvme_mpath_unfreeze(ctrl->subsys);
@@ -1525,7 +1559,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
  * Issue ioctl requests on the first available path.  Note that unlike normal
  * block layer requests we will not retry failed request on another controller.
  */
-static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
+struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
 		struct nvme_ns_head **head, int *srcu_idx)
 {
 #ifdef CONFIG_NVME_MULTIPATH
@@ -1545,7 +1579,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
 	return disk->private_data;
 }
 
-static void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
+void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
 {
 	if (head)
 		srcu_read_unlock(&head->srcu, idx);
@@ -1938,21 +1972,28 @@ static void nvme_update_disk_info(struct gendisk *disk,
 
 static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 {
+	unsigned lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
 	struct nvme_ns *ns = disk->private_data;
 	struct nvme_ctrl *ctrl = ns->ctrl;
+	int ret;
 	u32 iob;
 
 	/*
 	 * If identify namespace failed, use default 512 byte block size so
 	 * block layer can use before failing read/write for 0 capacity.
 	 */
-	ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds;
+	ns->lba_shift = id->lbaf[lbaf].ds;
 	if (ns->lba_shift == 0)
 		ns->lba_shift = 9;
 
 	switch (ns->head->ids.csi) {
 	case NVME_CSI_NVM:
 		break;
+	case NVME_CSI_ZNS:
+		ret = nvme_update_zone_info(disk, ns, lbaf);
+		if (ret)
+			return ret;
+		break;
 	default:
 		dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
 			ns->head->ids.csi, ns->head->ns_id);
@@ -1966,7 +2007,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 		iob = nvme_lba_to_sect(ns, le16_to_cpu(id->noiob));
 
 	ns->features = 0;
-	ns->ms = le16_to_cpu(id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ms);
+	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
 	/* the PI implementation requires metadata equal t10 pi tuple size */
 	if (ns->ms == sizeof(struct t10_pi_tuple))
 		ns->pi_type = id->dps & NVME_NS_DPS_PI_MASK;
@@ -2009,7 +2050,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	return 0;
 }
 
-static int nvme_revalidate_disk(struct gendisk *disk)
+static int _nvme_revalidate_disk(struct gendisk *disk)
 {
 	struct nvme_ns *ns = disk->private_data;
 	struct nvme_ctrl *ctrl = ns->ctrl;
@@ -2057,6 +2098,28 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	return ret;
 }
 
+static int nvme_revalidate_disk(struct gendisk *disk)
+{
+	int ret;
+
+	ret = _nvme_revalidate_disk(disk);
+	if (ret)
+		return ret;
+
+#ifdef CONFIG_BLK_DEV_ZONED
+	if (blk_queue_is_zoned(disk->queue)) {
+		struct nvme_ns *ns = disk->private_data;
+		struct nvme_ctrl *ctrl = ns->ctrl;
+
+		ret = blk_revalidate_disk_zones(disk, NULL);
+		if (!ret)
+			blk_queue_max_zone_append_sectors(disk->queue,
+							  ctrl->max_zone_append);
+	}
+#endif
+	return ret;
+}
+
 static char nvme_pr_type(enum pr_type type)
 {
 	switch (type) {
@@ -2187,6 +2250,7 @@ static const struct block_device_operations nvme_fops = {
 	.release	= nvme_release,
 	.getgeo		= nvme_getgeo,
 	.revalidate_disk= nvme_revalidate_disk,
+	.report_zones	= nvme_report_zones,
 	.pr_ops		= &nvme_pr_ops,
 };
 
@@ -2212,6 +2276,7 @@ const struct block_device_operations nvme_ns_head_ops = {
 	.ioctl		= nvme_ioctl,
 	.compat_ioctl	= nvme_compat_ioctl,
 	.getgeo		= nvme_getgeo,
+	.report_zones	= nvme_report_zones,
 	.pr_ops		= &nvme_pr_ops,
 };
 #endif /* CONFIG_NVME_MULTIPATH */
@@ -4439,6 +4504,8 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_command) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE);
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE);
+	BUILD_BUG_ON(sizeof(struct nvme_id_ns_zns) != NVME_IDENTIFY_DATA_SIZE);
+	BUILD_BUG_ON(sizeof(struct nvme_id_ctrl_zns) != NVME_IDENTIFY_DATA_SIZE);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
 	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 4a1982133e9a..ecf443efdf91 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -238,6 +238,9 @@ struct nvme_ctrl {
 	u32 max_hw_sectors;
 	u32 max_segments;
 	u32 max_integrity_segments;
+#ifdef CONFIG_BLK_DEV_ZONED
+	u32 max_zone_append;
+#endif
 	u16 crdt[3];
 	u16 oncs;
 	u16 oacs;
@@ -402,6 +405,9 @@ struct nvme_ns {
 	u16 sgs;
 	u32 sws;
 	u8 pi_type;
+#ifdef CONFIG_BLK_DEV_ZONED
+	u64 zsze;
+#endif
 	unsigned long features;
 	unsigned long flags;
 #define NVME_NS_REMOVING	0
@@ -567,6 +573,9 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
 
 int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi,
 		void *log, size_t size, u64 offset);
+struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
+		struct nvme_ns_head **head, int *srcu_idx);
+void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx);
 
 extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct block_device_operations nvme_ns_head_ops;
@@ -688,6 +697,36 @@ static inline void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
 }
 #endif /* CONFIG_NVME_MULTIPATH */
 
+#ifdef CONFIG_BLK_DEV_ZONED
+int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
+			  unsigned lbaf);
+
+int nvme_report_zones(struct gendisk *disk, sector_t sector,
+		      unsigned int nr_zones, report_zones_cb cb, void *data);
+
+blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
+				       struct nvme_command *cmnd,
+				       enum nvme_zone_mgmt_action action);
+#else
+#define nvme_report_zones NULL
+
+static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
+		struct request *req, struct nvme_command *cmnd,
+		enum nvme_zone_mgmt_action action)
+{
+	return BLK_STS_NOTSUPP;
+}
+
+static inline int nvme_update_zone_info(struct gendisk *disk,
+					struct nvme_ns *ns,
+					unsigned lbaf)
+{
+	dev_warn(ns->ctrl->device,
+		 "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
+	return -EPROTONOSUPPORT;
+}
+#endif
+
 #ifdef CONFIG_NVM
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node);
 void nvme_nvm_unregister(struct nvme_ns *ns);
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
new file mode 100644
index 000000000000..ee6f49a8aee4
--- /dev/null
+++ b/drivers/nvme/host/zns.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Western Digital Corporation or its affiliates.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/vmalloc.h>
+#include "nvme.h"
+
+static int nvme_set_max_append(struct nvme_ctrl *ctrl)
+{
+	struct nvme_command c = { };
+	struct nvme_id_ctrl_zns *id;
+	int status;
+
+	id = kzalloc(sizeof(*id), GFP_KERNEL);
+	if (!id)
+		return -ENOMEM;
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.cns = NVME_ID_CNS_CS_CTRL;
+	c.identify.csi = NVME_CSI_ZNS;
+
+	status = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id));
+	if (status) {
+		kfree(id);
+		return status;
+	}
+
+	ctrl->max_zone_append = 1 << (id->zamds + 3);
+	kfree(id);
+	return 0;
+}
+
+int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
+			  unsigned lbaf)
+{
+	struct nvme_effects_log *log = ns->head->effects;
+	struct request_queue *q = disk->queue;
+	struct nvme_command c = { };
+	struct nvme_id_ns_zns *id;
+	int status;
+
+	/* Driver requires zone append support */
+	if (!(log->iocs[nvme_cmd_zone_append] & NVME_CMD_EFFECTS_CSUPP)) {
+		dev_warn(ns->ctrl->device,
+			"append not supported for zoned namespace:%d\n",
+			ns->head->ns_id);
+		return -ENODEV;
+	}
+
+	/* Lazily query controller append limit for the first zoned namespace */
+	if (!ns->ctrl->max_zone_append) {
+		status = nvme_set_max_append(ns->ctrl);
+		if (status)
+			return status;
+	}
+
+	id = kzalloc(sizeof(*id), GFP_KERNEL);
+	if (!id)
+		return -ENOMEM;
+
+	c.identify.opcode = nvme_admin_identify;
+	c.identify.nsid = cpu_to_le32(ns->head->ns_id);
+	c.identify.cns = NVME_ID_CNS_CS_NS;
+	c.identify.csi = NVME_CSI_ZNS;
+
+	status = nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, id, sizeof(*id));
+	if (status)
+		goto free_data;
+
+	/*
+	 * We currently do not handle devices requiring any of the zoned
+	 * operation characteristics.
+	 */
+	if (id->zoc) {
+		dev_warn(ns->ctrl->device,
+			"zone operations:%x not supported for namespace:%d\n",
+			le_to_cpu16(id->zoc), ns->head->ns_id);
+		status = -EINVAL;
+		goto free_data;
+	}
+
+	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
+	if (!ns->zsze) {
+		status = -EINVAL;
+		goto free_data;
+	}
+
+	q->limits.zoned = BLK_ZONED_HM;
+	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
+free_data:
+	kfree(id);
+	return status;
+}
+
+static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
+					  unsigned int nr_zones, size_t *buflen)
+{
+	struct request_queue *q = ns->disk->queue;
+	size_t bufsize;
+	void *buf;
+
+	const size_t min_bufsize = sizeof(struct nvme_zone_report) +
+				   sizeof(struct nvme_zone_descriptor);
+
+	nr_zones = min_t(unsigned int, nr_zones,
+			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+
+	bufsize = sizeof(struct nvme_zone_report) +
+		nr_zones * sizeof(struct nvme_zone_descriptor);
+	bufsize = min_t(size_t, bufsize,
+			queue_max_hw_sectors(q) << SECTOR_SHIFT);
+	bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);
+
+	while (bufsize >= min_bufsize) {
+		buf = __vmalloc(bufsize,
+				GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY);
+		if (buf) {
+			*buflen = bufsize;
+			return buf;
+		}
+		bufsize >>= 1;
+	}
+	return NULL;
+}
+
+static int __nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
+				  struct nvme_zone_report *report,
+				  size_t buflen)
+{
+	struct nvme_command c = { };
+	int ret;
+
+	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
+	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
+	c.zmr.slba = cpu_to_le64(nvme_sect_to_lba(ns, sector));
+	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
+	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
+	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
+	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
+
+	ret = nvme_submit_sync_cmd(ns->queue, &c, report, buflen);
+	if (ret)
+		return ret;
+
+	return le64_to_cpu(report->nr_zones);
+}
+
+static int nvme_zone_parse_entry(struct nvme_ns *ns,
+				 struct nvme_zone_descriptor *entry,
+				 unsigned int idx, report_zones_cb cb,
+				 void *data)
+{
+	struct blk_zone zone = { };
+
+	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
+		dev_err(ns->ctrl->device, "invalid zone type %#x\n",
+				entry->zt);
+		return -EINVAL;
+	}
+
+	zone.type = BLK_ZONE_TYPE_SEQWRITE_REQ;
+	zone.cond = entry->zs >> 4;
+	zone.len = ns->zsze;
+	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
+	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
+	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
+
+	return cb(&zone, idx, data);
+}
+
+static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
+			unsigned int nr_zones, report_zones_cb cb, void *data)
+{
+	struct nvme_zone_report *report;
+	int ret, zone_idx = 0;
+	unsigned int nz, i;
+	size_t buflen;
+
+	report = nvme_zns_alloc_report_buffer(ns, nr_zones, &buflen);
+	if (!report)
+		return -ENOMEM;
+
+	sector &= ~(ns->zsze - 1);
+	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
+		memset(report, 0, buflen);
+		ret = __nvme_ns_report_zones(ns, sector, report, buflen);
+		if (ret < 0)
+			goto out_free;
+
+		nz = min_t(unsigned int, ret, nr_zones);
+		if (!nz)
+			break;
+
+		for (i = 0; i < nz && zone_idx < nr_zones; i++) {
+			ret = nvme_zone_parse_entry(ns, &report->entries[i],
+						    zone_idx, cb, data);
+			if (ret)
+				goto out_free;
+			zone_idx++;
+		}
+
+		sector += ns->zsze * nz;
+	}
+
+	ret = zone_idx;
+out_free:
+	kvfree(report);
+	return ret;
+}
+
+int nvme_report_zones(struct gendisk *disk, sector_t sector,
+		      unsigned int nr_zones, report_zones_cb cb, void *data)
+{
+	struct nvme_ns_head *head = NULL;
+	struct nvme_ns *ns;
+	int srcu_idx, ret;
+
+	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
+	if (unlikely(!ns))
+		return -EWOULDBLOCK;
+
+	if (ns->head->ids.csi == NVME_CSI_ZNS)
+		ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
+	else
+		ret = -EINVAL;
+	nvme_put_ns_from_disk(head, srcu_idx);
+
+	return ret;
+}
+
+blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
+		struct nvme_command *c, enum nvme_zone_mgmt_action action)
+{
+	c->zms.opcode = nvme_cmd_zone_mgmt_send;
+	c->zms.nsid = cpu_to_le32(ns->head->ns_id);
+	c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
+	c->zms.zsa = action;
+
+	if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
+		c->zms.select_all = 1;
+
+	return BLK_STS_OK;
+}
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 95cd03e240a1..d862e5d70818 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
- * Definitions for the NVM Express interface
+ * Definitions for the NVM Ex
+ * ress interface
  * Copyright (c) 2011-2014, Intel Corporation.
  */
 
@@ -374,6 +375,30 @@ struct nvme_id_ns {
 	__u8			vs[3712];
 };
 
+struct nvme_zns_lbafe {
+	__le64			zsze;
+	__u8			zdes;
+	__u8			rsvd9[7];
+};
+
+struct nvme_id_ns_zns {
+	__le16			zoc;
+	__le16			ozcs;
+	__le32			mar;
+	__le32			mor;
+	__le32			rrl;
+	__le32			frl;
+	__u8			rsvd20[2796];
+	struct nvme_zns_lbafe	lbafe[16];
+	__u8			rsvd3072[768];
+	__u8			vs[256];
+};
+
+struct nvme_id_ctrl_zns {
+	__u8	zamds;
+	__u8	rsvd1[4095];
+};
+
 enum {
 	NVME_ID_CNS_NS			= 0x00,
 	NVME_ID_CNS_CTRL		= 0x01,
@@ -392,6 +417,7 @@ enum {
 
 enum {
 	NVME_CSI_NVM			= 0,
+	NVME_CSI_ZNS			= 2,
 };
 
 enum {
@@ -532,6 +558,27 @@ struct nvme_ana_rsp_hdr {
 	__le16	rsvd10[3];
 };
 
+struct nvme_zone_descriptor {
+	__u8		zt;
+	__u8		zs;
+	__u8		za;
+	__u8		rsvd3[5];
+	__le64		zcap;
+	__le64		zslba;
+	__le64		wp;
+	__u8		rsvd32[32];
+};
+
+enum {
+	NVME_ZONE_TYPE_SEQWRITE_REQ	= 0x2,
+};
+
+struct nvme_zone_report {
+	__le64		nr_zones;
+	__u8		resv8[56];
+	struct nvme_zone_descriptor entries[];
+};
+
 enum {
 	NVME_SMART_CRIT_SPARE		= 1 << 0,
 	NVME_SMART_CRIT_TEMPERATURE	= 1 << 1,
@@ -626,6 +673,9 @@ enum nvme_opcode {
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
 	nvme_cmd_resv_release	= 0x15,
+	nvme_cmd_zone_mgmt_send	= 0x79,
+	nvme_cmd_zone_mgmt_recv	= 0x7a,
+	nvme_cmd_zone_append	= 0x7d,
 };
 
 #define nvme_opcode_name(opcode)	{ opcode, #opcode }
@@ -764,6 +814,7 @@ struct nvme_rw_command {
 enum {
 	NVME_RW_LR			= 1 << 15,
 	NVME_RW_FUA			= 1 << 14,
+	NVME_RW_APPEND_PIREMAP		= 1 << 9,
 	NVME_RW_DSM_FREQ_UNSPEC		= 0,
 	NVME_RW_DSM_FREQ_TYPICAL	= 1,
 	NVME_RW_DSM_FREQ_RARE		= 2,
@@ -829,6 +880,53 @@ struct nvme_write_zeroes_cmd {
 	__le16			appmask;
 };
 
+enum nvme_zone_mgmt_action {
+	NVME_ZONE_CLOSE		= 0x1,
+	NVME_ZONE_FINISH	= 0x2,
+	NVME_ZONE_OPEN		= 0x3,
+	NVME_ZONE_RESET		= 0x4,
+	NVME_ZONE_OFFLINE	= 0x5,
+	NVME_ZONE_SET_DESC_EXT	= 0x10,
+};
+
+struct nvme_zone_mgmt_send_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le32			cdw2[2];
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__le64			slba;
+	__le32			cdw12;
+	__u8			zsa;
+	__u8			select_all;
+	__u8			rsvd13[2];
+	__le32			cdw14[2];
+};
+
+struct nvme_zone_mgmt_recv_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__le64			slba;
+	__le32			numd;
+	__u8			zra;
+	__u8			zrasf;
+	__u8			pr;
+	__u8			rsvd13;
+	__le32			cdw14[2];
+};
+
+enum {
+	NVME_ZRA_ZONE_REPORT		= 0,
+	NVME_ZRASF_ZONE_REPORT_ALL	= 0,
+	NVME_REPORT_ZONE_PARTIAL	= 1,
+};
+
 /* Features */
 
 enum {
@@ -1300,6 +1398,8 @@ struct nvme_command {
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
 		struct nvme_write_zeroes_cmd write_zeroes;
+		struct nvme_zone_mgmt_send_cmd zms;
+		struct nvme_zone_mgmt_recv_cmd zmr;
 		struct nvme_abort_cmd abort;
 		struct nvme_get_log_page_command get_log_page;
 		struct nvmf_common_command fabrics;
@@ -1433,6 +1533,18 @@ enum {
 	NVME_SC_DISCOVERY_RESTART	= 0x190,
 	NVME_SC_AUTH_REQUIRED		= 0x191,
 
+	/*
+	 * I/O Command Set Specific - Zoned commands:
+	 */
+	NVME_SC_ZONE_BOUNDARY_ERROR	= 0x1b8,
+	NVME_SC_ZONE_FULL		= 0x1b9,
+	NVME_SC_ZONE_READ_ONLY		= 0x1ba,
+	NVME_SC_ZONE_OFFLINE		= 0x1bb,
+	NVME_SC_ZONE_INVALID_WRITE	= 0x1bc,
+	NVME_SC_ZONE_TOO_MANY_ACTIVE	= 0x1bd,
+	NVME_SC_ZONE_TOO_MANY_OPEN	= 0x1be,
+	NVME_SC_ZONE_INVALID_TRANSITION	= 0x1bf,
+
 	/*
 	 * Media and Data Integrity Errors:
 	 */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 5/5] nvme: support for zoned namespaces
  2020-06-22 16:25 ` [PATCHv3 5/5] nvme: support for zoned namespaces Keith Busch
@ 2020-06-22 16:48   ` Johannes Thumshirn
  2020-06-23  6:23   ` Hannes Reinecke
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2020-06-22 16:48 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, sagi, linux-block, axboe
  Cc: Keith Busch, Martin K . Petersen, Hans Holmberg, Dmitry Fomichev,
	Ajay Joshi, Aravind Ramesh, Niklas Cassel, Matias Bjorling,
	Damien Le Moal

On 22/06/2020 18:25, Keith Busch wrote:
[...]
> +static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
> +					  unsigned int nr_zones, size_t *buflen)
> +{
> +	struct request_queue *q = ns->disk->queue;
> +	size_t bufsize;
> +	void *buf;
> +
> +	const size_t min_bufsize = sizeof(struct nvme_zone_report) +
> +				   sizeof(struct nvme_zone_descriptor);
> +
> +	nr_zones = min_t(unsigned int, nr_zones,
> +			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +
> +	bufsize = sizeof(struct nvme_zone_report) +
> +		nr_zones * sizeof(struct nvme_zone_descriptor);
> +	bufsize = min_t(size_t, bufsize,
> +			queue_max_hw_sectors(q) << SECTOR_SHIFT);
> +	bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);
> +
> +	while (bufsize >= min_bufsize) {
> +		buf = __vmalloc(bufsize,
> +				GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY);


Nit: The __GFP_ZERO flag isn't needed as nvme_ns_report_zones() 
calls memset() on the report buffer before submitting it.

[...]

> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 95cd03e240a1..d862e5d70818 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -1,6 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  /*
> - * Definitions for the NVM Express interface
> + * Definitions for the NVM Ex
> + * ress interface
Accidental line break.

Otherwise:
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 1/5] block: add capacity field to zone descriptors
  2020-06-22 16:25 ` [PATCHv3 1/5] block: add capacity field to zone descriptors Keith Busch
@ 2020-06-23  6:15   ` Hannes Reinecke
  2020-06-23  8:44   ` Sagi Grimberg
  2020-06-26 12:17   ` Jens Axboe
  2 siblings, 0 replies; 38+ messages in thread
From: Hannes Reinecke @ 2020-06-23  6:15 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, sagi, linux-block, axboe
  Cc: Matias Bjørling, Chaitanya Kulkarni, Javier González,
	Daniel Wagner, Johannes Thumshirn, Martin K . Petersen

On 6/22/20 6:25 PM, Keith Busch wrote:
> From: Matias Bjørling <matias.bjorling@wdc.com>
> 
> In the zoned storage model, the sectors within a zone are typically all
> writeable. With the introduction of the Zoned Namespace (ZNS) Command
> Set in the NVM Express organization, the model was extended to have a
> specific writeable capacity.
> 
> Extend the zone descriptor data structure with a zone capacity field to
> indicate to the user how many sectors in a zone are writeable.
> 
> Introduce backward compatibility in the zone report ioctl by extending
> the zone report header data structure with a flags field to indicate if
> the capacity field is available.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Reviewed-by: Javier González <javier.gonz@samsung.com>
> Reviewed-by: Daniel Wagner <dwagner@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
> ---
>   block/blk-zoned.c              |  1 +
>   drivers/block/null_blk_zoned.c |  2 ++
>   drivers/scsi/sd_zbc.c          |  1 +
>   include/uapi/linux/blkzoned.h  | 15 +++++++++++++--
>   4 files changed, 17 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 2/5] null_blk: introduce zone capacity for zoned device
  2020-06-22 16:25 ` [PATCHv3 2/5] null_blk: introduce zone capacity for zoned device Keith Busch
@ 2020-06-23  6:16   ` Hannes Reinecke
  2020-06-23  8:45   ` Sagi Grimberg
  1 sibling, 0 replies; 38+ messages in thread
From: Hannes Reinecke @ 2020-06-23  6:16 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, sagi, linux-block, axboe
  Cc: Aravind Ramesh, Chaitanya Kulkarni, Matias Bjørling,
	Daniel Wagner, Martin K . Petersen, Johannes Thumshirn

On 6/22/20 6:25 PM, Keith Busch wrote:
> From: Aravind Ramesh <aravind.ramesh@wdc.com>
> 
> Allow emulation of a zoned device with a per zone capacity smaller than
> the zone size as as defined in the Zoned Namespace (ZNS) Command Set
> specification. The zone capacity defaults to the zone size if not
> specified and must be smaller than the zone size otherwise.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
> Reviewed-by: Daniel Wagner <dwagner@suse.de>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> ---
>   drivers/block/null_blk.h       |  1 +
>   drivers/block/null_blk_main.c  | 10 +++++++++-
>   drivers/block/null_blk_zoned.c | 16 ++++++++++++++--
>   3 files changed, 24 insertions(+), 3 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-22 16:25 ` [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support Keith Busch
@ 2020-06-23  6:20   ` Hannes Reinecke
  2020-06-23  9:20     ` Niklas Cassel
  2020-06-23  8:53   ` Sagi Grimberg
  1 sibling, 1 reply; 38+ messages in thread
From: Hannes Reinecke @ 2020-06-23  6:20 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, sagi, linux-block, axboe
  Cc: Niklas Cassel, Javier González, Martin K . Petersen,
	Johannes Thumshirn, Matias Bjørling, Daniel Wagner

On 6/22/20 6:25 PM, Keith Busch wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Implements support for the I/O Command Sets command set. The command set
> introduces a method to enumerate multiple command sets per namespace. If
> the command set is exposed, this method for enumeration will be used
> instead of the traditional method that uses the CC.CSS register command
> set register for command set identification.
> 
> For namespaces where the Command Set Identifier is not supported or
> recognized, the specific namespace will not be created.
> 
> Reviewed-by: Javier González <javier.gonz@samsung.com>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
> Reviewed-by: Daniel Wagner <dwagner@suse.de>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/nvme/host/core.c | 48 +++++++++++++++++++++++++++++++++-------
>   drivers/nvme/host/nvme.h |  1 +
>   include/linux/nvme.h     | 19 ++++++++++++++--
>   3 files changed, 58 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 9491dbcfe81a..45a3cb5a35bd 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1056,8 +1056,13 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
>   	return error;
>   }
>   
> +static bool nvme_multi_css(struct nvme_ctrl *ctrl)
> +{
> +	return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
> +}
> +
>   static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
> -		struct nvme_ns_id_desc *cur)
> +		struct nvme_ns_id_desc *cur, bool *csi_seen)
>   {
>   	const char *warn_str = "ctrl returned bogus length:";
>   	void *data = cur;
> @@ -1087,6 +1092,15 @@ static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
>   		}
>   		uuid_copy(&ids->uuid, data + sizeof(*cur));
>   		return NVME_NIDT_UUID_LEN;
> +	case NVME_NIDT_CSI:
> +		if (cur->nidl != NVME_NIDT_CSI_LEN) {
> +			dev_warn(ctrl->device, "%s %d for NVME_NIDT_CSI\n",
> +				 warn_str, cur->nidl);
> +			return -1;
> +		}
> +		memcpy(&ids->csi, data + sizeof(*cur), NVME_NIDT_CSI_LEN);
> +		*csi_seen = true;
> +		return NVME_NIDT_CSI_LEN;
>   	default:
>   		/* Skip unknown types */
>   		return cur->nidl;
> @@ -1097,10 +1111,9 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
>   		struct nvme_ns_ids *ids)
>   {
>   	struct nvme_command c = { };
> -	int status;
> +	bool csi_seen = false;
> +	int status, pos, len;
>   	void *data;
> -	int pos;
> -	int len;
>   
>   	c.identify.opcode = nvme_admin_identify;
>   	c.identify.nsid = cpu_to_le32(nsid);
> @@ -1130,13 +1143,19 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
>   		if (cur->nidl == 0)
>   			break;
>   
> -		len = nvme_process_ns_desc(ctrl, ids, cur);
> +		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
>   		if (len < 0)
>   			goto free_data;
>   
>   		len += sizeof(*cur);
>   	}
>   free_data:
> +	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
> +		dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
> +			 nsid);
> +		status = -EINVAL;
> +	}
> +
>   	kfree(data);
>   	return status;
>   }
> @@ -1792,7 +1811,7 @@ static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
>   		memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
>   	if (ctrl->vs >= NVME_VS(1, 2, 0))
>   		memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
> -	if (ctrl->vs >= NVME_VS(1, 3, 0))
> +	if (ctrl->vs >= NVME_VS(1, 3, 0) || nvme_multi_css(ctrl))
>   		return nvme_identify_ns_descs(ctrl, nsid, ids);
>   	return 0;
>   }

Hmm? Are command sets even defined for something earlier than 1.3?

> @@ -1808,7 +1827,8 @@ static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
>   {
>   	return uuid_equal(&a->uuid, &b->uuid) &&
>   		memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
> -		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
> +		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0 &&
> +		a->csi == b->csi;
>   }
>   
>   static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> @@ -1930,6 +1950,15 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
>   	if (ns->lba_shift == 0)
>   		ns->lba_shift = 9;
>   
> +	switch (ns->head->ids.csi) {
> +	case NVME_CSI_NVM:
> +		break;
> +	default:
> +		dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
> +			ns->head->ids.csi, ns->head->ns_id);
> +		return -ENODEV;
> +	}
> +
>   	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) &&
>   	    is_power_of_2(ctrl->max_hw_sectors))
>   		iob = ctrl->max_hw_sectors;
> @@ -2264,7 +2293,10 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl)
>   
>   	ctrl->page_size = 1 << page_shift;
>   
> -	ctrl->ctrl_config = NVME_CC_CSS_NVM;
> +	if (NVME_CAP_CSS(ctrl->cap) & NVME_CAP_CSS_CSI)
> +		ctrl->ctrl_config = NVME_CC_CSS_CSI;
> +	else
> +		ctrl->ctrl_config = NVME_CC_CSS_NVM;
>   	ctrl->ctrl_config |= (page_shift - 12) << NVME_CC_MPS_SHIFT;
>   	ctrl->ctrl_config |= NVME_CC_AMS_RR | NVME_CC_SHN_NONE;
>   	ctrl->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index c0f4226d3299..a84f71459caa 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -339,6 +339,7 @@ struct nvme_ns_ids {
>   	u8	eui64[8];
>   	u8	nguid[16];
>   	uuid_t	uuid;
> +	u8	csi;
>   };
>   
>   /*
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 5ce51ab4c50e..81ffe5247505 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -132,6 +132,7 @@ enum {
>   #define NVME_CAP_TIMEOUT(cap)	(((cap) >> 24) & 0xff)
>   #define NVME_CAP_STRIDE(cap)	(((cap) >> 32) & 0xf)
>   #define NVME_CAP_NSSRC(cap)	(((cap) >> 36) & 0x1)
> +#define NVME_CAP_CSS(cap)	(((cap) >> 37) & 0xff)
>   #define NVME_CAP_MPSMIN(cap)	(((cap) >> 48) & 0xf)
>   #define NVME_CAP_MPSMAX(cap)	(((cap) >> 52) & 0xf)
>   

Indentation?

> @@ -162,7 +163,6 @@ enum {
>   
>   enum {
>   	NVME_CC_ENABLE		= 1 << 0,
> -	NVME_CC_CSS_NVM		= 0 << 4,
>   	NVME_CC_EN_SHIFT	= 0,
>   	NVME_CC_CSS_SHIFT	= 4,
>   	NVME_CC_MPS_SHIFT	= 7,
> @@ -170,6 +170,9 @@ enum {
>   	NVME_CC_SHN_SHIFT	= 14,
>   	NVME_CC_IOSQES_SHIFT	= 16,
>   	NVME_CC_IOCQES_SHIFT	= 20,
> +	NVME_CC_CSS_NVM		= 0 << NVME_CC_CSS_SHIFT,
> +	NVME_CC_CSS_CSI		= 6 << NVME_CC_CSS_SHIFT,
> +	NVME_CC_CSS_MASK	= 7 << NVME_CC_CSS_SHIFT,
>   	NVME_CC_AMS_RR		= 0 << NVME_CC_AMS_SHIFT,
>   	NVME_CC_AMS_WRRU	= 1 << NVME_CC_AMS_SHIFT,
>   	NVME_CC_AMS_VS		= 7 << NVME_CC_AMS_SHIFT,
> @@ -179,6 +182,8 @@ enum {
>   	NVME_CC_SHN_MASK	= 3 << NVME_CC_SHN_SHIFT,
>   	NVME_CC_IOSQES		= NVME_NVM_IOSQES << NVME_CC_IOSQES_SHIFT,
>   	NVME_CC_IOCQES		= NVME_NVM_IOCQES << NVME_CC_IOCQES_SHIFT,
> +	NVME_CAP_CSS_NVM	= 1 << 0,
> +	NVME_CAP_CSS_CSI	= 1 << 6,
>   	NVME_CSTS_RDY		= 1 << 0,
>   	NVME_CSTS_CFS		= 1 << 1,
>   	NVME_CSTS_NSSRO		= 1 << 4,
> @@ -374,6 +379,8 @@ enum {
>   	NVME_ID_CNS_CTRL		= 0x01,
>   	NVME_ID_CNS_NS_ACTIVE_LIST	= 0x02,
>   	NVME_ID_CNS_NS_DESC_LIST	= 0x03,
> +	NVME_ID_CNS_CS_NS		= 0x05,
> +	NVME_ID_CNS_CS_CTRL		= 0x06,
>   	NVME_ID_CNS_NS_PRESENT_LIST	= 0x10,
>   	NVME_ID_CNS_NS_PRESENT		= 0x11,
>   	NVME_ID_CNS_CTRL_NS_LIST	= 0x12,
> @@ -383,6 +390,10 @@ enum {
>   	NVME_ID_CNS_UUID_LIST		= 0x17,
>   };
>   
> +enum {
> +	NVME_CSI_NVM			= 0,
> +};
> +
>   enum {
>   	NVME_DIR_IDENTIFY		= 0x00,
>   	NVME_DIR_STREAMS		= 0x01,
> @@ -435,11 +446,13 @@ struct nvme_ns_id_desc {
>   #define NVME_NIDT_EUI64_LEN	8
>   #define NVME_NIDT_NGUID_LEN	16
>   #define NVME_NIDT_UUID_LEN	16
> +#define NVME_NIDT_CSI_LEN	1
>   
>   enum {
>   	NVME_NIDT_EUI64		= 0x01,
>   	NVME_NIDT_NGUID		= 0x02,
>   	NVME_NIDT_UUID		= 0x03,
> +	NVME_NIDT_CSI		= 0x04,
>   };
>   
>   struct nvme_smart_log {
> @@ -972,7 +985,9 @@ struct nvme_identify {
>   	__u8			cns;
>   	__u8			rsvd3;
>   	__le16			ctrlid;
> -	__u32			rsvd11[5];
> +	__u8			rsvd11[3];
> +	__u8			csi;
> +	__u32			rsvd12[4];
>   };
>   
>   #define NVME_IDENTIFY_DATA_SIZE 4096
> 
Otherwise looks okay.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 4/5] nvme: support for multi-command set effects
  2020-06-22 16:25 ` [PATCHv3 4/5] nvme: support for multi-command set effects Keith Busch
@ 2020-06-23  6:21   ` Hannes Reinecke
  2020-06-23 17:43   ` Sagi Grimberg
  1 sibling, 0 replies; 38+ messages in thread
From: Hannes Reinecke @ 2020-06-23  6:21 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, sagi, linux-block, axboe
  Cc: Keith Busch, Javier González, Martin K . Petersen,
	Johannes Thumshirn, Matias Bjørling, Daniel Wagner

On 6/22/20 6:25 PM, Keith Busch wrote:
> From: Keith Busch <keith.busch@wdc.com>
> 
> The Commands Supported and Effects log page was extended with a CSI
> field that enables the host to query the log page for each command set
> supported. Retrieve this log page for each command set that an attached
> namespace supports, and save a pointer to that log in the namespace head.
> 
> Reviewed-by: Javier González <javier.gonz@samsung.com>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
> Reviewed-by: Daniel Wagner <dwagner@suse.de>
> Signed-off-by: Keith Busch <keith.busch@wdc.com>
> ---
>   drivers/nvme/host/core.c      | 79 ++++++++++++++++++++++++++---------
>   drivers/nvme/host/hwmon.c     |  2 +-
>   drivers/nvme/host/lightnvm.c  |  4 +-
>   drivers/nvme/host/multipath.c |  2 +-
>   drivers/nvme/host/nvme.h      | 10 ++++-
>   include/linux/nvme.h          |  4 +-
>   6 files changed, 76 insertions(+), 25 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 5/5] nvme: support for zoned namespaces
  2020-06-22 16:25 ` [PATCHv3 5/5] nvme: support for zoned namespaces Keith Busch
  2020-06-22 16:48   ` Johannes Thumshirn
@ 2020-06-23  6:23   ` Hannes Reinecke
  2020-06-23 17:45   ` Sagi Grimberg
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Hannes Reinecke @ 2020-06-23  6:23 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, sagi, linux-block, axboe
  Cc: Keith Busch, Martin K . Petersen, Hans Holmberg, Dmitry Fomichev,
	Ajay Joshi, Aravind Ramesh, Niklas Cassel, Matias Bjørling,
	Damien Le Moal

On 6/22/20 6:25 PM, Keith Busch wrote:
> From: Keith Busch <keith.busch@wdc.com>
> 
> Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined
> in NVM Express TP4053. Zoned namespaces are discovered based on their
> Command Set Identifier reported in the namespaces Namespace
> Identification Descriptor list. A successfully discovered Zoned
> Namespace will be registered with the block layer as a host managed
> zoned block device with Zone Append command support. A namespace that
> does not support append is not supported by the driver.
> 
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Keith Busch <keith.busch@wdc.com>
> ---
>   block/Kconfig              |   5 +-
>   drivers/nvme/host/Makefile |   1 +
>   drivers/nvme/host/core.c   |  91 ++++++++++++--
>   drivers/nvme/host/nvme.h   |  39 ++++++
>   drivers/nvme/host/zns.c    | 245 +++++++++++++++++++++++++++++++++++++
>   include/linux/nvme.h       | 114 ++++++++++++++++-
>   6 files changed, 480 insertions(+), 15 deletions(-)
>   create mode 100644 drivers/nvme/host/zns.c
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 1/5] block: add capacity field to zone descriptors
  2020-06-22 16:25 ` [PATCHv3 1/5] block: add capacity field to zone descriptors Keith Busch
  2020-06-23  6:15   ` Hannes Reinecke
@ 2020-06-23  8:44   ` Sagi Grimberg
  2020-06-26 12:17   ` Jens Axboe
  2 siblings, 0 replies; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-23  8:44 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, linux-block, axboe
  Cc: Matias Bjørling, Chaitanya Kulkarni, Javier González,
	Daniel Wagner, Johannes Thumshirn, Martin K . Petersen

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 2/5] null_blk: introduce zone capacity for zoned device
  2020-06-22 16:25 ` [PATCHv3 2/5] null_blk: introduce zone capacity for zoned device Keith Busch
  2020-06-23  6:16   ` Hannes Reinecke
@ 2020-06-23  8:45   ` Sagi Grimberg
  1 sibling, 0 replies; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-23  8:45 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, linux-block, axboe
  Cc: Aravind Ramesh, Chaitanya Kulkarni, Matias Bjørling,
	Daniel Wagner, Martin K . Petersen, Johannes Thumshirn

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-22 16:25 ` [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support Keith Busch
  2020-06-23  6:20   ` Hannes Reinecke
@ 2020-06-23  8:53   ` Sagi Grimberg
  2020-06-23 11:25     ` Niklas Cassel
  2020-06-26  8:54     ` Christoph Hellwig
  1 sibling, 2 replies; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-23  8:53 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, linux-block, axboe
  Cc: Niklas Cassel, Javier González, Martin K . Petersen,
	Johannes Thumshirn, Matias Bjørling, Daniel Wagner



On 6/22/20 9:25 AM, Keith Busch wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Implements support for the I/O Command Sets command set. The command set
> introduces a method to enumerate multiple command sets per namespace. If
> the command set is exposed, this method for enumeration will be used
> instead of the traditional method that uses the CC.CSS register command
> set register for command set identification.
> 
> For namespaces where the Command Set Identifier is not supported or
> recognized, the specific namespace will not be created.
> 
> Reviewed-by: Javier González <javier.gonz@samsung.com>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
> Reviewed-by: Daniel Wagner <dwagner@suse.de>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/nvme/host/core.c | 48 +++++++++++++++++++++++++++++++++-------
>   drivers/nvme/host/nvme.h |  1 +
>   include/linux/nvme.h     | 19 ++++++++++++++--
>   3 files changed, 58 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 9491dbcfe81a..45a3cb5a35bd 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1056,8 +1056,13 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
>   	return error;
>   }
>   
> +static bool nvme_multi_css(struct nvme_ctrl *ctrl)
> +{
> +	return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
> +}
> +
>   static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
> -		struct nvme_ns_id_desc *cur)
> +		struct nvme_ns_id_desc *cur, bool *csi_seen)
>   {
>   	const char *warn_str = "ctrl returned bogus length:";
>   	void *data = cur;
> @@ -1087,6 +1092,15 @@ static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
>   		}
>   		uuid_copy(&ids->uuid, data + sizeof(*cur));
>   		return NVME_NIDT_UUID_LEN;
> +	case NVME_NIDT_CSI:
> +		if (cur->nidl != NVME_NIDT_CSI_LEN) {
> +			dev_warn(ctrl->device, "%s %d for NVME_NIDT_CSI\n",
> +				 warn_str, cur->nidl);
> +			return -1;
> +		}
> +		memcpy(&ids->csi, data + sizeof(*cur), NVME_NIDT_CSI_LEN);
> +		*csi_seen = true;
> +		return NVME_NIDT_CSI_LEN;
>   	default:
>   		/* Skip unknown types */
>   		return cur->nidl;
> @@ -1097,10 +1111,9 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
>   		struct nvme_ns_ids *ids)
>   {
>   	struct nvme_command c = { };
> -	int status;
> +	bool csi_seen = false;
> +	int status, pos, len;
>   	void *data;
> -	int pos;
> -	int len;
>   
>   	c.identify.opcode = nvme_admin_identify;
>   	c.identify.nsid = cpu_to_le32(nsid);
> @@ -1130,13 +1143,19 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
>   		if (cur->nidl == 0)
>   			break;
>   
> -		len = nvme_process_ns_desc(ctrl, ids, cur);
> +		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
>   		if (len < 0)
>   			goto free_data;
>   
>   		len += sizeof(*cur);
>   	}
>   free_data:
> +	if (!status && nvme_multi_css(ctrl) && !csi_seen) {

We will clear the status if we detect a path error, that is to
avoid needlessly removing the ns for path failures, so you should
check at the goto site.

> +		dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
> +			 nsid);
> +		status = -EINVAL;
> +	}
> +
>   	kfree(data);
>   	return status;
>   }
> @@ -1792,7 +1811,7 @@ static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
>   		memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
>   	if (ctrl->vs >= NVME_VS(1, 2, 0))
>   		memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
> -	if (ctrl->vs >= NVME_VS(1, 3, 0))
> +	if (ctrl->vs >= NVME_VS(1, 3, 0) || nvme_multi_css(ctrl))
>   		return nvme_identify_ns_descs(ctrl, nsid, ids);
>   	return 0;
>   }
> @@ -1808,7 +1827,8 @@ static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
>   {
>   	return uuid_equal(&a->uuid, &b->uuid) &&
>   		memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
> -		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
> +		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0 &&
> +		a->csi == b->csi;
>   }
>   
>   static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> @@ -1930,6 +1950,15 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
>   	if (ns->lba_shift == 0)
>   		ns->lba_shift = 9;
>   
> +	switch (ns->head->ids.csi) {
> +	case NVME_CSI_NVM:
> +		break;
> +	default:
> +		dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
> +			ns->head->ids.csi, ns->head->ns_id);
> +		return -ENODEV;
> +	}

Not sure we need a switch-case statement for a single case target...

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23  6:20   ` Hannes Reinecke
@ 2020-06-23  9:20     ` Niklas Cassel
  2020-06-23 14:25       ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Niklas Cassel @ 2020-06-23  9:20 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Keith Busch, linux-nvme, hch, sagi, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Tue, Jun 23, 2020 at 08:20:23AM +0200, Hannes Reinecke wrote:
> On 6/22/20 6:25 PM, Keith Busch wrote:
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> > 
> > Implements support for the I/O Command Sets command set. The command set
> > introduces a method to enumerate multiple command sets per namespace. If
> > the command set is exposed, this method for enumeration will be used
> > instead of the traditional method that uses the CC.CSS register command
> > set register for command set identification.
> > 
> > For namespaces where the Command Set Identifier is not supported or
> > recognized, the specific namespace will not be created.
> > 
> > Reviewed-by: Javier González <javier.gonz@samsung.com>
> > Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> > Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
> > Reviewed-by: Daniel Wagner <dwagner@suse.de>
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > ---
> > @@ -1792,7 +1811,7 @@ static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
> >   		memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
> >   	if (ctrl->vs >= NVME_VS(1, 2, 0))
> >   		memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
> > -	if (ctrl->vs >= NVME_VS(1, 3, 0))
> > +	if (ctrl->vs >= NVME_VS(1, 3, 0) || nvme_multi_css(ctrl))
> >   		return nvme_identify_ns_descs(ctrl, nsid, ids);
> >   	return 0;
> >   }
> 
> Hmm? Are command sets even defined for something earlier than 1.3?

According to Keith, usually new features are not really tied to a
specific NVMe version.

So if someone implements/enables multiple command sets feature on
an older base spec of NVMe, we still want to support/allow that.


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23  8:53   ` Sagi Grimberg
@ 2020-06-23 11:25     ` Niklas Cassel
  2020-06-23 14:59       ` Keith Busch
                         ` (2 more replies)
  2020-06-26  8:54     ` Christoph Hellwig
  1 sibling, 3 replies; 38+ messages in thread
From: Niklas Cassel @ 2020-06-23 11:25 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Tue, Jun 23, 2020 at 01:53:47AM -0700, Sagi Grimberg wrote:
> 
> 
> On 6/22/20 9:25 AM, Keith Busch wrote:
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> > 
> > Implements support for the I/O Command Sets command set. The command set
> > introduces a method to enumerate multiple command sets per namespace. If
> > the command set is exposed, this method for enumeration will be used
> > instead of the traditional method that uses the CC.CSS register command
> > set register for command set identification.
> > 
> > For namespaces where the Command Set Identifier is not supported or
> > recognized, the specific namespace will not be created.
> > 
> > Reviewed-by: Javier González <javier.gonz@samsung.com>
> > Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> > Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
> > Reviewed-by: Daniel Wagner <dwagner@suse.de>
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > ---
> >   drivers/nvme/host/core.c | 48 +++++++++++++++++++++++++++++++++-------
> >   drivers/nvme/host/nvme.h |  1 +
> >   include/linux/nvme.h     | 19 ++++++++++++++--
> >   3 files changed, 58 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 9491dbcfe81a..45a3cb5a35bd 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -1056,8 +1056,13 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
> >   	return error;
> >   }
> > +static bool nvme_multi_css(struct nvme_ctrl *ctrl)
> > +{
> > +	return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
> > +}
> > +
> >   static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
> > -		struct nvme_ns_id_desc *cur)
> > +		struct nvme_ns_id_desc *cur, bool *csi_seen)
> >   {
> >   	const char *warn_str = "ctrl returned bogus length:";
> >   	void *data = cur;
> > @@ -1087,6 +1092,15 @@ static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
> >   		}
> >   		uuid_copy(&ids->uuid, data + sizeof(*cur));
> >   		return NVME_NIDT_UUID_LEN;
> > +	case NVME_NIDT_CSI:
> > +		if (cur->nidl != NVME_NIDT_CSI_LEN) {
> > +			dev_warn(ctrl->device, "%s %d for NVME_NIDT_CSI\n",
> > +				 warn_str, cur->nidl);
> > +			return -1;
> > +		}
> > +		memcpy(&ids->csi, data + sizeof(*cur), NVME_NIDT_CSI_LEN);
> > +		*csi_seen = true;
> > +		return NVME_NIDT_CSI_LEN;
> >   	default:
> >   		/* Skip unknown types */
> >   		return cur->nidl;
> > @@ -1097,10 +1111,9 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
> >   		struct nvme_ns_ids *ids)
> >   {
> >   	struct nvme_command c = { };
> > -	int status;
> > +	bool csi_seen = false;
> > +	int status, pos, len;
> >   	void *data;
> > -	int pos;
> > -	int len;
> >   	c.identify.opcode = nvme_admin_identify;
> >   	c.identify.nsid = cpu_to_le32(nsid);
> > @@ -1130,13 +1143,19 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
> >   		if (cur->nidl == 0)
> >   			break;
> > -		len = nvme_process_ns_desc(ctrl, ids, cur);
> > +		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
> >   		if (len < 0)
> >   			goto free_data;
> >   		len += sizeof(*cur);
> >   	}
> >   free_data:
> > +	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
> 
> We will clear the status if we detect a path error, that is to
> avoid needlessly removing the ns for path failures, so you should
> check at the goto site.

The problem is that this check has to be done after checking all the ns descs,
so this check to be done as the final thing, at least after processing all the
ns descs. No matter if nvme_process_ns_desc() returned an error, or if
simply NVME_NIDT_CSI wasn't part of the ns desc list, so the loop reached the
end without error.

Even if the nvme command failed and the status was cleared:

                if (status > 0 && !(status & NVME_SC_DNR))
                        status = 0;

we still need to return an error, if (nvme_multi_css(ctrl) && !csi_seen).
(Not reporting a CSI when nvme_multi_css() is enabled, is fatal.)

That is why the code looks like it does.

I guess we could do something like this, which does the same thing,
but perhaps is a bit clearer:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e95f0c498a6b..bef687b9a277 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1160,8 +1160,10 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
                  * Don't treat an error as fatal, as we potentially already
                  * have a NGUID or EUI-64.
                  */
-               if (status > 0 && !(status & NVME_SC_DNR))
+               if (status > 0 && !(status & NVME_SC_DNR)) {
                        status = 0;
+                       goto csi_check;
+               }
                goto free_data;
        }

@@ -1173,17 +1175,17 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,

                len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
                if (len < 0)
-                       goto free_data;
+                       goto csi_check;

                len += sizeof(*cur);
        }
-free_data:
-       if (!status && nvme_multi_css(ctrl) && !csi_seen) {
+csi_check:
+       if (nvme_multi_css(ctrl) && !csi_seen) {
                dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
                         nsid);
                status = -EINVAL;
        }
-
+free_data:
        kfree(data);
        return status;
 }


> 
> > +		dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
> > +			 nsid);
> > +		status = -EINVAL;
> > +	}
> > +
> >   	kfree(data);
> >   	return status;
> >   }
> > @@ -1792,7 +1811,7 @@ static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
> >   		memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
> >   	if (ctrl->vs >= NVME_VS(1, 2, 0))
> >   		memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
> > -	if (ctrl->vs >= NVME_VS(1, 3, 0))
> > +	if (ctrl->vs >= NVME_VS(1, 3, 0) || nvme_multi_css(ctrl))
> >   		return nvme_identify_ns_descs(ctrl, nsid, ids);
> >   	return 0;
> >   }
> > @@ -1808,7 +1827,8 @@ static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
> >   {
> >   	return uuid_equal(&a->uuid, &b->uuid) &&
> >   		memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
> > -		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
> > +		memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0 &&
> > +		a->csi == b->csi;
> >   }
> >   static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > @@ -1930,6 +1950,15 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
> >   	if (ns->lba_shift == 0)
> >   		ns->lba_shift = 9;
> > +	switch (ns->head->ids.csi) {
> > +	case NVME_CSI_NVM:
> > +		break;
> > +	default:
> > +		dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
> > +			ns->head->ids.csi, ns->head->ns_id);
> > +		return -ENODEV;
> > +	}
> 
> Not sure we need a switch-case statement for a single case target...

I would consider it two cases. A supported CSI or a non-supported CSI
(which means any CSI value != NVME_CSI_NVM).

However, a follow up patch (patch 5/5 in this series) adds another case
to this switch-case statement (NVME_CSI_ZNS).

I guess this patch could have used an if-else statement, and patch 5/5
replaced the if-statement with a switch-case.
However, since a patch in the same series actually adds another case,
I think that it is more clear this way.
(A switch-case with only two cases added, in a patch that is not the last
one in the series, suggests (at least to me), that it will most likely be
extended in a following patch.)


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23  9:20     ` Niklas Cassel
@ 2020-06-23 14:25       ` Keith Busch
  0 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-23 14:25 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Hannes Reinecke, linux-nvme, hch, sagi, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Tue, Jun 23, 2020 at 09:20:35AM +0000, Niklas Cassel wrote:
> On Tue, Jun 23, 2020 at 08:20:23AM +0200, Hannes Reinecke wrote:
> > On 6/22/20 6:25 PM, Keith Busch wrote:
> > > From: Niklas Cassel <niklas.cassel@wdc.com>
> > > -	if (ctrl->vs >= NVME_VS(1, 3, 0))
> > > +	if (ctrl->vs >= NVME_VS(1, 3, 0) || nvme_multi_css(ctrl))
> > >   		return nvme_identify_ns_descs(ctrl, nsid, ids);
> > >   	return 0;
> > >   }
> > 
> > Hmm? Are command sets even defined for something earlier than 1.3?
> 
> According to Keith, usually new features are not really tied to a
> specific NVMe version.

Not really according to me; nvme just allows controllers to implement
features ratified after the initial version release.
 
> So if someone implements/enables multiple command sets feature on
> an older base spec of NVMe, we still want to support/allow that.

Right. We don't check the version to see if multi-command sets are
supported when we initially enable it, so we shouldn't need to check the
version later.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23 11:25     ` Niklas Cassel
@ 2020-06-23 14:59       ` Keith Busch
  2020-06-23 22:10       ` Keith Busch
  2020-06-23 23:20       ` Sagi Grimberg
  2 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-23 14:59 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Sagi Grimberg, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Tue, Jun 23, 2020 at 11:25:53AM +0000, Niklas Cassel wrote:
> On Tue, Jun 23, 2020 at 01:53:47AM -0700, Sagi Grimberg wrote:
> > >   static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > > @@ -1930,6 +1950,15 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
> > >   	if (ns->lba_shift == 0)
> > >   		ns->lba_shift = 9;
> > > +	switch (ns->head->ids.csi) {
> > > +	case NVME_CSI_NVM:
> > > +		break;
> > > +	default:
> > > +		dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
> > > +			ns->head->ids.csi, ns->head->ns_id);
> > > +		return -ENODEV;
> > > +	}
> > 
> > Not sure we need a switch-case statement for a single case target...
> 
> I would consider it two cases. A supported CSI or a non-supported CSI
> (which means any CSI value != NVME_CSI_NVM).
> 
> However, a follow up patch (patch 5/5 in this series) adds another case
> to this switch-case statement (NVME_CSI_ZNS).
> 
> I guess this patch could have used an if-else statement, and patch 5/5
> replaced the if-statement with a switch-case.
> However, since a patch in the same series actually adds another case,
> I think that it is more clear this way.
> (A switch-case with only two cases added, in a patch that is not the last
> one in the series, suggests (at least to me), that it will most likely be
> extended in a following patch.)

Yeah, this patch is laying the foundation for future command sets.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 4/5] nvme: support for multi-command set effects
  2020-06-22 16:25 ` [PATCHv3 4/5] nvme: support for multi-command set effects Keith Busch
  2020-06-23  6:21   ` Hannes Reinecke
@ 2020-06-23 17:43   ` Sagi Grimberg
  1 sibling, 0 replies; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-23 17:43 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, linux-block, axboe
  Cc: Keith Busch, Javier González, Martin K . Petersen,
	Johannes Thumshirn, Matias Bjørling, Daniel Wagner

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 5/5] nvme: support for zoned namespaces
  2020-06-22 16:25 ` [PATCHv3 5/5] nvme: support for zoned namespaces Keith Busch
  2020-06-22 16:48   ` Johannes Thumshirn
  2020-06-23  6:23   ` Hannes Reinecke
@ 2020-06-23 17:45   ` Sagi Grimberg
  2020-06-24  9:11   ` Javier González
  2020-06-29 13:53   ` Johannes Thumshirn
  4 siblings, 0 replies; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-23 17:45 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, linux-block, axboe
  Cc: Keith Busch, Martin K . Petersen, Hans Holmberg, Dmitry Fomichev,
	Ajay Joshi, Aravind Ramesh, Niklas Cassel, Matias Bjørling,
	Damien Le Moal

Looks good Keith,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23 11:25     ` Niklas Cassel
  2020-06-23 14:59       ` Keith Busch
@ 2020-06-23 22:10       ` Keith Busch
  2020-06-23 23:17         ` Sagi Grimberg
  2020-06-23 23:20       ` Sagi Grimberg
  2 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2020-06-23 22:10 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Sagi Grimberg, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Tue, Jun 23, 2020 at 11:25:53AM +0000, Niklas Cassel wrote:
> On Tue, Jun 23, 2020 at 01:53:47AM -0700, Sagi Grimberg wrote:
> > On 6/22/20 9:25 AM, Keith Busch wrote:
> > > -		len = nvme_process_ns_desc(ctrl, ids, cur);
> > > +		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
> > >   		if (len < 0)
> > >   			goto free_data;
> > >   		len += sizeof(*cur);
> > >   	}
> > >   free_data:
> > > +	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
> > 
> > We will clear the status if we detect a path error, that is to
> > avoid needlessly removing the ns for path failures, so you should
> > check at the goto site.
> 
> The problem is that this check has to be done after checking all the ns descs,
> so this check to be done as the final thing, at least after processing all the
> ns descs. No matter if nvme_process_ns_desc() returned an error, or if
> simply NVME_NIDT_CSI wasn't part of the ns desc list, so the loop reached the
> end without error.
> 
> Even if the nvme command failed and the status was cleared:
> 
>                 if (status > 0 && !(status & NVME_SC_DNR))
>                         status = 0;

This check is so weird. What has DNR got to do with whether or not we
want to continue with this namespace? The commit that adds this says
it's to check for a host failed IO, but a controller can just as validly
set DNR in its error status, in which case we'd still want clear the
status.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23 22:10       ` Keith Busch
@ 2020-06-23 23:17         ` Sagi Grimberg
  2020-06-24 17:25           ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-23 23:17 UTC (permalink / raw)
  To: Keith Busch, Niklas Cassel
  Cc: linux-nvme, hch, linux-block, axboe, Javier González,
	Martin K . Petersen, Johannes Thumshirn, Matias Bjorling,
	Daniel Wagner


>>>> -		len = nvme_process_ns_desc(ctrl, ids, cur);
>>>> +		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
>>>>    		if (len < 0)
>>>>    			goto free_data;
>>>>    		len += sizeof(*cur);
>>>>    	}
>>>>    free_data:
>>>> +	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
>>>
>>> We will clear the status if we detect a path error, that is to
>>> avoid needlessly removing the ns for path failures, so you should
>>> check at the goto site.
>>
>> The problem is that this check has to be done after checking all the ns descs,
>> so this check to be done as the final thing, at least after processing all the
>> ns descs. No matter if nvme_process_ns_desc() returned an error, or if
>> simply NVME_NIDT_CSI wasn't part of the ns desc list, so the loop reached the
>> end without error.
>>
>> Even if the nvme command failed and the status was cleared:
>>
>>                  if (status > 0 && !(status & NVME_SC_DNR))
>>                          status = 0;
> 
> This check is so weird. What has DNR got to do with whether or not we
> want to continue with this namespace? The commit that adds this says
> it's to check for a host failed IO, but a controller can just as validly
> set DNR in its error status, in which case we'd still want clear the
> status.

The reason is that if this error is not a DNR error, it means that we
should retry and succeed, we don't want to remove the namespace.

The problem that this solves is the fact that we have various error
recovery conditions (interconnect issues, failures, resets) and the
async works are designed to continue to run, issue I/O and fail. The
scan work, will revalidate namespaces and remove them if it fails to
do so, which is inherently wrong if these are simply inaccessible by
the host at this time.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23 11:25     ` Niklas Cassel
  2020-06-23 14:59       ` Keith Busch
  2020-06-23 22:10       ` Keith Busch
@ 2020-06-23 23:20       ` Sagi Grimberg
  2 siblings, 0 replies; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-23 23:20 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Keith Busch, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner



On 6/23/20 4:25 AM, Niklas Cassel wrote:
> On Tue, Jun 23, 2020 at 01:53:47AM -0700, Sagi Grimberg wrote:
>>
>>
>> On 6/22/20 9:25 AM, Keith Busch wrote:
>>> From: Niklas Cassel <niklas.cassel@wdc.com>
>>>
>>> Implements support for the I/O Command Sets command set. The command set
>>> introduces a method to enumerate multiple command sets per namespace. If
>>> the command set is exposed, this method for enumeration will be used
>>> instead of the traditional method that uses the CC.CSS register command
>>> set register for command set identification.
>>>
>>> For namespaces where the Command Set Identifier is not supported or
>>> recognized, the specific namespace will not be created.
>>>
>>> Reviewed-by: Javier González <javier.gonz@samsung.com>
>>> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
>>> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
>>> Reviewed-by: Daniel Wagner <dwagner@suse.de>
>>> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
>>> ---
>>>    drivers/nvme/host/core.c | 48 +++++++++++++++++++++++++++++++++-------
>>>    drivers/nvme/host/nvme.h |  1 +
>>>    include/linux/nvme.h     | 19 ++++++++++++++--
>>>    3 files changed, 58 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 9491dbcfe81a..45a3cb5a35bd 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -1056,8 +1056,13 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
>>>    	return error;
>>>    }
>>> +static bool nvme_multi_css(struct nvme_ctrl *ctrl)
>>> +{
>>> +	return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
>>> +}
>>> +
>>>    static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
>>> -		struct nvme_ns_id_desc *cur)
>>> +		struct nvme_ns_id_desc *cur, bool *csi_seen)
>>>    {
>>>    	const char *warn_str = "ctrl returned bogus length:";
>>>    	void *data = cur;
>>> @@ -1087,6 +1092,15 @@ static int nvme_process_ns_desc(struct nvme_ctrl *ctrl, struct nvme_ns_ids *ids,
>>>    		}
>>>    		uuid_copy(&ids->uuid, data + sizeof(*cur));
>>>    		return NVME_NIDT_UUID_LEN;
>>> +	case NVME_NIDT_CSI:
>>> +		if (cur->nidl != NVME_NIDT_CSI_LEN) {
>>> +			dev_warn(ctrl->device, "%s %d for NVME_NIDT_CSI\n",
>>> +				 warn_str, cur->nidl);
>>> +			return -1;
>>> +		}
>>> +		memcpy(&ids->csi, data + sizeof(*cur), NVME_NIDT_CSI_LEN);
>>> +		*csi_seen = true;
>>> +		return NVME_NIDT_CSI_LEN;
>>>    	default:
>>>    		/* Skip unknown types */
>>>    		return cur->nidl;
>>> @@ -1097,10 +1111,9 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
>>>    		struct nvme_ns_ids *ids)
>>>    {
>>>    	struct nvme_command c = { };
>>> -	int status;
>>> +	bool csi_seen = false;
>>> +	int status, pos, len;
>>>    	void *data;
>>> -	int pos;
>>> -	int len;
>>>    	c.identify.opcode = nvme_admin_identify;
>>>    	c.identify.nsid = cpu_to_le32(nsid);
>>> @@ -1130,13 +1143,19 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
>>>    		if (cur->nidl == 0)
>>>    			break;
>>> -		len = nvme_process_ns_desc(ctrl, ids, cur);
>>> +		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
>>>    		if (len < 0)
>>>    			goto free_data;
>>>    		len += sizeof(*cur);
>>>    	}
>>>    free_data:
>>> +	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
>>
>> We will clear the status if we detect a path error, that is to
>> avoid needlessly removing the ns for path failures, so you should
>> check at the goto site.
> 
> The problem is that this check has to be done after checking all the ns descs,
> so this check to be done as the final thing, at least after processing all the
> ns descs. No matter if nvme_process_ns_desc() returned an error, or if
> simply NVME_NIDT_CSI wasn't part of the ns desc list, so the loop reached the
> end without error.
> 
> Even if the nvme command failed and the status was cleared:
> 
>                  if (status > 0 && !(status & NVME_SC_DNR))
>                          status = 0;
> 
> we still need to return an error, if (nvme_multi_css(ctrl) && !csi_seen).
> (Not reporting a CSI when nvme_multi_css() is enabled, is fatal.)
> 
> That is why the code looks like it does.
> 
> I guess we could do something like this, which does the same thing,
> but perhaps is a bit clearer:
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index e95f0c498a6b..bef687b9a277 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1160,8 +1160,10 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
>                    * Don't treat an error as fatal, as we potentially already
>                    * have a NGUID or EUI-64.
>                    */
> -               if (status > 0 && !(status & NVME_SC_DNR))
> +               if (status > 0 && !(status & NVME_SC_DNR)) {
>                          status = 0;
> +                       goto csi_check;
> +               }

I think its the opposite. If we failed with DNR, you should assume
that either the controller wants the host to retry, or its a
path/transport error. So we need to leave this namespace alone and
assume that when the host is connected back to the controller, the
scan_work will revalidate again.

So you should fail the csi_check only if you see a DNR error status.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 5/5] nvme: support for zoned namespaces
  2020-06-22 16:25 ` [PATCHv3 5/5] nvme: support for zoned namespaces Keith Busch
                     ` (2 preceding siblings ...)
  2020-06-23 17:45   ` Sagi Grimberg
@ 2020-06-24  9:11   ` Javier González
  2020-06-29 13:53   ` Johannes Thumshirn
  4 siblings, 0 replies; 38+ messages in thread
From: Javier González @ 2020-06-24  9:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, hch, sagi, linux-block, axboe, Niklas Cassel,
	Damien Le Moal, Ajay Joshi, Keith Busch, Martin K . Petersen,
	Dmitry Fomichev, Aravind Ramesh, Hans Holmberg,
	Matias Bjørling

On 22.06.2020 09:25, Keith Busch wrote:
>From: Keith Busch <keith.busch@wdc.com>
>
>Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined
>in NVM Express TP4053. Zoned namespaces are discovered based on their
>Command Set Identifier reported in the namespaces Namespace
>Identification Descriptor list. A successfully discovered Zoned
>Namespace will be registered with the block layer as a host managed
>zoned block device with Zone Append command support. A namespace that
>does not support append is not supported by the driver.
>
>Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
>Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
>Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
>Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
>Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
>Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
>Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
>Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>Signed-off-by: Keith Busch <keith.busch@wdc.com>
>---
> block/Kconfig              |   5 +-
> drivers/nvme/host/Makefile |   1 +
> drivers/nvme/host/core.c   |  91 ++++++++++++--
> drivers/nvme/host/nvme.h   |  39 ++++++
> drivers/nvme/host/zns.c    | 245 +++++++++++++++++++++++++++++++++++++
> include/linux/nvme.h       | 114 ++++++++++++++++-
> 6 files changed, 480 insertions(+), 15 deletions(-)
> create mode 100644 drivers/nvme/host/zns.c
>
>diff --git a/block/Kconfig b/block/Kconfig
>index 9357d7302398..bbad5e8bbffe 100644
>--- a/block/Kconfig
>+++ b/block/Kconfig
>@@ -86,9 +86,10 @@ config BLK_DEV_ZONED
> 	select MQ_IOSCHED_DEADLINE
> 	help
> 	Block layer zoned block device support. This option enables
>-	support for ZAC/ZBC host-managed and host-aware zoned block devices.
>+	support for ZAC/ZBC/ZNS host-managed and host-aware zoned block
>+	devices.
>
>-	Say yes here if you have a ZAC or ZBC storage device.
>+	Say yes here if you have a ZAC, ZBC, or ZNS storage device.
>
> config BLK_DEV_THROTTLING
> 	bool "Block layer bio throttling support"
>diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
>index fc7b26be692d..d7f6a87687b8 100644
>--- a/drivers/nvme/host/Makefile
>+++ b/drivers/nvme/host/Makefile
>@@ -13,6 +13,7 @@ nvme-core-y				:= core.o
> nvme-core-$(CONFIG_TRACING)		+= trace.o
> nvme-core-$(CONFIG_NVME_MULTIPATH)	+= multipath.o
> nvme-core-$(CONFIG_NVM)			+= lightnvm.o
>+nvme-core-$(CONFIG_BLK_DEV_ZONED)	+= zns.o
> nvme-core-$(CONFIG_FAULT_INJECTION_DEBUG_FS)	+= fault_inject.o
> nvme-core-$(CONFIG_NVME_HWMON)		+= hwmon.o
>
>diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>index 64b7b9fc2817..885db561de17 100644
>--- a/drivers/nvme/host/core.c
>+++ b/drivers/nvme/host/core.c
>@@ -89,7 +89,7 @@ static dev_t nvme_chr_devt;
> static struct class *nvme_class;
> static struct class *nvme_subsys_class;
>
>-static int nvme_revalidate_disk(struct gendisk *disk);
>+static int _nvme_revalidate_disk(struct gendisk *disk);
> static void nvme_put_subsystem(struct nvme_subsystem *subsys);
> static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
> 					   unsigned nsid);
>@@ -287,6 +287,10 @@ void nvme_complete_rq(struct request *req)
> 			nvme_retry_req(req);
> 			return;
> 		}
>+	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>+		   req_op(req) == REQ_OP_ZONE_APPEND) {
>+		req->__sector = nvme_lba_to_sect(req->q->queuedata,
>+			le64_to_cpu(nvme_req(req)->result.u64));
> 	}
>
> 	nvme_trace_bio_complete(req, status);
>@@ -673,7 +677,8 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
> }
>
> static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>-		struct request *req, struct nvme_command *cmnd)
>+		struct request *req, struct nvme_command *cmnd,
>+		enum nvme_opcode op)
> {
> 	struct nvme_ctrl *ctrl = ns->ctrl;
> 	u16 control = 0;
>@@ -687,7 +692,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
> 	if (req->cmd_flags & REQ_RAHEAD)
> 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>
>-	cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
>+	cmnd->rw.opcode = op;
> 	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
> 	cmnd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
> 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>@@ -716,6 +721,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
> 		case NVME_NS_DPS_PI_TYPE2:
> 			control |= NVME_RW_PRINFO_PRCHK_GUARD |
> 					NVME_RW_PRINFO_PRCHK_REF;
>+			if (op == nvme_cmd_zone_append)
>+				control |= NVME_RW_APPEND_PIREMAP;
> 			cmnd->rw.reftag = cpu_to_le32(t10_pi_ref_tag(req));
> 			break;
> 		}
>@@ -756,6 +763,19 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
> 	case REQ_OP_FLUSH:
> 		nvme_setup_flush(ns, cmd);
> 		break;
>+	case REQ_OP_ZONE_RESET_ALL:
>+	case REQ_OP_ZONE_RESET:
>+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_RESET);
>+		break;
>+	case REQ_OP_ZONE_OPEN:
>+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_OPEN);
>+		break;
>+	case REQ_OP_ZONE_CLOSE:
>+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_CLOSE);
>+		break;
>+	case REQ_OP_ZONE_FINISH:
>+		ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>+		break;
> 	case REQ_OP_WRITE_ZEROES:
> 		ret = nvme_setup_write_zeroes(ns, req, cmd);
> 		break;
>@@ -763,8 +783,13 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
> 		ret = nvme_setup_discard(ns, req, cmd);
> 		break;
> 	case REQ_OP_READ:
>+		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
>+		break;
> 	case REQ_OP_WRITE:
>-		ret = nvme_setup_rw(ns, req, cmd);
>+		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
>+		break;
>+	case REQ_OP_ZONE_APPEND:
>+		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
> 		break;
> 	default:
> 		WARN_ON_ONCE(1);
>@@ -1391,14 +1416,23 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> 	return effects;
> }
>
>-static void nvme_update_formats(struct nvme_ctrl *ctrl)
>+static void nvme_update_formats(struct nvme_ctrl *ctrl, u32 *effects)
> {
> 	struct nvme_ns *ns;
>
> 	down_read(&ctrl->namespaces_rwsem);
> 	list_for_each_entry(ns, &ctrl->namespaces, list)
>-		if (ns->disk && nvme_revalidate_disk(ns->disk))
>+		if (ns->disk && _nvme_revalidate_disk(ns->disk))
> 			nvme_set_queue_dying(ns);
>+		else if (blk_queue_is_zoned(ns->disk->queue)) {
>+			/*
>+			 * IO commands are required to fully revalidate a zoned
>+			 * device. Force the command effects to trigger rescan
>+			 * work so report zones can run in a context with
>+			 * unfrozen IO queues.
>+			 */
>+			*effects |= NVME_CMD_EFFECTS_NCC;
>+		}
> 	up_read(&ctrl->namespaces_rwsem);
> }
>
>@@ -1410,7 +1444,7 @@ static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
> 	 * this command.
> 	 */
> 	if (effects & NVME_CMD_EFFECTS_LBCC)
>-		nvme_update_formats(ctrl);
>+		nvme_update_formats(ctrl, &effects);
> 	if (effects & (NVME_CMD_EFFECTS_LBCC | NVME_CMD_EFFECTS_CSE_MASK)) {
> 		nvme_unfreeze(ctrl);
> 		nvme_mpath_unfreeze(ctrl->subsys);
>@@ -1525,7 +1559,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>  * Issue ioctl requests on the first available path.  Note that unlike normal
>  * block layer requests we will not retry failed request on another controller.
>  */
>-static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
>+struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
> 		struct nvme_ns_head **head, int *srcu_idx)
> {
> #ifdef CONFIG_NVME_MULTIPATH
>@@ -1545,7 +1579,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
> 	return disk->private_data;
> }
>
>-static void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
>+void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
> {
> 	if (head)
> 		srcu_read_unlock(&head->srcu, idx);
>@@ -1938,21 +1972,28 @@ static void nvme_update_disk_info(struct gendisk *disk,
>
> static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
> {
>+	unsigned lbaf = id->flbas & NVME_NS_FLBAS_LBA_MASK;
> 	struct nvme_ns *ns = disk->private_data;
> 	struct nvme_ctrl *ctrl = ns->ctrl;
>+	int ret;
> 	u32 iob;
>
> 	/*
> 	 * If identify namespace failed, use default 512 byte block size so
> 	 * block layer can use before failing read/write for 0 capacity.
> 	 */
>-	ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds;
>+	ns->lba_shift = id->lbaf[lbaf].ds;
> 	if (ns->lba_shift == 0)
> 		ns->lba_shift = 9;
>
> 	switch (ns->head->ids.csi) {
> 	case NVME_CSI_NVM:
> 		break;
>+	case NVME_CSI_ZNS:
>+		ret = nvme_update_zone_info(disk, ns, lbaf);
>+		if (ret)
>+			return ret;
>+		break;
> 	default:
> 		dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
> 			ns->head->ids.csi, ns->head->ns_id);
>@@ -1966,7 +2007,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
> 		iob = nvme_lba_to_sect(ns, le16_to_cpu(id->noiob));
>
> 	ns->features = 0;
>-	ns->ms = le16_to_cpu(id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ms);
>+	ns->ms = le16_to_cpu(id->lbaf[lbaf].ms);
> 	/* the PI implementation requires metadata equal t10 pi tuple size */
> 	if (ns->ms == sizeof(struct t10_pi_tuple))
> 		ns->pi_type = id->dps & NVME_NS_DPS_PI_MASK;
>@@ -2009,7 +2050,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
> 	return 0;
> }
>
>-static int nvme_revalidate_disk(struct gendisk *disk)
>+static int _nvme_revalidate_disk(struct gendisk *disk)
> {
> 	struct nvme_ns *ns = disk->private_data;
> 	struct nvme_ctrl *ctrl = ns->ctrl;
>@@ -2057,6 +2098,28 @@ static int nvme_revalidate_disk(struct gendisk *disk)
> 	return ret;
> }
>
>+static int nvme_revalidate_disk(struct gendisk *disk)
>+{
>+	int ret;
>+
>+	ret = _nvme_revalidate_disk(disk);
>+	if (ret)
>+		return ret;
>+
>+#ifdef CONFIG_BLK_DEV_ZONED
>+	if (blk_queue_is_zoned(disk->queue)) {
>+		struct nvme_ns *ns = disk->private_data;
>+		struct nvme_ctrl *ctrl = ns->ctrl;
>+
>+		ret = blk_revalidate_disk_zones(disk, NULL);
>+		if (!ret)
>+			blk_queue_max_zone_append_sectors(disk->queue,
>+							  ctrl->max_zone_append);
>+	}
>+#endif
>+	return ret;
>+}
>+
> static char nvme_pr_type(enum pr_type type)
> {
> 	switch (type) {
>@@ -2187,6 +2250,7 @@ static const struct block_device_operations nvme_fops = {
> 	.release	= nvme_release,
> 	.getgeo		= nvme_getgeo,
> 	.revalidate_disk= nvme_revalidate_disk,
>+	.report_zones	= nvme_report_zones,
> 	.pr_ops		= &nvme_pr_ops,
> };
>
>@@ -2212,6 +2276,7 @@ const struct block_device_operations nvme_ns_head_ops = {
> 	.ioctl		= nvme_ioctl,
> 	.compat_ioctl	= nvme_compat_ioctl,
> 	.getgeo		= nvme_getgeo,
>+	.report_zones	= nvme_report_zones,
> 	.pr_ops		= &nvme_pr_ops,
> };
> #endif /* CONFIG_NVME_MULTIPATH */
>@@ -4439,6 +4504,8 @@ static inline void _nvme_check_size(void)
> 	BUILD_BUG_ON(sizeof(struct nvme_command) != 64);
> 	BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE);
> 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE);
>+	BUILD_BUG_ON(sizeof(struct nvme_id_ns_zns) != NVME_IDENTIFY_DATA_SIZE);
>+	BUILD_BUG_ON(sizeof(struct nvme_id_ctrl_zns) != NVME_IDENTIFY_DATA_SIZE);
> 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
> 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
> 	BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
>diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>index 4a1982133e9a..ecf443efdf91 100644
>--- a/drivers/nvme/host/nvme.h
>+++ b/drivers/nvme/host/nvme.h
>@@ -238,6 +238,9 @@ struct nvme_ctrl {
> 	u32 max_hw_sectors;
> 	u32 max_segments;
> 	u32 max_integrity_segments;
>+#ifdef CONFIG_BLK_DEV_ZONED
>+	u32 max_zone_append;
>+#endif
> 	u16 crdt[3];
> 	u16 oncs;
> 	u16 oacs;
>@@ -402,6 +405,9 @@ struct nvme_ns {
> 	u16 sgs;
> 	u32 sws;
> 	u8 pi_type;
>+#ifdef CONFIG_BLK_DEV_ZONED
>+	u64 zsze;
>+#endif
> 	unsigned long features;
> 	unsigned long flags;
> #define NVME_NS_REMOVING	0
>@@ -567,6 +573,9 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
>
> int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi,
> 		void *log, size_t size, u64 offset);
>+struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk,
>+		struct nvme_ns_head **head, int *srcu_idx);
>+void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx);
>
> extern const struct attribute_group *nvme_ns_id_attr_groups[];
> extern const struct block_device_operations nvme_ns_head_ops;
>@@ -688,6 +697,36 @@ static inline void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
> }
> #endif /* CONFIG_NVME_MULTIPATH */
>
>+#ifdef CONFIG_BLK_DEV_ZONED
>+int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>+			  unsigned lbaf);
>+
>+int nvme_report_zones(struct gendisk *disk, sector_t sector,
>+		      unsigned int nr_zones, report_zones_cb cb, void *data);
>+
>+blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>+				       struct nvme_command *cmnd,
>+				       enum nvme_zone_mgmt_action action);
>+#else
>+#define nvme_report_zones NULL
>+
>+static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>+		struct request *req, struct nvme_command *cmnd,
>+		enum nvme_zone_mgmt_action action)
>+{
>+	return BLK_STS_NOTSUPP;
>+}
>+
>+static inline int nvme_update_zone_info(struct gendisk *disk,
>+					struct nvme_ns *ns,
>+					unsigned lbaf)
>+{
>+	dev_warn(ns->ctrl->device,
>+		 "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
>+	return -EPROTONOSUPPORT;
>+}
>+#endif
>+
> #ifdef CONFIG_NVM
> int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node);
> void nvme_nvm_unregister(struct nvme_ns *ns);
>diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>new file mode 100644
>index 000000000000..ee6f49a8aee4
>--- /dev/null
>+++ b/drivers/nvme/host/zns.c
>@@ -0,0 +1,245 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/*
>+ * Copyright (C) 2020 Western Digital Corporation or its affiliates.
>+ */
>+
>+#include <linux/blkdev.h>
>+#include <linux/vmalloc.h>
>+#include "nvme.h"
>+
>+static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>+{
>+	struct nvme_command c = { };
>+	struct nvme_id_ctrl_zns *id;
>+	int status;
>+
>+	id = kzalloc(sizeof(*id), GFP_KERNEL);
>+	if (!id)
>+		return -ENOMEM;
>+
>+	c.identify.opcode = nvme_admin_identify;
>+	c.identify.cns = NVME_ID_CNS_CS_CTRL;
>+	c.identify.csi = NVME_CSI_ZNS;
>+
>+	status = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id));
>+	if (status) {
>+		kfree(id);
>+		return status;
>+	}
>+
>+	ctrl->max_zone_append = 1 << (id->zamds + 3);
>+	kfree(id);
>+	return 0;
>+}
>+
>+int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
>+			  unsigned lbaf)
>+{
>+	struct nvme_effects_log *log = ns->head->effects;
>+	struct request_queue *q = disk->queue;
>+	struct nvme_command c = { };
>+	struct nvme_id_ns_zns *id;
>+	int status;
>+
>+	/* Driver requires zone append support */
>+	if (!(log->iocs[nvme_cmd_zone_append] & NVME_CMD_EFFECTS_CSUPP)) {
>+		dev_warn(ns->ctrl->device,
>+			"append not supported for zoned namespace:%d\n",
>+			ns->head->ns_id);
>+		return -ENODEV;
>+	}
>+
>+	/* Lazily query controller append limit for the first zoned namespace */
>+	if (!ns->ctrl->max_zone_append) {
>+		status = nvme_set_max_append(ns->ctrl);
>+		if (status)
>+			return status;
>+	}
>+
>+	id = kzalloc(sizeof(*id), GFP_KERNEL);
>+	if (!id)
>+		return -ENOMEM;
>+
>+	c.identify.opcode = nvme_admin_identify;
>+	c.identify.nsid = cpu_to_le32(ns->head->ns_id);
>+	c.identify.cns = NVME_ID_CNS_CS_NS;
>+	c.identify.csi = NVME_CSI_ZNS;
>+
>+	status = nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, id, sizeof(*id));
>+	if (status)
>+		goto free_data;
>+
>+	/*
>+	 * We currently do not handle devices requiring any of the zoned
>+	 * operation characteristics.
>+	 */
>+	if (id->zoc) {
>+		dev_warn(ns->ctrl->device,
>+			"zone operations:%x not supported for namespace:%d\n",
>+			le_to_cpu16(id->zoc), ns->head->ns_id);
>+		status = -EINVAL;
>+		goto free_data;
>+	}
>+
>+	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
>+	if (!ns->zsze) {
>+		status = -EINVAL;
>+		goto free_data;
>+	}
>+
>+	q->limits.zoned = BLK_ZONED_HM;
>+	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>+free_data:
>+	kfree(id);
>+	return status;
>+}
>+
>+static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>+					  unsigned int nr_zones, size_t *buflen)
>+{
>+	struct request_queue *q = ns->disk->queue;
>+	size_t bufsize;
>+	void *buf;
>+
>+	const size_t min_bufsize = sizeof(struct nvme_zone_report) +
>+				   sizeof(struct nvme_zone_descriptor);
>+
>+	nr_zones = min_t(unsigned int, nr_zones,
>+			 get_capacity(ns->disk) >> ilog2(ns->zsze));
>+
>+	bufsize = sizeof(struct nvme_zone_report) +
>+		nr_zones * sizeof(struct nvme_zone_descriptor);
>+	bufsize = min_t(size_t, bufsize,
>+			queue_max_hw_sectors(q) << SECTOR_SHIFT);
>+	bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);
>+
>+	while (bufsize >= min_bufsize) {
>+		buf = __vmalloc(bufsize,
>+				GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY);
>+		if (buf) {
>+			*buflen = bufsize;
>+			return buf;
>+		}
>+		bufsize >>= 1;
>+	}
>+	return NULL;
>+}
>+
>+static int __nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>+				  struct nvme_zone_report *report,
>+				  size_t buflen)
>+{
>+	struct nvme_command c = { };
>+	int ret;
>+
>+	c.zmr.opcode = nvme_cmd_zone_mgmt_recv;
>+	c.zmr.nsid = cpu_to_le32(ns->head->ns_id);
>+	c.zmr.slba = cpu_to_le64(nvme_sect_to_lba(ns, sector));
>+	c.zmr.numd = cpu_to_le32(nvme_bytes_to_numd(buflen));
>+	c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>+	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>+	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>+
>+	ret = nvme_submit_sync_cmd(ns->queue, &c, report, buflen);
>+	if (ret)
>+		return ret;
>+
>+	return le64_to_cpu(report->nr_zones);
>+}
>+
>+static int nvme_zone_parse_entry(struct nvme_ns *ns,
>+				 struct nvme_zone_descriptor *entry,
>+				 unsigned int idx, report_zones_cb cb,
>+				 void *data)
>+{
>+	struct blk_zone zone = { };
>+
>+	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
>+		dev_err(ns->ctrl->device, "invalid zone type %#x\n",
>+				entry->zt);
>+		return -EINVAL;
>+	}
>+
>+	zone.type = BLK_ZONE_TYPE_SEQWRITE_REQ;
>+	zone.cond = entry->zs >> 4;
>+	zone.len = ns->zsze;
>+	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>+	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>+	zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>+
>+	return cb(&zone, idx, data);
>+}
>+
>+static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>+			unsigned int nr_zones, report_zones_cb cb, void *data)
>+{
>+	struct nvme_zone_report *report;
>+	int ret, zone_idx = 0;
>+	unsigned int nz, i;
>+	size_t buflen;
>+
>+	report = nvme_zns_alloc_report_buffer(ns, nr_zones, &buflen);
>+	if (!report)
>+		return -ENOMEM;
>+
>+	sector &= ~(ns->zsze - 1);
>+	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>+		memset(report, 0, buflen);
>+		ret = __nvme_ns_report_zones(ns, sector, report, buflen);
>+		if (ret < 0)
>+			goto out_free;
>+
>+		nz = min_t(unsigned int, ret, nr_zones);
>+		if (!nz)
>+			break;
>+
>+		for (i = 0; i < nz && zone_idx < nr_zones; i++) {
>+			ret = nvme_zone_parse_entry(ns, &report->entries[i],
>+						    zone_idx, cb, data);
>+			if (ret)
>+				goto out_free;
>+			zone_idx++;
>+		}
>+
>+		sector += ns->zsze * nz;
>+	}
>+
>+	ret = zone_idx;
>+out_free:
>+	kvfree(report);
>+	return ret;
>+}
>+
>+int nvme_report_zones(struct gendisk *disk, sector_t sector,
>+		      unsigned int nr_zones, report_zones_cb cb, void *data)
>+{
>+	struct nvme_ns_head *head = NULL;
>+	struct nvme_ns *ns;
>+	int srcu_idx, ret;
>+
>+	ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
>+	if (unlikely(!ns))
>+		return -EWOULDBLOCK;
>+
>+	if (ns->head->ids.csi == NVME_CSI_ZNS)
>+		ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>+	else
>+		ret = -EINVAL;
>+	nvme_put_ns_from_disk(head, srcu_idx);
>+
>+	return ret;
>+}
>+
>+blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>+		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>+{
>+	c->zms.opcode = nvme_cmd_zone_mgmt_send;
>+	c->zms.nsid = cpu_to_le32(ns->head->ns_id);
>+	c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
>+	c->zms.zsa = action;
>+
>+	if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
>+		c->zms.select_all = 1;
>+
>+	return BLK_STS_OK;
>+}
>diff --git a/include/linux/nvme.h b/include/linux/nvme.h
>index 95cd03e240a1..d862e5d70818 100644
>--- a/include/linux/nvme.h
>+++ b/include/linux/nvme.h
>@@ -1,6 +1,7 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> /*
>- * Definitions for the NVM Express interface
>+ * Definitions for the NVM Ex
>+ * ress interface
>  * Copyright (c) 2011-2014, Intel Corporation.
>  */
>
>@@ -374,6 +375,30 @@ struct nvme_id_ns {
> 	__u8			vs[3712];
> };
>
>+struct nvme_zns_lbafe {
>+	__le64			zsze;
>+	__u8			zdes;
>+	__u8			rsvd9[7];
>+};
>+
>+struct nvme_id_ns_zns {
>+	__le16			zoc;
>+	__le16			ozcs;
>+	__le32			mar;
>+	__le32			mor;
>+	__le32			rrl;
>+	__le32			frl;
>+	__u8			rsvd20[2796];
>+	struct nvme_zns_lbafe	lbafe[16];
>+	__u8			rsvd3072[768];
>+	__u8			vs[256];
>+};
>+
>+struct nvme_id_ctrl_zns {
>+	__u8	zamds;
>+	__u8	rsvd1[4095];
>+};
>+
> enum {
> 	NVME_ID_CNS_NS			= 0x00,
> 	NVME_ID_CNS_CTRL		= 0x01,
>@@ -392,6 +417,7 @@ enum {
>
> enum {
> 	NVME_CSI_NVM			= 0,
>+	NVME_CSI_ZNS			= 2,
> };
>
> enum {
>@@ -532,6 +558,27 @@ struct nvme_ana_rsp_hdr {
> 	__le16	rsvd10[3];
> };
>
>+struct nvme_zone_descriptor {
>+	__u8		zt;
>+	__u8		zs;
>+	__u8		za;
>+	__u8		rsvd3[5];
>+	__le64		zcap;
>+	__le64		zslba;
>+	__le64		wp;
>+	__u8		rsvd32[32];
>+};
>+
>+enum {
>+	NVME_ZONE_TYPE_SEQWRITE_REQ	= 0x2,
>+};
>+
>+struct nvme_zone_report {
>+	__le64		nr_zones;
>+	__u8		resv8[56];
>+	struct nvme_zone_descriptor entries[];
>+};
>+
> enum {
> 	NVME_SMART_CRIT_SPARE		= 1 << 0,
> 	NVME_SMART_CRIT_TEMPERATURE	= 1 << 1,
>@@ -626,6 +673,9 @@ enum nvme_opcode {
> 	nvme_cmd_resv_report	= 0x0e,
> 	nvme_cmd_resv_acquire	= 0x11,
> 	nvme_cmd_resv_release	= 0x15,
>+	nvme_cmd_zone_mgmt_send	= 0x79,
>+	nvme_cmd_zone_mgmt_recv	= 0x7a,
>+	nvme_cmd_zone_append	= 0x7d,
> };
>
> #define nvme_opcode_name(opcode)	{ opcode, #opcode }
>@@ -764,6 +814,7 @@ struct nvme_rw_command {
> enum {
> 	NVME_RW_LR			= 1 << 15,
> 	NVME_RW_FUA			= 1 << 14,
>+	NVME_RW_APPEND_PIREMAP		= 1 << 9,
> 	NVME_RW_DSM_FREQ_UNSPEC		= 0,
> 	NVME_RW_DSM_FREQ_TYPICAL	= 1,
> 	NVME_RW_DSM_FREQ_RARE		= 2,
>@@ -829,6 +880,53 @@ struct nvme_write_zeroes_cmd {
> 	__le16			appmask;
> };
>
>+enum nvme_zone_mgmt_action {
>+	NVME_ZONE_CLOSE		= 0x1,
>+	NVME_ZONE_FINISH	= 0x2,
>+	NVME_ZONE_OPEN		= 0x3,
>+	NVME_ZONE_RESET		= 0x4,
>+	NVME_ZONE_OFFLINE	= 0x5,
>+	NVME_ZONE_SET_DESC_EXT	= 0x10,
>+};
>+
>+struct nvme_zone_mgmt_send_cmd {
>+	__u8			opcode;
>+	__u8			flags;
>+	__u16			command_id;
>+	__le32			nsid;
>+	__le32			cdw2[2];
>+	__le64			metadata;
>+	union nvme_data_ptr	dptr;
>+	__le64			slba;
>+	__le32			cdw12;
>+	__u8			zsa;
>+	__u8			select_all;
>+	__u8			rsvd13[2];
>+	__le32			cdw14[2];
>+};
>+
>+struct nvme_zone_mgmt_recv_cmd {
>+	__u8			opcode;
>+	__u8			flags;
>+	__u16			command_id;
>+	__le32			nsid;
>+	__le64			rsvd2[2];
>+	union nvme_data_ptr	dptr;
>+	__le64			slba;
>+	__le32			numd;
>+	__u8			zra;
>+	__u8			zrasf;
>+	__u8			pr;
>+	__u8			rsvd13;
>+	__le32			cdw14[2];
>+};
>+
>+enum {
>+	NVME_ZRA_ZONE_REPORT		= 0,
>+	NVME_ZRASF_ZONE_REPORT_ALL	= 0,
>+	NVME_REPORT_ZONE_PARTIAL	= 1,
>+};
>+
> /* Features */
>
> enum {
>@@ -1300,6 +1398,8 @@ struct nvme_command {
> 		struct nvme_format_cmd format;
> 		struct nvme_dsm_cmd dsm;
> 		struct nvme_write_zeroes_cmd write_zeroes;
>+		struct nvme_zone_mgmt_send_cmd zms;
>+		struct nvme_zone_mgmt_recv_cmd zmr;
> 		struct nvme_abort_cmd abort;
> 		struct nvme_get_log_page_command get_log_page;
> 		struct nvmf_common_command fabrics;
>@@ -1433,6 +1533,18 @@ enum {
> 	NVME_SC_DISCOVERY_RESTART	= 0x190,
> 	NVME_SC_AUTH_REQUIRED		= 0x191,
>
>+	/*
>+	 * I/O Command Set Specific - Zoned commands:
>+	 */
>+	NVME_SC_ZONE_BOUNDARY_ERROR	= 0x1b8,
>+	NVME_SC_ZONE_FULL		= 0x1b9,
>+	NVME_SC_ZONE_READ_ONLY		= 0x1ba,
>+	NVME_SC_ZONE_OFFLINE		= 0x1bb,
>+	NVME_SC_ZONE_INVALID_WRITE	= 0x1bc,
>+	NVME_SC_ZONE_TOO_MANY_ACTIVE	= 0x1bd,
>+	NVME_SC_ZONE_TOO_MANY_OPEN	= 0x1be,
>+	NVME_SC_ZONE_INVALID_TRANSITION	= 0x1bf,
>+
> 	/*
> 	 * Media and Data Integrity Errors:
> 	 */
>-- 
>2.24.1
>
>
>_______________________________________________
>Linux-nvme mailing list
>Linux-nvme@lists.infradead.org
>http://lists.infradead.org/mailman/listinfo/linux-nvme

Looks good!

Reviewed-by: Javier González <javier.gonz@samsung.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23 23:17         ` Sagi Grimberg
@ 2020-06-24 17:25           ` Keith Busch
  2020-06-24 17:46             ` Sagi Grimberg
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2020-06-24 17:25 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Tue, Jun 23, 2020 at 04:17:30PM -0700, Sagi Grimberg wrote:
> 
> > > > > -		len = nvme_process_ns_desc(ctrl, ids, cur);
> > > > > +		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
> > > > >    		if (len < 0)
> > > > >    			goto free_data;
> > > > >    		len += sizeof(*cur);
> > > > >    	}
> > > > >    free_data:
> > > > > +	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
> > > > 
> > > > We will clear the status if we detect a path error, that is to
> > > > avoid needlessly removing the ns for path failures, so you should
> > > > check at the goto site.
> > > 
> > > The problem is that this check has to be done after checking all the ns descs,
> > > so this check to be done as the final thing, at least after processing all the
> > > ns descs. No matter if nvme_process_ns_desc() returned an error, or if
> > > simply NVME_NIDT_CSI wasn't part of the ns desc list, so the loop reached the
> > > end without error.
> > > 
> > > Even if the nvme command failed and the status was cleared:
> > > 
> > >                  if (status > 0 && !(status & NVME_SC_DNR))
> > >                          status = 0;
> > 
> > This check is so weird. What has DNR got to do with whether or not we
> > want to continue with this namespace? The commit that adds this says
> > it's to check for a host failed IO, but a controller can just as validly
> > set DNR in its error status, in which case we'd still want clear the
> > status.
> 
> The reason is that if this error is not a DNR error, it means that we
> should retry and succeed, we don't want to remove the namespace.

And what if it is a DNR error? For example, the controller simply
doesn't support this CNS value. While the controller should support it,
we definitely don't need it for NVM command set namespaces.
 
> The problem that this solves is the fact that we have various error
> recovery conditions (interconnect issues, failures, resets) and the
> async works are designed to continue to run, issue I/O and fail. The
> scan work, will revalidate namespaces and remove them if it fails to
> do so, which is inherently wrong if these are simply inaccessible by
> the host at this time.

Relying on a specific bit in the status code seems a bit indirect for
this kind of condition.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 17:25           ` Keith Busch
@ 2020-06-24 17:46             ` Sagi Grimberg
  2020-06-24 18:03               ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-24 17:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner



On 6/24/20 10:25 AM, Keith Busch wrote:
> On Tue, Jun 23, 2020 at 04:17:30PM -0700, Sagi Grimberg wrote:
>>
>>>>>> -		len = nvme_process_ns_desc(ctrl, ids, cur);
>>>>>> +		len = nvme_process_ns_desc(ctrl, ids, cur, &csi_seen);
>>>>>>     		if (len < 0)
>>>>>>     			goto free_data;
>>>>>>     		len += sizeof(*cur);
>>>>>>     	}
>>>>>>     free_data:
>>>>>> +	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
>>>>>
>>>>> We will clear the status if we detect a path error, that is to
>>>>> avoid needlessly removing the ns for path failures, so you should
>>>>> check at the goto site.
>>>>
>>>> The problem is that this check has to be done after checking all the ns descs,
>>>> so this check to be done as the final thing, at least after processing all the
>>>> ns descs. No matter if nvme_process_ns_desc() returned an error, or if
>>>> simply NVME_NIDT_CSI wasn't part of the ns desc list, so the loop reached the
>>>> end without error.
>>>>
>>>> Even if the nvme command failed and the status was cleared:
>>>>
>>>>                   if (status > 0 && !(status & NVME_SC_DNR))
>>>>                           status = 0;
>>>
>>> This check is so weird. What has DNR got to do with whether or not we
>>> want to continue with this namespace? The commit that adds this says
>>> it's to check for a host failed IO, but a controller can just as validly
>>> set DNR in its error status, in which case we'd still want clear the
>>> status.
>>
>> The reason is that if this error is not a DNR error, it means that we
>> should retry and succeed, we don't want to remove the namespace.
> 
> And what if it is a DNR error? For example, the controller simply
> doesn't support this CNS value. While the controller should support it,
> we definitely don't need it for NVM command set namespaces.

Maybe I mis-undersatnd the comment, but if you see a DNR error, it means
that the controller replied an error and its final, so then you can have
extra checks.

I think the point here is that if we got a non-dnr status, we should not
take any actions on this namespace because we need to retry first
(either because the controller is unavailable or it needs the host to
retry for another reason).

>> The problem that this solves is the fact that we have various error
>> recovery conditions (interconnect issues, failures, resets) and the
>> async works are designed to continue to run, issue I/O and fail. The
>> scan work, will revalidate namespaces and remove them if it fails to
>> do so, which is inherently wrong if these are simply inaccessible by
>> the host at this time.
> 
> Relying on a specific bit in the status code seems a bit indirect for
> this kind of condition.

I actually think this approach covers exactly what we are trying to
achieve. But I'll let others comment on this.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 17:46             ` Sagi Grimberg
@ 2020-06-24 18:03               ` Keith Busch
  2020-06-24 18:28                 ` Sagi Grimberg
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2020-06-24 18:03 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Wed, Jun 24, 2020 at 10:46:03AM -0700, Sagi Grimberg wrote:
> On 6/24/20 10:25 AM, Keith Busch wrote:
> > On Tue, Jun 23, 2020 at 04:17:30PM -0700, Sagi Grimberg wrote:
> > And what if it is a DNR error? For example, the controller simply
> > doesn't support this CNS value. While the controller should support it,
> > we definitely don't need it for NVM command set namespaces.
> 
> Maybe I mis-undersatnd the comment, but if you see a DNR error, it means
> that the controller replied an error and its final, so then you can have
> extra checks.

If the controller does not support the CNS value, it may return Invalid
Field with DNR set. That error currently gets propogated back to
nvme_init_ns_head(), which then abandons the namespace. Just as the code
coments say, we had been historically been clearing such errors because
we have other ways to identify the namespace, but now we're not clearing
that error.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 18:03               ` Keith Busch
@ 2020-06-24 18:28                 ` Sagi Grimberg
  2020-06-24 18:33                   ` Sagi Grimberg
  0 siblings, 1 reply; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-24 18:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner


>>> And what if it is a DNR error? For example, the controller simply
>>> doesn't support this CNS value. While the controller should support it,
>>> we definitely don't need it for NVM command set namespaces.
>>
>> Maybe I mis-undersatnd the comment, but if you see a DNR error, it means
>> that the controller replied an error and its final, so then you can have
>> extra checks.
> 
> If the controller does not support the CNS value, it may return Invalid
> Field with DNR set. That error currently gets propogated back to
> nvme_init_ns_head(), which then abandons the namespace. Just as the code
> coments say, we had been historically been clearing such errors because
> we have other ways to identify the namespace, but now we're not clearing
> that error.

I don't understand what you are saying Keith.

My comment was for this block:
--
	if (!status && nvme_multi_css(ctrl) && !csi_seen) {
		dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
			 nsid);
		status = -EINVAL;
	}
--

I was saying that !status doesn't necessarily mean success, but it can
also mean that its an retry-able error status (due to transport or
controller). If we see a retry-able error we should still clear the
status so we don't abandon the namespace.

This for example would achieve that:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ba512c45b40f..46d8a8379aff 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1126,12 +1126,21 @@ static int nvme_identify_ns_descs(struct 
nvme_ctrl *ctrl, unsigned nsid,
         if (status) {
                 dev_warn(ctrl->device,
                         "Identify Descriptors failed (%d)\n", status);
-                /*
-                 * Don't treat an error as fatal, as we potentially already
-                 * have a NGUID or EUI-64.
-                 */
-               if (status > 0 && !(status & NVME_SC_DNR))
-                       status = 0;
+
+               if (status > 0 && !(status & NVME_SC_DNR)) {
+                       if (nvme_multi_css(ctrl) && !csi_seen) {
+                               dev_warn(ctrl->device,
+                                       "Command set not reported for 
nsid:%d\n",
+                                       nsid);
+                               status = -EINVAL;
+                       } else {
+                               /*
+                                * Don't treat an error as fatal, as we
+                                * potentially already have a NGUID or 
EUI-64.
+                                */
+                               status = 0;
+                       }
+               }
                 goto free_data;
         }
--

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 18:28                 ` Sagi Grimberg
@ 2020-06-24 18:33                   ` Sagi Grimberg
  2020-06-24 18:40                     ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-24 18:33 UTC (permalink / raw)
  To: Keith Busch
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner


>> If the controller does not support the CNS value, it may return Invalid
>> Field with DNR set. That error currently gets propogated back to
>> nvme_init_ns_head(), which then abandons the namespace. Just as the code
>> coments say, we had been historically been clearing such errors because
>> we have other ways to identify the namespace, but now we're not clearing
>> that error.
> 
> I don't understand what you are saying Keith.
> 
> My comment was for this block:
> -- 
>      if (!status && nvme_multi_css(ctrl) && !csi_seen) {
>          dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
>               nsid);
>          status = -EINVAL;
>      }
> -- 
> 
> I was saying that !status doesn't necessarily mean success, but it can
> also mean that its an retry-able error status (due to transport or
> controller). If we see a retry-able error we should still clear the
> status so we don't abandon the namespace.
> 
> This for example would achieve that:

Sorry, meant this:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ba512c45b40f..3187cf768d08 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1130,8 +1130,14 @@ static int nvme_identify_ns_descs(struct 
nvme_ctrl *ctrl, unsigned nsid,
                   * Don't treat an error as fatal, as we potentially 
already
                   * have a NGUID or EUI-64.
                   */
-               if (status > 0 && !(status & NVME_SC_DNR))
+               if (status > 0 && !(status & NVME_SC_DNR)) {
                         status = 0;
+               } else if (status == 0 && nvme_multi_css(ctrl) && 
!csi_seen) {
+                               dev_warn(ctrl->device,
+                                       "Command set not reported for 
nsid:%d\n",
+                                       nsid);
+                               status = -EINVAL;
+               }
                 goto free_data;
         }
--

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 18:33                   ` Sagi Grimberg
@ 2020-06-24 18:40                     ` Keith Busch
  2020-06-24 19:03                       ` Sagi Grimberg
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2020-06-24 18:40 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Wed, Jun 24, 2020 at 11:33:25AM -0700, Sagi Grimberg wrote:
> > > If the controller does not support the CNS value, it may return Invalid
> > > Field with DNR set. That error currently gets propogated back to
> > > nvme_init_ns_head(), which then abandons the namespace. Just as the code
> > > coments say, we had been historically been clearing such errors because
> > > we have other ways to identify the namespace, but now we're not clearing
> > > that error.
> > 
> > I don't understand what you are saying Keith.
> > 
> > My comment was for this block:
> > -- 
> >      if (!status && nvme_multi_css(ctrl) && !csi_seen) {
> >          dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
> >               nsid);
> >          status = -EINVAL;
> >      }
> > -- 
> > 
> > I was saying that !status doesn't necessarily mean success, but it can
> > also mean that its an retry-able error status (due to transport or
> > controller). If we see a retry-able error we should still clear the
> > status so we don't abandon the namespace.
> > 
> > This for example would achieve that:

We're not talking about the same thing. I am only talking about what
introduced the DNR check, from commit 59c7c3caaaf87.

I know you added it because you want to abort comparing identifiers on a
rescan when retrieving the identifiers failed. That's fine, but I am
saying this fails namespace creation in the first place for some types
of devices that used to succeed.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 18:40                     ` Keith Busch
@ 2020-06-24 19:03                       ` Sagi Grimberg
  2020-06-24 21:49                         ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-24 19:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner


>>>> If the controller does not support the CNS value, it may return Invalid
>>>> Field with DNR set. That error currently gets propogated back to
>>>> nvme_init_ns_head(), which then abandons the namespace. Just as the code
>>>> coments say, we had been historically been clearing such errors because
>>>> we have other ways to identify the namespace, but now we're not clearing
>>>> that error.
>>>
>>> I don't understand what you are saying Keith.
>>>
>>> My comment was for this block:
>>> -- 
>>>       if (!status && nvme_multi_css(ctrl) && !csi_seen) {
>>>           dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
>>>                nsid);
>>>           status = -EINVAL;
>>>       }
>>> -- 
>>>
>>> I was saying that !status doesn't necessarily mean success, but it can
>>> also mean that its an retry-able error status (due to transport or
>>> controller). If we see a retry-able error we should still clear the
>>> status so we don't abandon the namespace.
>>>
>>> This for example would achieve that:
> 
> We're not talking about the same thing. I am only talking about what
> introduced the DNR check, from commit 59c7c3caaaf87.
> 
> I know you added it because you want to abort comparing identifiers on a
> rescan when retrieving the identifiers failed. That's fine, but I am
> saying this fails namespace creation in the first place for some types
> of devices that used to succeed.

OK, now I think I understand (thanks for clarifying that the discussion
is not on patch 3/5, but rather on 59c7c3caaaf87).

So the original proposal was to check NVME_SC_HOST_PATH_ERROR (and now
we have NVME_SC_HOST_ABORTED_CMD) but with the review Christoph gave
it seemed to make more sense that we generalize and check the DNR bit.

We could restrict it back to checking the status is
NVME_SC_HOST_PATH_ERROR or NVME_SC_HOST_ABORTED_CMD if you think it
creates problems. However, if we keep the code as is, the original
comment still holds.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 19:03                       ` Sagi Grimberg
@ 2020-06-24 21:49                         ` Keith Busch
  2020-06-24 22:54                           ` Sagi Grimberg
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2020-06-24 21:49 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Wed, Jun 24, 2020 at 12:03:41PM -0700, Sagi Grimberg wrote:
> 
> > > > > If the controller does not support the CNS value, it may return Invalid
> > > > > Field with DNR set. That error currently gets propogated back to
> > > > > nvme_init_ns_head(), which then abandons the namespace. Just as the code
> > > > > coments say, we had been historically been clearing such errors because
> > > > > we have other ways to identify the namespace, but now we're not clearing
> > > > > that error.
> > > > 
> > > > I don't understand what you are saying Keith.
> > > > 
> > > > My comment was for this block:
> > > > -- 
> > > >       if (!status && nvme_multi_css(ctrl) && !csi_seen) {
> > > >           dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
> > > >                nsid);
> > > >           status = -EINVAL;
> > > >       }
> > > > -- 
> > > > 
> > > > I was saying that !status doesn't necessarily mean success, but it can
> > > > also mean that its an retry-able error status (due to transport or
> > > > controller). If we see a retry-able error we should still clear the
> > > > status so we don't abandon the namespace.
> > > > 
> > > > This for example would achieve that:
> > 
> > We're not talking about the same thing. I am only talking about what
> > introduced the DNR check, from commit 59c7c3caaaf87.
> > 
> > I know you added it because you want to abort comparing identifiers on a
> > rescan when retrieving the identifiers failed. That's fine, but I am
> > saying this fails namespace creation in the first place for some types
> > of devices that used to succeed.
> 
> OK, now I think I understand (thanks for clarifying that the discussion
> is not on patch 3/5, but rather on 59c7c3caaaf87).
> 
> So the original proposal was to check NVME_SC_HOST_PATH_ERROR (and now
> we have NVME_SC_HOST_ABORTED_CMD) but with the review Christoph gave
> it seemed to make more sense that we generalize and check the DNR bit.

Okay, I didn't question this approach when it first went through, so
sorry about this digression, but I really don't get how this DNR check
helps at all.

The code currently returns an error if DNR is set. Based on the commit
messages, it sounds like you need that error to skip comparing
identifiers through nvme's scan_work calling revalidate_disk():
suppressing the error has revalidate_disk() fail with -ENODEV when
comparing identifiers fails.

Since we do return the error when DNR is set, we skip comparing
identifiers and return blk_status_to_errno(nvme_error_status(ret))
instead. How is that errno an improvement?

And then if DNR is not set, we suppress the error and proceed with
comparing identifiers??

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 21:49                         ` Keith Busch
@ 2020-06-24 22:54                           ` Sagi Grimberg
  2020-06-24 23:54                             ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Sagi Grimberg @ 2020-06-24 22:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner



On 6/24/20 2:49 PM, Keith Busch wrote:
> On Wed, Jun 24, 2020 at 12:03:41PM -0700, Sagi Grimberg wrote:
>>
>>>>>> If the controller does not support the CNS value, it may return Invalid
>>>>>> Field with DNR set. That error currently gets propogated back to
>>>>>> nvme_init_ns_head(), which then abandons the namespace. Just as the code
>>>>>> coments say, we had been historically been clearing such errors because
>>>>>> we have other ways to identify the namespace, but now we're not clearing
>>>>>> that error.
>>>>>
>>>>> I don't understand what you are saying Keith.
>>>>>
>>>>> My comment was for this block:
>>>>> -- 
>>>>>        if (!status && nvme_multi_css(ctrl) && !csi_seen) {
>>>>>            dev_warn(ctrl->device, "Command set not reported for nsid:%d\n",
>>>>>                 nsid);
>>>>>            status = -EINVAL;
>>>>>        }
>>>>> -- 
>>>>>
>>>>> I was saying that !status doesn't necessarily mean success, but it can
>>>>> also mean that its an retry-able error status (due to transport or
>>>>> controller). If we see a retry-able error we should still clear the
>>>>> status so we don't abandon the namespace.
>>>>>
>>>>> This for example would achieve that:
>>>
>>> We're not talking about the same thing. I am only talking about what
>>> introduced the DNR check, from commit 59c7c3caaaf87.
>>>
>>> I know you added it because you want to abort comparing identifiers on a
>>> rescan when retrieving the identifiers failed. That's fine, but I am
>>> saying this fails namespace creation in the first place for some types
>>> of devices that used to succeed.
>>
>> OK, now I think I understand (thanks for clarifying that the discussion
>> is not on patch 3/5, but rather on 59c7c3caaaf87).
>>
>> So the original proposal was to check NVME_SC_HOST_PATH_ERROR (and now
>> we have NVME_SC_HOST_ABORTED_CMD) but with the review Christoph gave
>> it seemed to make more sense that we generalize and check the DNR bit.
> 
> Okay, I didn't question this approach when it first went through, so
> sorry about this digression, but I really don't get how this DNR check
> helps at all.
> 
> The code currently returns an error if DNR is set.

Right.

> Based on the commit
> messages, it sounds like you need that error to skip comparing
> identifiers through nvme's scan_work calling revalidate_disk():
> suppressing the error has revalidate_disk() fail with -ENODEV when
> comparing identifiers fails.

Your understanding is correct.

> Since we do return the error when DNR is set, we skip comparing
> identifiers and return blk_status_to_errno(nvme_error_status(ret))
> instead. How is that errno an improvement?

See nvme_revalidate_disk:
out:
         /*
          * Only fail the function if we got a fatal error back from the
          * device, otherwise ignore the error and just move on.
          */
         if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR)))
                 ret = 0;
         else if (ret > 0)
                 ret = blk_status_to_errno(nvme_error_status(ret));
         return ret;

So we don't actually propagate the error back if its a non-DNR (hence
retry-able error). The errno was there before in order to not leak NVMe
errors to the block layer.

> And then if DNR is not set, we suppress the error and proceed with
> comparing identifiers??

Wait, I think that I re-read it it's backwards. The intent was to indeed
clear the error for the DNR case and propagate the error for the non-DNR
case!

The code needs to be:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2afed32d3892..3e84ab6c2bd3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1130,7 +1130,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl 
*ctrl, unsigned nsid,
                   * Don't treat an error as fatal, as we potentially 
already
                   * have a NGUID or EUI-64.
                   */
-               if (status > 0 && !(status & NVME_SC_DNR))
+               if (status > 0 && (status & NVME_SC_DNR))
                         status = 0;
                 goto free_data;
         }
--

This way, if the controller failed the identify it will be with DNR
status and we will silently ignore, and if the transport failed its
a non-DNR status, and we propagate the status back, skip the id compare,
and then silently ignore the error in nvme_revalidate_disk (as above).

Looking into the original fix we had internally, this was the case, and
got fat-fingered in I can only assume...

Will send a fix right away, good catch keith!

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-24 22:54                           ` Sagi Grimberg
@ 2020-06-24 23:54                             ` Keith Busch
  0 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2020-06-24 23:54 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Niklas Cassel, linux-nvme, hch, linux-block, axboe,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjorling, Daniel Wagner

On Wed, Jun 24, 2020 at 03:54:05PM -0700, Sagi Grimberg wrote:
> The code needs to be:
> --
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2afed32d3892..3e84ab6c2bd3 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1130,7 +1130,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl
> *ctrl, unsigned nsid,
>                   * Don't treat an error as fatal, as we potentially already
>                   * have a NGUID or EUI-64.
>                   */
> -               if (status > 0 && !(status & NVME_SC_DNR))
> +               if (status > 0 && (status & NVME_SC_DNR))
>                         status = 0;
>                 goto free_data;
>         }
> --

Aha, I was assuming the code was the way you wanted it, hence my
confusion :)

The above makes sense walking through different scenarios. I needed
to reconcile this in order to understand how we'll address it with
Nikla's patch that start this conversation.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support
  2020-06-23  8:53   ` Sagi Grimberg
  2020-06-23 11:25     ` Niklas Cassel
@ 2020-06-26  8:54     ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-06-26  8:54 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, linux-nvme, hch, linux-block, axboe, Niklas Cassel,
	Javier González, Martin K . Petersen, Johannes Thumshirn,
	Matias Bjørling, Daniel Wagner

On Tue, Jun 23, 2020 at 01:53:47AM -0700, Sagi Grimberg wrote:
>>   	if (ns->lba_shift == 0)
>>   		ns->lba_shift = 9;
>>   +	switch (ns->head->ids.csi) {
>> +	case NVME_CSI_NVM:
>> +		break;
>> +	default:
>> +		dev_warn(ctrl->device, "unknown csi:%d ns:%d\n",
>> +			ns->head->ids.csi, ns->head->ns_id);
>> +		return -ENODEV;
>> +	}
>
> Not sure we need a switch-case statement for a single case target...

I think a switch makes inherent sense when there is an identifier that
can have multiple values, even if there only is one for now.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 1/5] block: add capacity field to zone descriptors
  2020-06-22 16:25 ` [PATCHv3 1/5] block: add capacity field to zone descriptors Keith Busch
  2020-06-23  6:15   ` Hannes Reinecke
  2020-06-23  8:44   ` Sagi Grimberg
@ 2020-06-26 12:17   ` Jens Axboe
  2 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-26 12:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux NVMe, Christoph Hellwig, sagig, linux-block,
	Matias Bjørling, Chaitanya Kulkarni, Javier González,
	Daniel Wagner, Johannes Thumshirn, Martin K . Petersen

On Mon, Jun 22, 2020 at 10:25 AM Keith Busch <kbusch@kernel.org> wrote:
>
> From: Matias Bjørling <matias.bjorling@wdc.com>
>
> In the zoned storage model, the sectors within a zone are typically all
> writeable. With the introduction of the Zoned Namespace (ZNS) Command
> Set in the NVM Express organization, the model was extended to have a
> specific writeable capacity.
>
> Extend the zone descriptor data structure with a zone capacity field to
> indicate to the user how many sectors in a zone are writeable.
>
> Introduce backward compatibility in the zone report ioctl by extending
> the zone report header data structure with a flags field to indicate if
> the capacity field is available.

Reviewed-by: Jens Axboe <axboe@kernel.dk>

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHv3 5/5] nvme: support for zoned namespaces
  2020-06-22 16:25 ` [PATCHv3 5/5] nvme: support for zoned namespaces Keith Busch
                     ` (3 preceding siblings ...)
  2020-06-24  9:11   ` Javier González
@ 2020-06-29 13:53   ` Johannes Thumshirn
  4 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2020-06-29 13:53 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, hch, sagi, linux-block, axboe
  Cc: Keith Busch, Martin K . Petersen, Hans Holmberg, Dmitry Fomichev,
	Ajay Joshi, Aravind Ramesh, Niklas Cassel, Matias Bjorling,
	Damien Le Moal

On 22/06/2020 18:25, Keith Busch wrote:
> +			le_to_cpu16(id->zoc), ns->head->ns_id);

That should probably be le16_to_cpu()

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, back to index

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-22 16:25 [PATCHv3 0/5] nvme support for zoned namespace command set Keith Busch
2020-06-22 16:25 ` [PATCHv3 1/5] block: add capacity field to zone descriptors Keith Busch
2020-06-23  6:15   ` Hannes Reinecke
2020-06-23  8:44   ` Sagi Grimberg
2020-06-26 12:17   ` Jens Axboe
2020-06-22 16:25 ` [PATCHv3 2/5] null_blk: introduce zone capacity for zoned device Keith Busch
2020-06-23  6:16   ` Hannes Reinecke
2020-06-23  8:45   ` Sagi Grimberg
2020-06-22 16:25 ` [PATCHv3 3/5] nvme: implement I/O Command Sets Command Set support Keith Busch
2020-06-23  6:20   ` Hannes Reinecke
2020-06-23  9:20     ` Niklas Cassel
2020-06-23 14:25       ` Keith Busch
2020-06-23  8:53   ` Sagi Grimberg
2020-06-23 11:25     ` Niklas Cassel
2020-06-23 14:59       ` Keith Busch
2020-06-23 22:10       ` Keith Busch
2020-06-23 23:17         ` Sagi Grimberg
2020-06-24 17:25           ` Keith Busch
2020-06-24 17:46             ` Sagi Grimberg
2020-06-24 18:03               ` Keith Busch
2020-06-24 18:28                 ` Sagi Grimberg
2020-06-24 18:33                   ` Sagi Grimberg
2020-06-24 18:40                     ` Keith Busch
2020-06-24 19:03                       ` Sagi Grimberg
2020-06-24 21:49                         ` Keith Busch
2020-06-24 22:54                           ` Sagi Grimberg
2020-06-24 23:54                             ` Keith Busch
2020-06-23 23:20       ` Sagi Grimberg
2020-06-26  8:54     ` Christoph Hellwig
2020-06-22 16:25 ` [PATCHv3 4/5] nvme: support for multi-command set effects Keith Busch
2020-06-23  6:21   ` Hannes Reinecke
2020-06-23 17:43   ` Sagi Grimberg
2020-06-22 16:25 ` [PATCHv3 5/5] nvme: support for zoned namespaces Keith Busch
2020-06-22 16:48   ` Johannes Thumshirn
2020-06-23  6:23   ` Hannes Reinecke
2020-06-23 17:45   ` Sagi Grimberg
2020-06-24  9:11   ` Javier González
2020-06-29 13:53   ` Johannes Thumshirn

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git