* [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
@ 2020-09-23 18:20 Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
                   ` (14 more replies)
  0 siblings, 15 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

v3 -> v4

 - Fix bugs introduced in v2/v3 for QD > 1 operation. Now, all writes
   to a zone happen at the new write pointer variable, zone->w_ptr,
   which is advanced right after submitting the backend I/O. The
   existing zone->d.wp variable is updated upon successful write
   completion and is used for zone reporting. Some code has been split
   from the nvme_finalize_zoned_write() function into a new function,
   nvme_advance_zone_wp().

 - Make the code compile under mingw. Switch to using the QEMU API for
   mmap/msync, i.e. memory_region...(). Since mmap is not available in
   mingw (even though a mman-win32 library is available on GitHub),
   conditional compilation is added around these calls to avoid
   undefined symbols under mingw. A better fix would be to add stub
   functions to softmmu/memory.c for the case when CONFIG_POSIX is not
   defined (see the sketch after this changelog section), but such a
   change is beyond the scope of this patchset and can be made in a
   separate patch.

 - Correct permission mask used to open zone metadata file.

 - Fold "Define 64 bit cqe.result" patch into ZNS commit.

 - Use clz64/clz32 instead of defining an nvme_ilog2() function.

 - Simplify the rpt_empty_id_struct() code and move nvme_fill_data()
   back to the ZNS patch.

 - Fix a power-on processing bug.

 - Rename NVME_CMD_ZONE_APND to NVME_CMD_ZONE_APPEND.

 - Added the list of review comments addressed in v2 of the series
   (see below).
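
For illustration, here is a minimal sketch of the stub-function approach
mentioned in the mingw item above (an assumption, not part of this
series; it presumes the memory_region_msync() declaration from
include/exec/memory.h):

    /* softmmu/memory.c (sketch): provide a no-op fallback when
     * CONFIG_POSIX is not defined, so callers need no #ifdef of
     * their own. */
    #ifndef CONFIG_POSIX
    void memory_region_msync(MemoryRegion *mr, hwaddr addr, hwaddr size)
    {
        /* mmap()/msync() are unavailable on this host; nothing to sync */
    }
    #endif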

v2 -> v3:

 - Moved the nvme_fill_data() function to the NSTypes patch as it is
   now used there to output empty namespace identify structs.
 - Fixed a typo in Maxim's email address.

v1 -> v2:

 - Rebased on top of qemu-nvme/next branch.
 - Incorporated feedback from Klaus and Alistair.
    * Allow a subset of CSE log to be read, not the entire log
    * Assign admin command entries in CSE log to ACS fields
    * Set LPA bit 1 to indicate support of CSE log page
    * Rename CC.CSS value CSS_ALL_NSTYPES (110b) to CSS_CSI
    * Move the code to assign lbaf.ds to a separate patch
    * Remove the change in firmware revision
    * Change "driver" to "device" in comments and annotations
    * Rename ZAMDS to ZASL
    * Correct a few format expressions and some wording in
      trace event definitions
    * Remove validation code to return NVME_CAP_EXCEEDED error
    * Make ZASL equal to MDTS if the "zone_append_size_limit"
      module parameter is not set
    * Clean up nvme_zoned_init_ctrl() to make size calculations
      less confusing
    * Avoid changing module parameters; use separate n/s variables
      if additional calculations are necessary to convert parameters
      to running values
    * Use NVME_DEFAULT_ZONE_SIZE to assign the default zone size value
    * Use a default of 0 for zone capacity, meaning that zone capacity
      will be equal to zone size by default
    * Issue warnings if user MAR/MOR values are too large and have
      to be adjusted
    * Use unsigned values for MAR/MOR
 - Dropped "Simulate Zone Active excursions" patch.
   Excursion behavior may depend on the internal controller
   architecture and therefore be vendor-specific.
 - Dropped support for Zone Attributes and zoned AENs for now.
   These features can be added in a future series.
 - NS Types support is extended to handle active/inactive namespaces.
 - Update the write pointer after backing storage I/O completion, not
   before. This makes the emulation run correctly in case of
   backing device failures.
 - Avoid division in the I/O path if the device zone size is
   a power of two (the most common case). The zone index can then be
   calculated with a bit shift (see the sketch after this list).
 - A few reported bugs have been fixed.
 - Indentation in function definitions has been changed to match
   the rest of the code.
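
As an illustration of the power-of-two zone size optimization above,
here is a minimal sketch (field names follow the NvmeCtrl members added
later in this series; this is not a verbatim excerpt):

    /* Map an LBA to its zone index: a bit shift when the zone size is
     * a power of two, a division otherwise. */
    static uint32_t nvme_zone_idx(NvmeCtrl *n, uint64_t slba)
    {
        return (n->zone_size_log2 > 0) ? slba >> n->zone_size_log2 :
                                         slba / n->zone_size;
    }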


The Zoned Namespace (ZNS) Command Set is a newly introduced command set
published by NVM Express, Inc. as TP 4053. The main design goals of ZNS
are to provide hardware designers the means to reduce NVMe controller
complexity and to achieve better I/O latency and throughput. SSDs that
implement this interface are commonly known as ZNS SSDs.

This command set implements a zoned storage model, similar to ZAC/ZBC.
As such, it is already supported in Linux, allowing one to perform the
majority of tasks needed to manage ZNS SSDs.

The Zoned Namespace Command Set relies on another TP, known as
Namespace Types (NVMe TP 4056), which introduces support for having
multiple command sets per namespace.

Both the ZNS and Namespace Types specifications can be downloaded from
the following link -

https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip

This patch series adds Namespace Types support and zoned namespace
emulation capability to the existing NVMe PCI device.

The patchset is organized as follows -

The first several patches are preparatory and are added to allow for
an easier review of the subsequent commits. The group of patches that
follows adds NS Types support with only the NVM Command Set available.
Finally, the last group of commits adds the definitions and new code
needed to support the Zoned Namespace Command Set.

Based-on: Message-ID: <20200729220638.344477-17-its@irrelevant.dk>
Dmitry Fomichev (11):
  hw/block/nvme: Report actual LBA data shift in LBAF
  hw/block/nvme: Add Commands Supported and Effects log
  hw/block/nvme: Define trace events related to NS Types
  hw/block/nvme: Make Zoned NS Command Set definitions
  hw/block/nvme: Define Zoned NS Command Set trace events
  hw/block/nvme: Support Zoned Namespace Command Set
  hw/block/nvme: Introduce max active and open zone limits
  hw/block/nvme: Support Zone Descriptor Extensions
  hw/block/nvme: Add injection of Offline/Read-Only zones
  hw/block/nvme: Use zone metadata file for persistence
  hw/block/nvme: Document zoned parameters in usage text

Niklas Cassel (3):
  hw/block/nvme: Introduce the Namespace Types definitions
  hw/block/nvme: Add support for Namespace Types
  hw/block/nvme: Add support for active/inactive namespaces

 block/nvme.c          |    2 +-
 hw/block/nvme.c       | 1987 +++++++++++++++++++++++++++++++++++++++--
 hw/block/nvme.h       |  180 ++++
 hw/block/trace-events |   39 +
 include/block/nvme.h  |  210 ++++-
 5 files changed, 2351 insertions(+), 67 deletions(-)

-- 
2.21.0




* [PATCH v4 01/14] hw/block/nvme: Report actual LBA data shift in LBAF
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-24 12:12   ` Klaus Jensen
  2020-09-23 18:20 ` [PATCH v4 02/14] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Calculate the data shift value to report based on the configured value
of the logical_block_size device property.

In the process, use a local variable to calculate the LBA format
index instead of the hardcoded value 0. This makes the code more
readable and will make it easier to add support for multiple LBA
formats in the future.
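
As a worked example of the new calculation (not part of the patch):

    clz32(512) = 22  =>  ds = 31 - 22 = 9,  and 1 << 9 == 512

so for the default 512-byte logical block size the reported ds matches
BDRV_SECTOR_BITS, while non-default block sizes are now reported
correctly.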

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 63078f6009..f60e968c4a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -2203,6 +2203,7 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 {
     int64_t bs_size;
     NvmeIdNs *id_ns = &ns->id_ns;
+    int lba_index;
 
     bs_size = blk_getlength(n->conf.blk);
     if (bs_size < 0) {
@@ -2212,7 +2213,8 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 
     n->ns_size = bs_size;
 
-    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+    lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+    id_ns->lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
 
     /* no thin provisioning */
-- 
2.21.0




* [PATCH v4 02/14] hw/block/nvme: Add Commands Supported and Effects log
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 03/14] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Implementing this log page becomes necessary to allow checking for
Zone Append command support in the Zoned Namespace Command Set.

This commit adds the code to report this log page for NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.
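
For reference, the 4 KiB layout of the log page introduced below,
derived from the NvmeEffectsLog struct added in this patch:

    /* NvmeEffectsLog layout:
     *   acs[256]   admin command entries   256 * 4 = 1024 bytes
     *   iocs[256]  I/O command entries     256 * 4 = 1024 bytes
     *   resv       reserved                          2048 bytes
     *   total                                        4096 bytes
     * (verified by the QEMU_BUILD_BUG_ON() added in include/block/nvme.h)
     */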

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c       | 44 ++++++++++++++++++++++++++++++++++++++++++-
 hw/block/trace-events |  2 ++
 include/block/nvme.h  | 19 +++++++++++++++++++
 3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index f60e968c4a..7807f296d3 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -957,6 +957,46 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
                         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
+                                 uint64_t off, NvmeRequest *req)
+{
+    NvmeCmd *cmd = &req->cmd;
+    uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
+    uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
+    NvmeEffectsLog cmd_eff_log = {};
+    uint32_t *iocs = cmd_eff_log.iocs;
+    uint32_t *acs = cmd_eff_log.acs;
+    uint32_t trans_len;
+
+    trace_pci_nvme_cmd_supp_and_effects_log_read();
+
+    if (off >= sizeof(cmd_eff_log)) {
+        trace_pci_nvme_err_invalid_effects_log_offset(off);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    acs[NVME_ADM_CMD_DELETE_SQ] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_CREATE_SQ] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_DELETE_CQ] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_CREATE_CQ] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_IDENTIFY] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
+
+    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
+                                  NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+
+    trans_len = MIN(sizeof(cmd_eff_log) - off, buf_len);
+
+    return nvme_dma_prp(n, ((uint8_t *)&cmd_eff_log) + off, trans_len,
+                        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCmd *cmd = &req->cmd;
@@ -1000,6 +1040,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
         return nvme_smart_info(n, rae, len, off, req);
     case NVME_LOG_FW_SLOT_INFO:
         return nvme_fw_log_info(n, len, off, req);
+    case NVME_LOG_CMD_EFFECTS:
+        return nvme_cmd_effects(n, len, off, req);
     default:
         trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -2350,7 +2392,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->acl = 3;
     id->aerl = n->params.aerl;
     id->frmw = (NVME_NUM_FW_SLOTS << 1) | NVME_FRMW_SLOT1_RO;
-    id->lpa = NVME_LPA_EXTENDED;
+    id->lpa = NVME_LPA_CSE | NVME_LPA_EXTENDED;
 
     /* recommended default value (~70 C) */
     id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 72cf2d15cb..79c9da652d 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -83,6 +83,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
 pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
+pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
@@ -95,6 +96,7 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not w
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 65e68a82c8..71126ae1ff 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -734,10 +734,27 @@ enum NvmeSmartWarn {
     NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
 };
 
+typedef struct NvmeEffectsLog {
+    uint32_t    acs[256];
+    uint32_t    iocs[256];
+    uint8_t     resv[2048];
+} NvmeEffectsLog;
+
+enum {
+    NVME_CMD_EFFECTS_CSUPP             = 1 << 0,
+    NVME_CMD_EFFECTS_LBCC              = 1 << 1,
+    NVME_CMD_EFFECTS_NCC               = 1 << 2,
+    NVME_CMD_EFFECTS_NIC               = 1 << 3,
+    NVME_CMD_EFFECTS_CCC               = 1 << 4,
+    NVME_CMD_EFFECTS_CSE_MASK          = 3 << 16,
+    NVME_CMD_EFFECTS_UUID_SEL          = 1 << 19,
+};
+
 enum NvmeLogIdentifier {
     NVME_LOG_ERROR_INFO     = 0x01,
     NVME_LOG_SMART_INFO     = 0x02,
     NVME_LOG_FW_SLOT_INFO   = 0x03,
+    NVME_LOG_CMD_EFFECTS    = 0x05,
 };
 
 typedef struct QEMU_PACKED NvmePSD {
@@ -849,6 +866,7 @@ enum NvmeIdCtrlFrmw {
 };
 
 enum NvmeIdCtrlLpa {
+    NVME_LPA_CSE      = 1 << 1,
     NVME_LPA_EXTENDED = 1 << 2,
 };
 
@@ -1048,6 +1066,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
-- 
2.21.0




* [PATCH v4 03/14] hw/block/nvme: Introduce the Namespace Types definitions
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 02/14] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 04/14] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

Define the structures and constants required to implement
Namespace Types support.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c      |  2 +-
 hw/block/nvme.h      |  3 ++
 include/block/nvme.h | 74 +++++++++++++++++++++++++++++++++++---------
 3 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 7807f296d3..96cd520feb 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1259,7 +1259,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
      * here.
      */
     ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
-    ns_descrs->uuid.hdr.nidl = NVME_NIDT_UUID_LEN;
+    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
     stl_be_p(&ns_descrs->uuid.v, nsid);
 
     return nvme_dma_prp(n, list, NVME_IDENTIFY_DATA_SIZE, prp1, prp2,
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 52ba794f2e..c1deac9667 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -64,6 +64,9 @@ typedef struct NvmeCQueue {
 
 typedef struct NvmeNamespace {
     NvmeIdNs        id_ns;
+    uint32_t        nsid;
+    uint8_t         csi;
+    QemuUUID        uuid;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 71126ae1ff..74cc11782e 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -51,6 +51,11 @@ enum NvmeCapMask {
     CAP_PMR_MASK       = 0x1,
 };
 
+enum NvmeCapCssBits {
+    CAP_CSS_NVM        = 0x01,
+    CAP_CSS_CSI_SUPP   = 0x40,
+};
+
 #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
 #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)    & CAP_CQR_MASK)
 #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)    & CAP_AMS_MASK)
@@ -102,6 +107,12 @@ enum NvmeCcMask {
     CC_IOCQES_MASK  = 0xf,
 };
 
+enum NvmeCcCss {
+    CSS_NVM_ONLY        = 0,
+    CSS_CSI             = 6,
+    CSS_ADMIN_ONLY      = 7,
+};
+
 #define NVME_CC_EN(cc)     ((cc >> CC_EN_SHIFT)     & CC_EN_MASK)
 #define NVME_CC_CSS(cc)    ((cc >> CC_CSS_SHIFT)    & CC_CSS_MASK)
 #define NVME_CC_MPS(cc)    ((cc >> CC_MPS_SHIFT)    & CC_MPS_MASK)
@@ -110,6 +121,21 @@ enum NvmeCcMask {
 #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
 #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
 
+#define NVME_SET_CC_EN(cc, val)     \
+    (cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
+#define NVME_SET_CC_CSS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
+#define NVME_SET_CC_MPS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
+#define NVME_SET_CC_AMS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
+#define NVME_SET_CC_SHN(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
+#define NVME_SET_CC_IOSQES(cc, val) \
+    (cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
+#define NVME_SET_CC_IOCQES(cc, val) \
+    (cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
+
 enum NvmeCstsShift {
     CSTS_RDY_SHIFT      = 0,
     CSTS_CFS_SHIFT      = 1,
@@ -524,8 +550,13 @@ typedef struct QEMU_PACKED NvmeIdentify {
     uint64_t    rsvd2[2];
     uint64_t    prp1;
     uint64_t    prp2;
-    uint32_t    cns;
-    uint32_t    rsvd11[5];
+    uint8_t     cns;
+    uint8_t     rsvd10;
+    uint16_t    ctrlid;
+    uint16_t    nvmsetid;
+    uint8_t     rsvd11;
+    uint8_t     csi;
+    uint32_t    rsvd12[4];
 } NvmeIdentify;
 
 typedef struct QEMU_PACKED NvmeRwCmd {
@@ -645,6 +676,7 @@ enum NvmeStatusCodes {
     NVME_MD_SGL_LEN_INVALID     = 0x0010,
     NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
+    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
@@ -771,11 +803,15 @@ typedef struct QEMU_PACKED NvmePSD {
 
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
-enum {
-    NVME_ID_CNS_NS             = 0x0,
-    NVME_ID_CNS_CTRL           = 0x1,
-    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
-    NVME_ID_CNS_NS_DESCR_LIST  = 0x3,
+enum NvmeIdCns {
+    NVME_ID_CNS_NS                = 0x00,
+    NVME_ID_CNS_CTRL              = 0x01,
+    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x02,
+    NVME_ID_CNS_NS_DESCR_LIST     = 0x03,
+    NVME_ID_CNS_CS_NS             = 0x05,
+    NVME_ID_CNS_CS_CTRL           = 0x06,
+    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
+    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
@@ -922,6 +958,7 @@ enum NvmeFeatureIds {
     NVME_WRITE_ATOMICITY            = 0xa,
     NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
     NVME_TIMESTAMP                  = 0xe,
+    NVME_COMMAND_SET_PROFILE        = 0x19,
     NVME_SOFTWARE_PROGRESS_MARKER   = 0x80,
     NVME_FID_MAX                    = 0x100,
 };
@@ -1006,18 +1043,26 @@ typedef struct QEMU_PACKED NvmeIdNsDescr {
     uint8_t rsvd2[2];
 } NvmeIdNsDescr;
 
-enum {
-    NVME_NIDT_EUI64_LEN =  8,
-    NVME_NIDT_NGUID_LEN = 16,
-    NVME_NIDT_UUID_LEN  = 16,
+enum NvmeNsIdentifierLength {
+    NVME_NIDL_EUI64             = 8,
+    NVME_NIDL_NGUID             = 16,
+    NVME_NIDL_UUID              = 16,
+    NVME_NIDL_CSI               = 1,
 };
 
 enum NvmeNsIdentifierType {
-    NVME_NIDT_EUI64 = 0x1,
-    NVME_NIDT_NGUID = 0x2,
-    NVME_NIDT_UUID  = 0x3,
+    NVME_NIDT_EUI64             = 0x01,
+    NVME_NIDT_NGUID             = 0x02,
+    NVME_NIDT_UUID              = 0x03,
+    NVME_NIDT_CSI               = 0x04,
 };
 
+enum NvmeCsi {
+    NVME_CSI_NVM                = 0x00,
+};
+
+#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
+
 /*Deallocate Logical Block Features*/
 #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
 #define NVME_ID_NS_DLFEAT_WRITE_ZEROES(dlfeat)    ((dlfeat) & 0x08)
@@ -1068,6 +1113,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
-- 
2.21.0




* [PATCH v4 04/14] hw/block/nvme: Define trace events related to NS Types
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (2 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 03/14] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

A few trace events are defined that are relevant to implementing
Namespace Types (NVMe TP 4056).

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
---
 hw/block/trace-events | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 79c9da652d..2414dcbc79 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -46,8 +46,12 @@ pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size,
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
 pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, cqid=%"PRIu16""
 pci_nvme_identify_ctrl(void) "identify controller"
+pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", csi=0x%"PRIx8""
+pci_nvme_identify_cmd_set(void) "identify i/o command set"
 pci_nvme_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
 pci_nvme_getfeat(uint16_t cid, uint8_t fid, uint8_t sel, uint32_t cdw11) "cid %"PRIu16" fid 0x%"PRIx8" sel 0x%"PRIx8" cdw11 0x%"PRIx32""
@@ -84,6 +88,8 @@ pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
+pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
+pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
@@ -97,6 +103,9 @@ pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
+pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
+pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
+pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
@@ -152,6 +161,7 @@ pci_nvme_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for
 pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
+pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
 
 # xen-block.c
 xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
-- 
2.21.0




* [PATCH v4 05/14] hw/block/nvme: Add support for Namespace Types
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (3 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 04/14] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces Dmitry Fomichev
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for this command set and enable
detection of the NVM Command Set.

The new workflows for identify commands rely heavily on zero-filled
identify structs. E.g., certain CNS commands are defined to return
a zero-filled identify struct when an inactive namespace NSID
is supplied.

Add a helper function in order to avoid code duplication when
reporting zero-filled identify structures.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 204 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 184 insertions(+), 20 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 96cd520feb..e0f885498d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1153,6 +1153,15 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, uint64_t prp1,
+                                         uint64_t prp2, NvmeRequest *req)
+{
+    uint8_t id[NVME_IDENTIFY_DATA_SIZE] = {};
+
+    return nvme_dma_prp(n, id, sizeof(id), prp1, prp2,
+                        DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1165,6 +1174,21 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
                         prp2, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+
+    trace_pci_nvme_identify_ctrl_csi(c->csi);
+
+    if (c->csi == NVME_CSI_NVM) {
+        return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
+    }
+
+    return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeNamespace *ns;
@@ -1181,11 +1205,37 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
     }
 
     ns = &n->namespaces[nsid - 1];
+    assert(nsid == ns->nsid);
 
     return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns), prp1,
                         prp2, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    NvmeNamespace *ns;
+    uint32_t nsid = le32_to_cpu(c->nsid);
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+
+    trace_pci_nvme_identify_ns_csi(nsid, c->csi);
+
+    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+        trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
+        return NVME_INVALID_NSID | NVME_DNR;
+    }
+
+    ns = &n->namespaces[nsid - 1];
+    assert(nsid == ns->nsid);
+
+    if (c->csi == NVME_CSI_NVM) {
+        return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
+    }
+
+    return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1225,23 +1275,51 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
     return ret;
 }
 
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+    uint32_t min_nsid = le32_to_cpu(c->nsid);
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+    uint32_t *list;
+    uint16_t ret;
+    int i, j = 0;
+
+    trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
+
+    if (c->csi != NVME_CSI_NVM) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    list = g_malloc0(data_len);
+    for (i = 0; i < n->num_namespaces; i++) {
+        if (i < min_nsid) {
+            continue;
+        }
+        list[j++] = cpu_to_le32(i + 1);
+        if (j == data_len / sizeof(uint32_t)) {
+            break;
+        }
+    }
+    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
+                       DMA_DIRECTION_FROM_DEVICE, req);
+    g_free(list);
+    return ret;
+}
+
 static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    NvmeNamespace *ns;
     uint32_t nsid = le32_to_cpu(c->nsid);
     uint64_t prp1 = le64_to_cpu(c->prp1);
     uint64_t prp2 = le64_to_cpu(c->prp2);
-
-    uint8_t list[NVME_IDENTIFY_DATA_SIZE];
-
-    struct data {
-        struct {
-            NvmeIdNsDescr hdr;
-            uint8_t v[16];
-        } uuid;
-    };
-
-    struct data *ns_descrs = (struct data *)list;
+    void *buf_ptr;
+    NvmeIdNsDescr *desc;
+    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+    uint8_t *buf;
+    uint16_t status;
 
     trace_pci_nvme_identify_ns_descr_list(nsid);
 
@@ -1250,7 +1328,11 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
-    memset(list, 0x0, sizeof(list));
+    ns = &n->namespaces[nsid - 1];
+    assert(nsid == ns->nsid);
+
+    buf = g_malloc0(data_len);
+    buf_ptr = buf;
 
     /*
      * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
@@ -1258,12 +1340,44 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
      * Namespace Identification Descriptor. Add a very basic Namespace UUID
      * here.
      */
-    ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
-    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
-    stl_be_p(&ns_descrs->uuid.v, nsid);
+    desc = buf_ptr;
+    desc->nidt = NVME_NIDT_UUID;
+    desc->nidl = NVME_NIDL_UUID;
+    buf_ptr += sizeof(*desc);
+    memcpy(buf_ptr, ns->uuid.data, NVME_NIDL_UUID);
+    buf_ptr += NVME_NIDL_UUID;
 
-    return nvme_dma_prp(n, list, NVME_IDENTIFY_DATA_SIZE, prp1, prp2,
-                        DMA_DIRECTION_FROM_DEVICE, req);
+    desc = buf_ptr;
+    desc->nidt = NVME_NIDT_CSI;
+    desc->nidl = NVME_NIDL_CSI;
+    buf_ptr += sizeof(*desc);
+    *(uint8_t *)buf_ptr = NVME_CSI_NVM;
+
+    status = nvme_dma_prp(n, buf, data_len, prp1, prp2,
+                          DMA_DIRECTION_FROM_DEVICE, req);
+    g_free(buf);
+    return status;
+}
+
+static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+    uint32_t *list;
+    uint8_t *ptr;
+    uint16_t status;
+
+    trace_pci_nvme_identify_cmd_set();
+
+    list = g_malloc0(data_len);
+    ptr = (uint8_t *)list;
+    NVME_SET_CSI(*ptr, NVME_CSI_NVM);
+    status = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
+                          DMA_DIRECTION_FROM_DEVICE, req);
+    g_free(list);
+    return status;
 }
 
 static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
@@ -1273,12 +1387,20 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
     switch (le32_to_cpu(c->cns)) {
     case NVME_ID_CNS_NS:
         return nvme_identify_ns(n, req);
+    case NVME_ID_CNS_CS_NS:
+        return nvme_identify_ns_csi(n, req);
     case NVME_ID_CNS_CTRL:
         return nvme_identify_ctrl(n, req);
+    case NVME_ID_CNS_CS_CTRL:
+        return nvme_identify_ctrl_csi(n, req);
     case NVME_ID_CNS_NS_ACTIVE_LIST:
         return nvme_identify_nslist(n, req);
+    case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
+        return nvme_identify_nslist_csi(n, req);
     case NVME_ID_CNS_NS_DESCR_LIST:
         return nvme_identify_ns_descr_list(n, req);
+    case NVME_ID_CNS_IO_COMMAND_SET:
+        return nvme_identify_cmd_set(n, req);
     default:
         trace_pci_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1460,6 +1582,9 @@ defaults:
             result |= NVME_INTVC_NOCOALESCING;
         }
 
+        break;
+    case NVME_COMMAND_SET_PROFILE:
+        result = 0;
         break;
     default:
         result = nvme_feature_default[fid];
@@ -1584,6 +1709,12 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, req);
+    case NVME_COMMAND_SET_PROFILE:
+        if (dw11 & 0x1ff) {
+            trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
+            return NVME_CMD_SET_CMB_REJECTED | NVME_DNR;
+        }
+        break;
     default:
         return NVME_FEAT_NOT_CHANGEABLE | NVME_DNR;
     }
@@ -1845,6 +1976,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
         break;
     case 0x14:  /* CC */
         trace_pci_nvme_mmio_cfg(data & 0xffffffff);
+
+        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
+            if (NVME_CC_EN(n->bar.cc)) {
+                NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
+                               "changing selected command set when enabled");
+            } else {
+                switch (NVME_CC_CSS(data)) {
+                case CSS_NVM_ONLY:
+                    trace_pci_nvme_css_nvm_cset_selected_by_host(data &
+                                                                 0xffffffff);
+                    break;
+                case CSS_CSI:
+                    NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
+                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
+                    break;
+                case CSS_ADMIN_ONLY:
+                    break;
+                default:
+                    NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
+                                   "unknown value in CC.CSS field");
+                }
+            }
+        }
+
         /* Windows first sends data, then sends enable bit */
         if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
             !NVME_CC_SHN(data) && !NVME_CC_SHN(n->bar.cc))
@@ -2255,6 +2410,8 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 
     n->ns_size = bs_size;
 
+    ns->csi = NVME_CSI_NVM;
+    qemu_uuid_generate(&ns->uuid); /* TODO make UUIDs persistent */
     lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
     id_ns->lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
@@ -2419,7 +2576,11 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
     NVME_CAP_SET_CQR(n->bar.cap, 1);
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
-    NVME_CAP_SET_CSS(n->bar.cap, 1);
+    /*
+     * The device now always supports NS Types, but all commands
+     * that support CSI field will only handle NVM Command Set.
+     */
+    NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
     n->bar.vs = NVME_SPEC_VER;
@@ -2429,6 +2590,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
     NvmeCtrl *n = NVME(pci_dev);
+    NvmeNamespace *ns;
     Error *local_err = NULL;
 
     int i;
@@ -2454,8 +2616,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 
     nvme_init_ctrl(n, pci_dev);
 
-    for (i = 0; i < n->num_namespaces; i++) {
-        nvme_init_namespace(n, &n->namespaces[i], &local_err);
+    ns = n->namespaces;
+    for (i = 0; i < n->num_namespaces; i++, ns++) {
+        ns->nsid = i + 1;
+        nvme_init_namespace(n, ns, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
-- 
2.21.0




* [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (4 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-24 12:12   ` Klaus Jensen
  2020-09-23 18:20 ` [PATCH v4 07/14] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

In NVMe, a namespace is active if it exists and is attached to the
controller.

CAP.CSS (together with the I/O Command Set data structure) defines what
command sets are supported by the controller.

CC.CSS (together with Set Profile) can be set to enable a subset of the
available command sets. The namespaces belonging to a disabled command set
will not be able to attach to the controller, and will thus be inactive.

E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
marked as inactive.

The identify namespace, the identify namespace CSI-specific, and the
namespace list commands have two different versions: one that only shows
active namespaces, and one that shows all existing namespaces, regardless
of whether the namespace is attached or not.

Add an attached member to struct NvmeNamespace, and implement the missing CNS
commands.

The added functionality will also simplify the implementation of namespace
management in the future, since namespace management can also attach and
detach namespaces.
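
In short, the attachment rule described above, as implemented by the
nvme_start_ctrl() hunk below, can be sketched as (NVM namespaces only;
names taken from this patch):

    /* An NVM namespace is attached unless the host selected Admin Only. */
    ns->attached = (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
                    NVME_CC_CSS(n->bar.cc) == CSS_CSI);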

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c      | 54 ++++++++++++++++++++++++++++++++++++--------
 hw/block/nvme.h      |  1 +
 include/block/nvme.h | 20 +++++++++-------
 3 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index e0f885498d..6c231b20f9 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1189,7 +1189,8 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
     return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
+                                 bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1207,11 +1208,16 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
     ns = &n->namespaces[nsid - 1];
     assert(nsid == ns->nsid);
 
+    if (only_active && !ns->attached) {
+        return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
+    }
+
     return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns), prp1,
                         prp2, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
+                                     bool only_active)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
     NvmeNamespace *ns;
@@ -1229,6 +1235,10 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
     ns = &n->namespaces[nsid - 1];
     assert(nsid == ns->nsid);
 
+    if (only_active && !ns->attached) {
+        return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
+    }
+
     if (c->csi == NVME_CSI_NVM) {
         return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
     }
@@ -1236,7 +1246,8 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
     return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req,
+                                     bool only_active)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
     static const int data_len = NVME_IDENTIFY_DATA_SIZE;
@@ -1261,7 +1272,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
 
     list = g_malloc0(data_len);
     for (i = 0; i < n->num_namespaces; i++) {
-        if (i < min_nsid) {
+        if (i < min_nsid || (only_active && !n->namespaces[i].attached)) {
             continue;
         }
         list[j++] = cpu_to_le32(i + 1);
@@ -1275,7 +1286,8 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
     return ret;
 }
 
-static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
+                                         bool only_active)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
     static const int data_len = NVME_IDENTIFY_DATA_SIZE;
@@ -1294,7 +1306,8 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
 
     list = g_malloc0(data_len);
     for (i = 0; i < n->num_namespaces; i++) {
-        if (i < min_nsid) {
+        if (i < min_nsid || c->csi != n->namespaces[i].csi ||
+            (only_active && !n->namespaces[i].attached)) {
             continue;
         }
         list[j++] = cpu_to_le32(i + 1);
@@ -1386,17 +1399,25 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
 
     switch (le32_to_cpu(c->cns)) {
     case NVME_ID_CNS_NS:
-        return nvme_identify_ns(n, req);
+        return nvme_identify_ns(n, req, true);
     case NVME_ID_CNS_CS_NS:
-        return nvme_identify_ns_csi(n, req);
+        return nvme_identify_ns_csi(n, req, true);
+    case NVME_ID_CNS_NS_PRESENT:
+        return nvme_identify_ns(n, req, false);
+    case NVME_ID_CNS_CS_NS_PRESENT:
+        return nvme_identify_ns_csi(n, req, false);
     case NVME_ID_CNS_CTRL:
         return nvme_identify_ctrl(n, req);
     case NVME_ID_CNS_CS_CTRL:
         return nvme_identify_ctrl_csi(n, req);
     case NVME_ID_CNS_NS_ACTIVE_LIST:
-        return nvme_identify_nslist(n, req);
+        return nvme_identify_nslist(n, req, true);
     case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
-        return nvme_identify_nslist_csi(n, req);
+        return nvme_identify_nslist_csi(n, req, true);
+    case NVME_ID_CNS_NS_PRESENT_LIST:
+        return nvme_identify_nslist(n, req, false);
+    case NVME_ID_CNS_CS_NS_PRESENT_LIST:
+        return nvme_identify_nslist_csi(n, req, false);
     case NVME_ID_CNS_NS_DESCR_LIST:
         return nvme_identify_ns_descr_list(n, req);
     case NVME_ID_CNS_IO_COMMAND_SET:
@@ -1838,6 +1859,7 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 {
     uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
     uint32_t page_size = 1 << page_bits;
+    int i;
 
     if (unlikely(n->cq[0])) {
         trace_pci_nvme_err_startfail_cq();
@@ -1924,6 +1946,18 @@ static int nvme_start_ctrl(NvmeCtrl *n)
     nvme_init_sq(&n->admin_sq, n, n->bar.asq, 0, 0,
         NVME_AQA_ASQS(n->bar.aqa) + 1);
 
+    for (i = 0; i < n->num_namespaces; i++) {
+        n->namespaces[i].attached = false;
+        switch (n->namespaces[i].csi) {
+        case NVME_CSI_NVM:
+            if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
+                NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
+                n->namespaces[i].attached = true;
+            }
+            break;
+        }
+    }
+
     nvme_set_timestamp(n, 0ULL);
 
     QTAILQ_INIT(&n->aer_queue);
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index c1deac9667..71e4344471 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -66,6 +66,7 @@ typedef struct NvmeNamespace {
     NvmeIdNs        id_ns;
     uint32_t        nsid;
     uint8_t         csi;
+    bool            attached;
     QemuUUID        uuid;
 } NvmeNamespace;
 
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 74cc11782e..f8356394f5 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -804,14 +804,18 @@ typedef struct QEMU_PACKED NvmePSD {
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
 enum NvmeIdCns {
-    NVME_ID_CNS_NS                = 0x00,
-    NVME_ID_CNS_CTRL              = 0x01,
-    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x02,
-    NVME_ID_CNS_NS_DESCR_LIST     = 0x03,
-    NVME_ID_CNS_CS_NS             = 0x05,
-    NVME_ID_CNS_CS_CTRL           = 0x06,
-    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
-    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
+    NVME_ID_CNS_NS                    = 0x00,
+    NVME_ID_CNS_CTRL                  = 0x01,
+    NVME_ID_CNS_NS_ACTIVE_LIST        = 0x02,
+    NVME_ID_CNS_NS_DESCR_LIST         = 0x03,
+    NVME_ID_CNS_CS_NS                 = 0x05,
+    NVME_ID_CNS_CS_CTRL               = 0x06,
+    NVME_ID_CNS_CS_NS_ACTIVE_LIST     = 0x07,
+    NVME_ID_CNS_NS_PRESENT_LIST       = 0x10,
+    NVME_ID_CNS_NS_PRESENT            = 0x11,
+    NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
+    NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
+    NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
-- 
2.21.0




* [PATCH v4 07/14] hw/block/nvme: Make Zoned NS Command Set definitions
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (5 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 08/14] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Define the values and structures that are needed to support the Zoned
Namespace Command Set (NVMe TP 4053) in the PCI NVMe controller
emulator.

All new protocol definitions are located in include/block/nvme.h
and everything added that is specific to this implementation is kept
in hw/block/nvme.h.

In order to improve scalability, all open, closed and full zones
are organized in separate linked lists. Consequently, almost all
zone operations do not require scanning the entire zone array
(which can potentially be quite large); it is only necessary to
enumerate one or more zone lists. Zone lists are designed to be
position-independent as they can be persisted to the backing file
as a part of zone metadata. The NvmeZoneList struct defined in this
patch serves as the head of every zone list.

The NvmeZone structure encapsulates the NvmeZoneDescr defined in the
Zoned Command Set specification and adds a few more fields that are
internal to this implementation.
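
As a usage sketch (illustration only, built from the helpers defined
below), enumerating a single zone list instead of scanning the whole
zone array looks like this:

    NvmeZone *zone;

    /* Walk the implicitly open zones of a namespace. */
    for (zone = nvme_peek_zone_head(ns, ns->imp_open_zones);
         zone;
         zone = nvme_next_zone_in_list(ns, zone, ns->imp_open_zones)) {
        /* inspect zone->d (descriptor fields) or zone->w_ptr here */
    }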

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.h      | 124 +++++++++++++++++++++++++++++++++++++++++++
 include/block/nvme.h | 107 +++++++++++++++++++++++++++++++++++++
 2 files changed, 231 insertions(+)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 71e4344471..d9f307f0ed 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -3,6 +3,9 @@
 
 #include "block/nvme.h"
 
+#define NVME_DEFAULT_ZONE_SIZE   128 /* MiB */
+#define NVME_DEFAULT_MAX_ZA_SIZE 128 /* KiB */
+
 typedef struct NvmeParams {
     char     *serial;
     uint32_t num_queues; /* deprecated since 5.1 */
@@ -12,6 +15,13 @@ typedef struct NvmeParams {
     uint8_t  aerl;
     uint32_t aer_max_queued;
     uint8_t  mdts;
+
+    bool        zoned;
+    bool        cross_zone_read;
+    uint8_t     fill_pattern;
+    uint32_t    zasl_kb;
+    uint64_t    zone_size_mb;
+    uint64_t    zone_capacity_mb;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -24,6 +34,7 @@ typedef struct NvmeRequest {
     struct NvmeNamespace    *ns;
     BlockAIOCB              *aiocb;
     uint16_t                status;
+    int64_t                 fill_ofs;
     NvmeCqe                 cqe;
     NvmeCmd                 cmd;
     BlockAcctCookie         acct;
@@ -62,12 +73,36 @@ typedef struct NvmeCQueue {
     QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
 
+typedef struct NvmeZone {
+    NvmeZoneDescr   d;
+    uint64_t        w_ptr;
+    uint32_t        next;
+    uint32_t        prev;
+    uint8_t         rsvd80[8];
+} NvmeZone;
+
+#define NVME_ZONE_LIST_NIL    UINT_MAX
+
+typedef struct NvmeZoneList {
+    uint32_t        head;
+    uint32_t        tail;
+    uint32_t        size;
+    uint8_t         rsvd12[4];
+} NvmeZoneList;
+
 typedef struct NvmeNamespace {
     NvmeIdNs        id_ns;
     uint32_t        nsid;
     uint8_t         csi;
     bool            attached;
     QemuUUID        uuid;
+
+    NvmeIdNsZoned   *id_ns_zoned;
+    NvmeZone        *zone_array;
+    NvmeZoneList    *exp_open_zones;
+    NvmeZoneList    *imp_open_zones;
+    NvmeZoneList    *closed_zones;
+    NvmeZoneList    *full_zones;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
@@ -126,6 +161,15 @@ typedef struct NvmeCtrl {
     QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
     int         aer_queued;
 
+    int             zone_file_fd;
+    uint32_t        num_zones;
+    uint64_t        zone_size;
+    uint64_t        zone_capacity;
+    uint64_t        zone_array_size;
+    uint32_t        zone_size_log2;
+    uint32_t        zasl_bs;
+    uint8_t         zasl;
+
     NvmeNamespace   *namespaces;
     NvmeSQueue      **sq;
     NvmeCQueue      **cq;
@@ -141,4 +185,84 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
     return n->ns_size >> nvme_ns_lbads(ns);
 }
 
+static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
+{
+    return zone->d.zs >> 4;
+}
+
+static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState state)
+{
+    zone->d.zs = state << 4;
+}
+
+static inline uint64_t nvme_zone_rd_boundary(NvmeCtrl *n, NvmeZone *zone)
+{
+    return zone->d.zslba + n->zone_size;
+}
+
+static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
+{
+    return zone->d.zslba + zone->d.zcap;
+}
+
+static inline bool nvme_wp_is_valid(NvmeZone *zone)
+{
+    uint8_t st = nvme_get_zone_state(zone);
+
+    return st != NVME_ZONE_STATE_FULL &&
+           st != NVME_ZONE_STATE_READ_ONLY &&
+           st != NVME_ZONE_STATE_OFFLINE;
+}
+
+/*
+ * Initialize a zone list head.
+ */
+static inline void nvme_init_zone_list(NvmeZoneList *zl)
+{
+    zl->head = NVME_ZONE_LIST_NIL;
+    zl->tail = NVME_ZONE_LIST_NIL;
+    zl->size = 0;
+}
+
+/*
+ * Return the number of entries contained in a zone list.
+ */
+static inline uint32_t nvme_zone_list_size(NvmeZoneList *zl)
+{
+    return zl->size;
+}
+
+/*
+ * Check if the zone is not currently included in any zone list.
+ */
+static inline bool nvme_zone_not_in_list(NvmeZone *zone)
+{
+    return (bool)(zone->prev == 0 && zone->next == 0);
+}
+
+/*
+ * Return the zone at the head of zone list or NULL if the list is empty.
+ */
+static inline NvmeZone *nvme_peek_zone_head(NvmeNamespace *ns, NvmeZoneList *zl)
+{
+    if (zl->head == NVME_ZONE_LIST_NIL) {
+        return NULL;
+    }
+    return &ns->zone_array[zl->head];
+}
+
+/*
+ * Return the next zone in the list.
+ */
+static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
+                                               NvmeZoneList *zl)
+{
+    assert(!nvme_zone_not_in_list(z));
+
+    if (z->next == NVME_ZONE_LIST_NIL) {
+        return NULL;
+    }
+    return &ns->zone_array[z->next];
+}
+
 #endif /* HW_NVME_H */
diff --git a/include/block/nvme.h b/include/block/nvme.h
index f8356394f5..ced66cf4eb 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -488,6 +488,9 @@ enum NvmeIoCommands {
     NVME_CMD_COMPARE            = 0x05,
     NVME_CMD_WRITE_ZEROES       = 0x08,
     NVME_CMD_DSM                = 0x09,
+    NVME_CMD_ZONE_MGMT_SEND     = 0x79,
+    NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
+    NVME_CMD_ZONE_APPEND        = 0x7d,
 };
 
 typedef struct QEMU_PACKED NvmeDeleteQ {
@@ -677,6 +680,7 @@ enum NvmeStatusCodes {
     NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
     NVME_CMD_SET_CMB_REJECTED   = 0x002b,
+    NVME_INVALID_CMD_SET        = 0x002c,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
@@ -701,6 +705,14 @@ enum NvmeStatusCodes {
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
+    NVME_ZONE_BOUNDARY_ERROR    = 0x01b8,
+    NVME_ZONE_FULL              = 0x01b9,
+    NVME_ZONE_READ_ONLY         = 0x01ba,
+    NVME_ZONE_OFFLINE           = 0x01bb,
+    NVME_ZONE_INVALID_WRITE     = 0x01bc,
+    NVME_ZONE_TOO_MANY_ACTIVE   = 0x01bd,
+    NVME_ZONE_TOO_MANY_OPEN     = 0x01be,
+    NVME_ZONE_INVAL_TRANSITION  = 0x01bf,
     NVME_WRITE_FAULT            = 0x0280,
     NVME_UNRECOVERED_READ       = 0x0281,
     NVME_E2E_GUARD_ERROR        = 0x0282,
@@ -885,6 +897,11 @@ typedef struct QEMU_PACKED NvmeIdCtrl {
     uint8_t     vs[1024];
 } NvmeIdCtrl;
 
+typedef struct NvmeIdCtrlZoned {
+    uint8_t     zasl;
+    uint8_t     rsvd1[4095];
+} NvmeIdCtrlZoned;
+
 enum NvmeIdCtrlOacs {
     NVME_OACS_SECURITY  = 1 << 0,
     NVME_OACS_FORMAT    = 1 << 1,
@@ -1009,6 +1026,12 @@ typedef struct QEMU_PACKED NvmeLBAF {
     uint8_t     rp;
 } NvmeLBAF;
 
+typedef struct QEMU_PACKED NvmeLBAFE {
+    uint64_t    zsze;
+    uint8_t     zdes;
+    uint8_t     rsvd9[7];
+} NvmeLBAFE;
+
 #define NVME_NSID_BROADCAST 0xffffffff
 
 typedef struct QEMU_PACKED NvmeIdNs {
@@ -1063,10 +1086,24 @@ enum NvmeNsIdentifierType {
 
 enum NvmeCsi {
     NVME_CSI_NVM                = 0x00,
+    NVME_CSI_ZONED              = 0x02,
 };
 
 #define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
 
+typedef struct QEMU_PACKED NvmeIdNsZoned {
+    uint16_t    zoc;
+    uint16_t    ozcs;
+    uint32_t    mar;
+    uint32_t    mor;
+    uint32_t    rrl;
+    uint32_t    frl;
+    uint8_t     rsvd20[2796];
+    NvmeLBAFE   lbafe[16];
+    uint8_t     rsvd3072[768];
+    uint8_t     vs[256];
+} NvmeIdNsZoned;
+
 /*Deallocate Logical Block Features*/
 #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
 #define NVME_ID_NS_DLFEAT_WRITE_ZEROES(dlfeat)    ((dlfeat) & 0x08)
@@ -1098,6 +1135,71 @@ enum NvmeIdNsDps {
     DPS_FIRST_EIGHT = 8,
 };
 
+enum NvmeZoneAttr {
+    NVME_ZA_FINISHED_BY_CTLR         = 1 << 0,
+    NVME_ZA_FINISH_RECOMMENDED       = 1 << 1,
+    NVME_ZA_RESET_RECOMMENDED        = 1 << 2,
+    NVME_ZA_ZD_EXT_VALID             = 1 << 7,
+};
+
+typedef struct QEMU_PACKED NvmeZoneReportHeader {
+    uint64_t    nr_zones;
+    uint8_t     rsvd[56];
+} NvmeZoneReportHeader;
+
+enum NvmeZoneReceiveAction {
+    NVME_ZONE_REPORT                 = 0,
+    NVME_ZONE_REPORT_EXTENDED        = 1,
+};
+
+enum NvmeZoneReportType {
+    NVME_ZONE_REPORT_ALL             = 0,
+    NVME_ZONE_REPORT_EMPTY           = 1,
+    NVME_ZONE_REPORT_IMPLICITLY_OPEN = 2,
+    NVME_ZONE_REPORT_EXPLICITLY_OPEN = 3,
+    NVME_ZONE_REPORT_CLOSED          = 4,
+    NVME_ZONE_REPORT_FULL            = 5,
+    NVME_ZONE_REPORT_READ_ONLY       = 6,
+    NVME_ZONE_REPORT_OFFLINE         = 7,
+};
+
+enum NvmeZoneType {
+    NVME_ZONE_TYPE_RESERVED          = 0x00,
+    NVME_ZONE_TYPE_SEQ_WRITE         = 0x02,
+};
+
+enum NvmeZoneSendAction {
+    NVME_ZONE_ACTION_RSD             = 0x00,
+    NVME_ZONE_ACTION_CLOSE           = 0x01,
+    NVME_ZONE_ACTION_FINISH          = 0x02,
+    NVME_ZONE_ACTION_OPEN            = 0x03,
+    NVME_ZONE_ACTION_RESET           = 0x04,
+    NVME_ZONE_ACTION_OFFLINE         = 0x05,
+    NVME_ZONE_ACTION_SET_ZD_EXT      = 0x10,
+};
+
+typedef struct QEMU_PACKED NvmeZoneDescr {
+    uint8_t     zt;
+    uint8_t     zs;
+    uint8_t     za;
+    uint8_t     rsvd3[5];
+    uint64_t    zcap;
+    uint64_t    zslba;
+    uint64_t    wp;
+    uint8_t     rsvd32[32];
+} NvmeZoneDescr;
+
+enum NvmeZoneState {
+    NVME_ZONE_STATE_RESERVED         = 0x00,
+    NVME_ZONE_STATE_EMPTY            = 0x01,
+    NVME_ZONE_STATE_IMPLICITLY_OPEN  = 0x02,
+    NVME_ZONE_STATE_EXPLICITLY_OPEN  = 0x03,
+    NVME_ZONE_STATE_CLOSED           = 0x04,
+    NVME_ZONE_STATE_READ_ONLY        = 0x0D,
+    NVME_ZONE_STATE_FULL             = 0x0E,
+    NVME_ZONE_STATE_OFFLINE          = 0x0F,
+};
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
@@ -1117,9 +1219,14 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrlZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAF) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAFE) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
 }
 #endif
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 08/14] hw/block/nvme: Define Zoned NS Command Set trace events
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (6 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 07/14] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

The Zoned Namespace Command Set / Namespace Types implementation that
is being introduced in this series adds a good number of trace events.
Combine all tracepoint definitions into a separate patch to make
reviewing more convenient.
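
Each definition expands to a trace_<name>() helper that the handlers
added later in this series call directly. For example, the Zone
Management Send handler in the next patch invokes the zone open
tracepoint as follows:

    trace_pci_nvme_open_zone(slba, zone_idx, all);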

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/trace-events | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 2414dcbc79..53c7a2fd1f 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -90,6 +90,17 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
 pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
 pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
+pci_nvme_open_zone(uint64_t slba, uint32_t zone_idx, int all) "open zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_close_zone(uint64_t slba, uint32_t zone_idx, int all) "close zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_finish_zone(uint64_t slba, uint32_t zone_idx, int all) "finish zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_reset_zone(uint64_t slba, uint32_t zone_idx, int all) "reset zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_offline_zone(uint64_t slba, uint32_t zone_idx, int all) "offline zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_set_descriptor_extension(uint64_t slba, uint32_t zone_idx) "set zone descriptor extension, slba=%"PRIu64", idx=%"PRIu32""
+pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_idx=%"PRIu32""
+pci_nvme_power_on_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
+pci_nvme_power_on_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
+pci_nvme_power_on_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Full state"
+pci_nvme_mapped_zone_file(char *zfile_name, int ret) "mapped zone file %s, error %d"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
@@ -102,9 +113,23 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not w
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_unaligned_zone_cmd(uint8_t action, uint64_t slba, uint64_t zslba) "unaligned zone op 0x%"PRIx8", got slba=%"PRIu64", zslba=%"PRIu64""
+pci_nvme_err_invalid_zone_state_transition(uint8_t state, uint8_t action, uint64_t slba, uint8_t attrs) "0x%"PRIx8"->0x%"PRIx8", slba=%"PRIu64", attrs=0x%"PRIx8""
+pci_nvme_err_write_not_at_wp(uint64_t slba, uint64_t zone, uint64_t wp) "writing at slba=%"PRIu64", zone=%"PRIu64", but wp=%"PRIu64""
+pci_nvme_err_append_not_at_start(uint64_t slba, uint64_t zone) "appending at slba=%"PRIu64", but zone=%"PRIu64""
+pci_nvme_err_zone_write_not_ok(uint64_t slba, uint32_t nlb, uint16_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx16""
+pci_nvme_err_zone_read_not_ok(uint64_t slba, uint32_t nlb, uint16_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx16""
+pci_nvme_err_append_too_large(uint64_t slba, uint32_t nlb, uint8_t zasl) "slba=%"PRIu64", nlb=%"PRIu32", zasl=%"PRIu8""
+pci_nvme_err_insuff_active_res(uint32_t max_active) "max_active=%"PRIu32" zone limit exceeded"
+pci_nvme_err_insuff_open_res(uint32_t max_open) "max_open=%"PRIu32" zone limit exceeded"
+pci_nvme_err_zone_file_invalid(int error) "validation error=%"PRIi32""
+pci_nvme_err_zd_extension_map_error(uint32_t zone_idx) "can't map descriptor extension for zone_idx=%"PRIu32""
+pci_nvme_err_invalid_changed_zone_list_offset(uint64_t ofs) "changed zone list log offset must be 0, got %"PRIu64""
+pci_nvme_err_invalid_changed_zone_list_len(uint32_t len) "changed zone list log size is 4096, got %"PRIu32""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
 pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
+pci_nvme_err_only_zoned_cmd_set_avail(void) "setting 000b CC.CSS, but only ZONED+NVM command set is enabled"
 pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
@@ -138,6 +163,7 @@ pci_nvme_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_
 pci_nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
 pci_nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
 pci_nvme_err_startfail(void) "setting controller enable bit failed"
+pci_nvme_err_invalid_mgmt_action(uint8_t action) "action=0x%"PRIx8""
 
 # Traces for undefined behavior
 pci_nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (7 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 08/14] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-25 18:24   ` Klaus Jensen
  2020-09-23 18:20 ` [PATCH v4 10/14] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

The emulation code has been changed to advertise the NVM Command Set
when the "zoned" device property is not set (the default) and the
Zoned Namespace Command Set otherwise.

Handlers are added for the three new NVMe commands introduced in the
Zoned Namespace Command Set specification: Zone Management Receive,
Zone Management Send and Zone Append.

Device initialization code has been extended to create a proper
configuration for zoned operation using device properties.

The Read/Write command handler is modified to only allow writes at the
write pointer if the namespace is zoned. For the Zone Append command,
writes implicitly happen at the write pointer and the starting write
pointer value is returned as the result of the command. The Write
Zeroes handler is modified to add zoned checks that are identical to
those done as part of the Write flow.

The code to support Zone Descriptor Extensions is not included in
this commit and ZDES 0 is always reported. A later commit in this
series will add ZDE support.

This commit doesn't yet include checks for active and open zone
limits. It is assumed that there are no limits on either active or
open zones.
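
In outline, the zoned write path keeps two write pointer values per
zone: writes are submitted at the volatile zone->w_ptr, which is
advanced right after the backend i/o is issued, while the reported
zone->d.wp is only updated upon successful completion. A simplified
sketch of the Zone Append flow (error checks omitted, see nvme_rw()
and nvme_advance_zone_wp() below):

    slba = zone->w_ptr;              /* append writes at the WP */
    req->cqe.result64 = zone->w_ptr; /* starting LBA returned to host */
    zone->w_ptr += nlb;              /* advance the volatile WP */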

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 block/nvme.c         |    2 +-
 hw/block/nvme.c      | 1057 ++++++++++++++++++++++++++++++++++++++++--
 include/block/nvme.h |    6 +-
 3 files changed, 1026 insertions(+), 39 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 05485fdd11..7a513c9a17 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -333,7 +333,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
 {
     uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
     if (status) {
-        trace_nvme_error(le32_to_cpu(c->result),
+        trace_nvme_error(le32_to_cpu(c->result32),
                          le16_to_cpu(c->sq_head),
                          le16_to_cpu(c->sq_id),
                          le16_to_cpu(c->cid),
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 6c231b20f9..287984dd37 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -53,6 +53,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qemu/error-report.h"
+#include "crypto/random.h"
 #include "hw/block/block.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
@@ -125,6 +126,96 @@ static uint16_t nvme_sqid(NvmeRequest *req)
     return le16_to_cpu(req->sq->sqid);
 }
 
+/*
+ * Add a zone to the tail of a zone list.
+ */
+static void nvme_add_zone_tail(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
+                               NvmeZone *zone)
+{
+    uint32_t idx = (uint32_t)(zone - ns->zone_array);
+
+    assert(nvme_zone_not_in_list(zone));
+
+    if (!zl->size) {
+        zl->head = zl->tail = idx;
+        zone->next = zone->prev = NVME_ZONE_LIST_NIL;
+    } else {
+        ns->zone_array[zl->tail].next = idx;
+        zone->prev = zl->tail;
+        zone->next = NVME_ZONE_LIST_NIL;
+        zl->tail = idx;
+    }
+    zl->size++;
+}
+
+/*
+ * Remove a zone from a zone list. The zone must be linked in the list.
+ */
+static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
+                             NvmeZone *zone)
+{
+    uint32_t idx = (uint32_t)(zone - ns->zone_array);
+
+    assert(!nvme_zone_not_in_list(zone));
+
+    --zl->size;
+    if (zl->size == 0) {
+        zl->head = NVME_ZONE_LIST_NIL;
+        zl->tail = NVME_ZONE_LIST_NIL;
+    } else if (idx == zl->head) {
+        zl->head = zone->next;
+        ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+    } else if (idx == zl->tail) {
+        zl->tail = zone->prev;
+        ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
+    } else {
+        ns->zone_array[zone->next].prev = zone->prev;
+        ns->zone_array[zone->prev].next = zone->next;
+    }
+
+    zone->prev = zone->next = 0;
+}
+
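+/*
+ * Move a zone to the list that corresponds to its new state: unlink
+ * the zone from its current state list, if any, record the new state
+ * in the zone descriptor and append the zone to the matching list.
+ */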
+static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
+                                   NvmeZone *zone, uint8_t state)
+{
+    if (!nvme_zone_not_in_list(zone)) {
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_remove_zone(n, ns, ns->exp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            nvme_remove_zone(n, ns, ns->imp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_remove_zone(n, ns, ns->closed_zones, zone);
+            break;
+        case NVME_ZONE_STATE_FULL:
+            nvme_remove_zone(n, ns, ns->full_zones, zone);
+        }
+    }
+
+    nvme_set_zone_state(zone, state);
+
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        nvme_add_zone_tail(n, ns, ns->exp_open_zones, zone);
+        break;
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_add_zone_tail(n, ns, ns->imp_open_zones, zone);
+        break;
+    case NVME_ZONE_STATE_CLOSED:
+        nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
+        break;
+    case NVME_ZONE_STATE_FULL:
+        nvme_add_zone_tail(n, ns, ns->full_zones, zone);
+        /* fall through */
+    case NVME_ZONE_STATE_READ_ONLY:
+        break;
+    default:
+        zone->d.za = 0;
+    }
+}
+
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
     hwaddr low = n->ctrl_mem.addr;
@@ -524,7 +615,7 @@ static void nvme_process_aers(void *opaque)
 
         req = n->aer_reqs[n->outstanding_aers];
 
-        result = (NvmeAerResult *) &req->cqe.result;
+        result = (NvmeAerResult *) &req->cqe.result32;
         result->event_type = event->result.event_type;
         result->event_info = event->result.event_info;
         result->log_page = event->result.log_page;
@@ -595,6 +686,195 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
     return NVME_SUCCESS;
 }
 
+static void nvme_fill_data(QEMUSGList *qsg, QEMUIOVector *iov,
+                           uint64_t offset, uint8_t pattern)
+{
+    ScatterGatherEntry *entry;
+    uint32_t len, ent_len;
+
+    if (qsg->nsg > 0) {
+        entry = qsg->sg;
+        for (len = qsg->size; len > 0; len -= ent_len) {
+            ent_len = MIN(len, entry->len);
+            if (offset > ent_len) {
+                offset -= ent_len;
+            } else if (offset != 0) {
+                dma_memory_set(qsg->as, entry->base + offset,
+                               pattern, ent_len - offset);
+                offset = 0;
+            } else {
+                dma_memory_set(qsg->as, entry->base, pattern, ent_len);
+            }
+            entry++;
+        }
+    } else if (iov->iov) {
+        qemu_iovec_memset(iov, offset, pattern,
+                          iov_size(iov->iov, iov->niov) - offset);
+    }
+}
+
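+/*
+ * Check that the zone state permits writing and that the request does
+ * not cross the zone's writable boundary (zslba + zcap).
+ */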
+static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
+                                      uint32_t nlb)
+{
+    uint16_t status;
+
+    if (unlikely((slba + nlb) > nvme_zone_wr_boundary(zone))) {
+        return NVME_ZONE_BOUNDARY_ERROR;
+    }
+
+    switch (nvme_get_zone_state(zone)) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+        status = NVME_SUCCESS;
+        break;
+    case NVME_ZONE_STATE_FULL:
+        status = NVME_ZONE_FULL;
+        break;
+    case NVME_ZONE_STATE_OFFLINE:
+        status = NVME_ZONE_OFFLINE;
+        break;
+    case NVME_ZONE_STATE_READ_ONLY:
+        status = NVME_ZONE_READ_ONLY;
+        break;
+    default:
+        assert(false);
+    }
+    return status;
+}
+
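+/*
+ * Check a read that may span several zones. Crossing a zone boundary
+ * is only permitted when zone_x_ok ("cross_zone_read") is set, and
+ * every zone covered by the request must be readable.
+ */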
+static uint16_t nvme_check_zone_read(NvmeCtrl *n, NvmeZone *zone, uint64_t slba,
+                                     uint32_t nlb, bool zone_x_ok)
+{
+    uint64_t lba = slba, count;
+    uint16_t status;
+    uint8_t zs;
+
+    do {
+        if (!zone_x_ok && (lba + nlb > nvme_zone_rd_boundary(n, zone))) {
+            return NVME_ZONE_BOUNDARY_ERROR | NVME_DNR;
+        }
+
+        zs = nvme_get_zone_state(zone);
+        switch (zs) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        case NVME_ZONE_STATE_FULL:
+        case NVME_ZONE_STATE_CLOSED:
+        case NVME_ZONE_STATE_READ_ONLY:
+            status = NVME_SUCCESS;
+            break;
+        case NVME_ZONE_STATE_OFFLINE:
+            status = NVME_ZONE_OFFLINE | NVME_DNR;
+            break;
+        default:
+            assert(false);
+        }
+        if (status != NVME_SUCCESS) {
+            break;
+        }
+
+        if (lba + nlb > nvme_zone_rd_boundary(n, zone)) {
+            count = nvme_zone_rd_boundary(n, zone) - lba;
+        } else {
+            count = nlb;
+        }
+
+        lba += count;
+        nlb -= count;
+        zone++;
+    } while (nlb);
+
+    return status;
+}
+
+static inline uint32_t nvme_zone_idx(NvmeCtrl *n, uint64_t slba)
+{
+    return n->zone_size_log2 > 0 ? slba >> n->zone_size_log2 :
+                                   slba / n->zone_size;
+}
+
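+/*
+ * Complete a zoned write. Writes are submitted at the volatile write
+ * pointer, zone->w_ptr, while the reported write pointer, zone->d.wp,
+ * is only advanced here upon successful completion. On failure, the
+ * volatile write pointer is rolled back so that any queued writes
+ * behind the failed one are failed as well.
+ */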
+static bool nvme_finalize_zoned_write(NvmeCtrl *n, NvmeRequest *req,
+                                      bool failed)
+{
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    NvmeNamespace *ns;
+    NvmeZone *zone;
+    uint64_t slba, start_wp = req->cqe.result64;
+    uint32_t nlb, zone_idx;
+    uint8_t zs;
+
+    if (rw->opcode != NVME_CMD_WRITE &&
+        rw->opcode != NVME_CMD_ZONE_APPEND &&
+        rw->opcode != NVME_CMD_WRITE_ZEROES) {
+        return false;
+    }
+
+    slba = le64_to_cpu(rw->slba);
+    nlb = le16_to_cpu(rw->nlb) + 1;
+    zone_idx = nvme_zone_idx(n, slba);
+    assert(zone_idx < n->num_zones);
+    ns = req->ns;
+    zone = &ns->zone_array[zone_idx];
+
+    if (!failed && zone->w_ptr < start_wp + nlb) {
+        /*
+         * A preceding queued write to the zone has failed and the
+         * volatile WP has been rolled back, so this write is no longer
+         * at the WP. Fail it too.
+         */
+        failed = true;
+    }
+
+    if (failed) {
+        if (zone->w_ptr > start_wp) {
+            zone->w_ptr = start_wp;
+        }
+        req->cqe.result64 = 0;
+    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
+        zs = nvme_get_zone_state(zone);
+        switch (zs) {
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        case NVME_ZONE_STATE_CLOSED:
+        case NVME_ZONE_STATE_EMPTY:
+            nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
+            /* fall through */
+        case NVME_ZONE_STATE_FULL:
+            break;
+        default:
+            assert(false);
+        }
+        zone->d.wp = zone->w_ptr;
+    } else {
+        zone->d.wp += nlb;
+    }
+
+    return failed;
+}
+
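+/*
+ * Advance the volatile write pointer right after submitting the
+ * backend i/o and return its previous value. The caller stores that
+ * value in cqe.result64, where Zone Append reports the starting LBA
+ * and nvme_finalize_zoned_write() finds it upon completion.
+ */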
+static uint64_t nvme_advance_zone_wp(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint32_t nlb)
+{
+    uint64_t result = zone->w_ptr;
+    uint8_t zs;
+
+    zone->w_ptr += nlb;
+
+    if (zone->w_ptr < nvme_zone_wr_boundary(zone)) {
+        zs = nvme_get_zone_state(zone);
+        switch (zs) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_assign_zone_state(n, ns, zone,
+                                   NVME_ZONE_STATE_IMPLICITLY_OPEN);
+        }
+    }
+
+    return result;
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
@@ -605,9 +885,24 @@ static void nvme_rw_cb(void *opaque, int ret)
     trace_pci_nvme_rw_cb(nvme_cid(req));
 
     if (!ret) {
-        block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
-        req->status = NVME_SUCCESS;
+        if (n->params.zoned) {
+            if (nvme_finalize_zoned_write(n, req, false)) {
+                ret = EIO;
+                block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
+                req->status = NVME_ZONE_INVALID_WRITE;
+            } else if (req->fill_ofs >= 0) {
+                nvme_fill_data(&req->qsg, &req->iov, req->fill_ofs,
+                               n->params.fill_pattern);
+            }
+        }
+        if (!ret) {
+            block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
+            req->status = NVME_SUCCESS;
+        }
     } else {
+        if (n->params.zoned) {
+            nvme_finalize_zoned_write(n, req, true);
+        }
         block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
         req->status = NVME_INTERNAL_DEV_ERROR;
     }
@@ -628,12 +923,14 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
+    NvmeZone *zone = NULL;
     const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
     const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
     uint64_t slba = le64_to_cpu(rw->slba);
     uint32_t nlb  = le16_to_cpu(rw->nlb) + 1;
     uint64_t offset = slba << data_shift;
     uint32_t count = nlb << data_shift;
+    uint32_t zone_idx;
     uint16_t status;
 
     trace_pci_nvme_write_zeroes(nvme_cid(req), slba, nlb);
@@ -644,25 +941,51 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
         return status;
     }
 
+    if (n->params.zoned) {
+        zone_idx = nvme_zone_idx(n, slba);
+        assert(zone_idx < n->num_zones);
+        zone = &ns->zone_array[zone_idx];
+
+        status = nvme_check_zone_write(zone, slba, nlb);
+        if (status != NVME_SUCCESS) {
+            trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
+            return status | NVME_DNR;
+        }
+
+        assert(nvme_wp_is_valid(zone));
+        if (unlikely(slba != zone->w_ptr)) {
+            trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
+                                               zone->w_ptr);
+            return NVME_ZONE_INVALID_WRITE | NVME_DNR;
+        }
+    }
+
     block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
                      BLOCK_ACCT_WRITE);
     req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
                                         BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
+
+    if (n->params.zoned) {
+        req->cqe.result64 = nvme_advance_zone_wp(n, ns, zone, nlb);
+    }
+
     return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
+    NvmeZone *zone = NULL;
     uint32_t nlb  = le32_to_cpu(rw->nlb) + 1;
     uint64_t slba = le64_to_cpu(rw->slba);
 
     uint8_t lba_index  = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
     uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
     uint64_t data_size = (uint64_t)nlb << data_shift;
-    uint64_t data_offset = slba << data_shift;
-    int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
+    uint64_t data_offset;
+    uint32_t zone_idx = 0;
+    bool is_write = rw->opcode == NVME_CMD_WRITE || append;
     enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
     uint16_t status;
 
@@ -682,11 +1005,77 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
         return status;
     }
 
+    if (n->params.zoned) {
+        zone_idx = nvme_zone_idx(n, slba);
+        assert(zone_idx < n->num_zones);
+        zone = &ns->zone_array[zone_idx];
+
+        if (is_write) {
+            status = nvme_check_zone_write(zone, slba, nlb);
+            if (status != NVME_SUCCESS) {
+                trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
+                return status | NVME_DNR;
+            }
+
+            assert(nvme_wp_is_valid(zone));
+            if (append) {
+                if (unlikely(slba != zone->d.zslba)) {
+                    trace_pci_nvme_err_append_not_at_start(slba, zone->d.zslba);
+                    return NVME_ZONE_INVALID_WRITE | NVME_DNR;
+                }
+                if (data_size > (n->page_size << n->zasl)) {
+                    trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
+                    return NVME_INVALID_FIELD | NVME_DNR;
+                }
+                slba = zone->w_ptr;
+            } else if (unlikely(slba != zone->w_ptr)) {
+                trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
+                                                   zone->w_ptr);
+                return NVME_ZONE_INVALID_WRITE | NVME_DNR;
+            }
+            req->fill_ofs = -1LL;
+        } else {
+            status = nvme_check_zone_read(n, zone, slba, nlb,
+                                          n->params.cross_zone_read);
+            if (status != NVME_SUCCESS) {
+                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
+                return status | NVME_DNR;
+            }
+
+            if (slba + nlb > zone->w_ptr) {
+                /*
+                 * All or some data is read above the WP. Need to
+                 * fill out the buffer area that has no backing data
+                 * with a predefined data pattern (zeros by default)
+                 */
+                if (slba >= zone->w_ptr) {
+                    req->fill_ofs = 0;
+                } else {
+                    req->fill_ofs = ((zone->w_ptr - slba) << data_shift);
+                }
+            } else {
+                req->fill_ofs = -1LL;
+            }
+        }
+    } else if (append) {
+        trace_pci_nvme_err_invalid_opc(rw->opcode);
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
     if (nvme_map_dptr(n, data_size, req)) {
         block_acct_invalid(blk_get_stats(n->conf.blk), acct);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
+    if (unlikely(n->params.zoned && req->fill_ofs == 0)) {
+        /* No backend I/O necessary, only need to fill the buffer */
+        nvme_fill_data(&req->qsg, &req->iov, 0, n->params.fill_pattern);
+        req->status = NVME_SUCCESS;
+        return NVME_SUCCESS;
+    }
+
+    data_offset = slba << data_shift;
+
     if (req->qsg.nsg > 0) {
         block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
                          acct);
@@ -705,9 +1094,388 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
                            req);
     }
 
+    if (n->params.zoned && is_write) {
+        req->cqe.result64 = nvme_advance_zone_wp(n, ns, zone, nlb);
+    }
+
     return NVME_NO_COMPLETE;
 }
 
+static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeCtrl *n, NvmeNamespace *ns,
+                                            NvmeCmd *c, uint64_t *slba,
+                                            uint32_t *zone_idx)
+{
+    uint32_t dw10 = le32_to_cpu(c->cdw10);
+    uint32_t dw11 = le32_to_cpu(c->cdw11);
+
+    if (!n->params.zoned) {
+        trace_pci_nvme_err_invalid_opc(c->opcode);
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
+    *slba = ((uint64_t)dw11) << 32 | dw10;
+    if (unlikely(*slba >= ns->id_ns.nsze)) {
+        trace_pci_nvme_err_invalid_lba_range(*slba, 0, ns->id_ns.nsze);
+        *slba = 0;
+        return NVME_LBA_RANGE | NVME_DNR;
+    }
+
+    *zone_idx = nvme_zone_idx(n, *slba);
+    assert(*zone_idx < n->num_zones);
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
+                               NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
+        /* fall through */
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_open_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_CLOSED;
+}
+
+static uint16_t nvme_close_zone(NvmeCtrl *n,  NvmeNamespace *ns,
+                                NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
+        /* fall through */
+    case NVME_ZONE_STATE_CLOSED:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_close_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN;
+}
+
+static uint16_t nvme_finish_zone(NvmeCtrl *n, NvmeNamespace *ns,
+                                 NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_EMPTY:
+        zone->w_ptr = nvme_zone_wr_boundary(zone);
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
+        /* fall through */
+    case NVME_ZONE_STATE_FULL:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_finish_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_CLOSED;
+}
+
+static uint16_t nvme_reset_zone(NvmeCtrl *n, NvmeNamespace *ns,
+                                NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_FULL:
+        zone->w_ptr = zone->d.zslba;
+        zone->d.wp = zone->w_ptr;
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EMPTY);
+        /* fall through */
+    case NVME_ZONE_STATE_EMPTY:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_reset_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_CLOSED ||
+           state == NVME_ZONE_STATE_FULL;
+}
+
+static uint16_t nvme_offline_zone(NvmeCtrl *n, NvmeNamespace *ns,
+                                  NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_READ_ONLY:
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_OFFLINE);
+        /* fall through */
+    case NVME_ZONE_STATE_OFFLINE:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_offline_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_READ_ONLY;
+}
+
+typedef uint16_t (*op_handler_t)(NvmeCtrl *, NvmeNamespace *, NvmeZone *,
+                                 uint8_t);
+typedef bool (*need_to_proc_zone_t)(uint8_t);
+
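+/*
+ * Generic zone operation dispatcher: apply op_hndlr to a single zone
+ * or, if "all" is set, to every zone whose state proc_zone() selects.
+ */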
+static uint16_t nvme_do_zone_op(NvmeCtrl *n, NvmeNamespace *ns,
+                                NvmeZone *zone, uint8_t state, bool all,
+                                op_handler_t op_hndlr,
+                                need_to_proc_zone_t proc_zone)
+{
+    int i;
+    uint16_t status = 0;
+
+    if (!all) {
+        status = op_hndlr(n, ns, zone, state);
+    } else {
+        for (i = 0; i < n->num_zones; i++, zone++) {
+            state = nvme_get_zone_state(zone);
+            if (proc_zone(state)) {
+                status = op_hndlr(n, ns, zone, state);
+                if (status != NVME_SUCCESS) {
+                    break;
+                }
+            }
+        }
+    }
+
+    return status;
+}
+
+static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
+    NvmeNamespace *ns = req->ns;
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint64_t slba = 0;
+    uint32_t zone_idx = 0;
+    uint16_t status;
+    uint8_t action, state;
+    bool all;
+    NvmeZone *zone;
+
+    action = dw13 & 0xff;
+    all = dw13 & 0x100;
+
+    req->status = NVME_SUCCESS;
+
+    if (!all) {
+        status = nvme_get_mgmt_zone_slba_idx(n, ns, cmd, &slba, &zone_idx);
+        if (status) {
+            return status;
+        }
+    }
+
+    zone = &ns->zone_array[zone_idx];
+    if (slba != zone->d.zslba) {
+        trace_pci_nvme_err_unaligned_zone_cmd(action, slba, zone->d.zslba);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+    state = nvme_get_zone_state(zone);
+
+    switch (action) {
+
+    case NVME_ZONE_ACTION_OPEN:
+        trace_pci_nvme_open_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_open_zone, nvme_cond_open_all);
+        break;
+
+    case NVME_ZONE_ACTION_CLOSE:
+        trace_pci_nvme_close_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_close_zone, nvme_cond_close_all);
+        break;
+
+    case NVME_ZONE_ACTION_FINISH:
+        trace_pci_nvme_finish_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_finish_zone, nvme_cond_finish_all);
+        break;
+
+    case NVME_ZONE_ACTION_RESET:
+        trace_pci_nvme_reset_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_reset_zone, nvme_cond_reset_all);
+        break;
+
+    case NVME_ZONE_ACTION_OFFLINE:
+        trace_pci_nvme_offline_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_offline_zone, nvme_cond_offline_all);
+        break;
+
+    case NVME_ZONE_ACTION_SET_ZD_EXT:
+        trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
+        return NVME_INVALID_FIELD | NVME_DNR;
+
+    default:
+        trace_pci_nvme_err_invalid_mgmt_action(action);
+        status = NVME_INVALID_FIELD;
+    }
+
+    if (status == NVME_ZONE_INVAL_TRANSITION) {
+        trace_pci_nvme_err_invalid_zone_state_transition(state, action, slba,
+                                                         zone->d.za);
+    }
+    if (status) {
+        status |= NVME_DNR;
+    }
+
+    return status;
+}
+
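+/*
+ * Check if a zone state matches the Zone Receive Action Specific
+ * filter (zrasf) given in the Zone Management Receive command.
+ */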
+static bool nvme_zone_matches_filter(uint32_t zafs, NvmeZone *zone)
+{
+    int zs = nvme_get_zone_state(zone);
+
+    switch (zafs) {
+    case NVME_ZONE_REPORT_ALL:
+        return true;
+    case NVME_ZONE_REPORT_EMPTY:
+        return (zs == NVME_ZONE_STATE_EMPTY);
+    case NVME_ZONE_REPORT_IMPLICITLY_OPEN:
+        return (zs == NVME_ZONE_STATE_IMPLICITLY_OPEN);
+    case NVME_ZONE_REPORT_EXPLICITLY_OPEN:
+        return (zs == NVME_ZONE_STATE_EXPLICITLY_OPEN);
+    case NVME_ZONE_REPORT_CLOSED:
+        return (zs == NVME_ZONE_STATE_CLOSED);
+    case NVME_ZONE_REPORT_FULL:
+        return (zs == NVME_ZONE_STATE_FULL);
+    case NVME_ZONE_REPORT_READ_ONLY:
+        return (zs == NVME_ZONE_STATE_READ_ONLY);
+    case NVME_ZONE_REPORT_OFFLINE:
+        return (zs == NVME_ZONE_STATE_OFFLINE);
+    default:
+        return false;
+    }
+}
+
+static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
+    NvmeNamespace *ns = req->ns;
+    uint64_t prp1 = le64_to_cpu(cmd->dptr.prp1);
+    uint64_t prp2 = le64_to_cpu(cmd->dptr.prp2);
+    /* cdw12 is zero-based number of dwords to return. Convert to bytes */
+    uint32_t len = (le32_to_cpu(cmd->cdw12) + 1) << 2;
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint32_t zone_idx, zra, zrasf, partial;
+    uint64_t max_zones, nr_zones = 0;
+    uint16_t ret;
+    uint64_t slba;
+    NvmeZoneDescr *z;
+    NvmeZone *zs;
+    NvmeZoneReportHeader *header;
+    void *buf, *buf_p;
+    size_t zone_entry_sz;
+
+    req->status = NVME_SUCCESS;
+
+    ret = nvme_get_mgmt_zone_slba_idx(n, ns, cmd, &slba, &zone_idx);
+    if (ret) {
+        return ret;
+    }
+
+    if (len < sizeof(NvmeZoneReportHeader)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zra = dw13 & 0xff;
+    if (!(zra == NVME_ZONE_REPORT || zra == NVME_ZONE_REPORT_EXTENDED)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zrasf = (dw13 >> 8) & 0xff;
+    if (zrasf > NVME_ZONE_REPORT_OFFLINE) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    partial = (dw13 >> 16) & 0x01;
+
+    zone_entry_sz = sizeof(NvmeZoneDescr);
+
+    max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
+    buf = g_malloc0(len);
+
+    header = (NvmeZoneReportHeader *)buf;
+    buf_p = buf + sizeof(NvmeZoneReportHeader);
+
+    while (zone_idx < n->num_zones && nr_zones < max_zones) {
+        zs = &ns->zone_array[zone_idx];
+
+        if (!nvme_zone_matches_filter(zrasf, zs)) {
+            zone_idx++;
+            continue;
+        }
+
+        z = (NvmeZoneDescr *)buf_p;
+        buf_p += sizeof(NvmeZoneDescr);
+        nr_zones++;
+
+        z->zt = zs->d.zt;
+        z->zs = zs->d.zs;
+        z->zcap = cpu_to_le64(zs->d.zcap);
+        z->zslba = cpu_to_le64(zs->d.zslba);
+        z->za = zs->d.za;
+
+        if (nvme_wp_is_valid(zs)) {
+            z->wp = cpu_to_le64(zs->d.wp);
+        } else {
+            z->wp = cpu_to_le64(~0ULL);
+        }
+
+        zone_idx++;
+    }
+
+    if (!partial) {
+        for (; zone_idx < n->num_zones; zone_idx++) {
+            zs = &ns->zone_array[zone_idx];
+            if (nvme_zone_matches_filter(zrasf, zs)) {
+                nr_zones++;
+            }
+        }
+    }
+    header->nr_zones = cpu_to_le64(nr_zones);
+
+    ret = nvme_dma_prp(n, (uint8_t *)buf, len, prp1, prp2,
+                       DMA_DIRECTION_FROM_DEVICE, req);
+    g_free(buf);
+
+    return ret;
+}
+
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
     uint32_t nsid = le32_to_cpu(req->cmd.nsid);
@@ -726,9 +1494,15 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
         return nvme_flush(n, req);
     case NVME_CMD_WRITE_ZEROES:
         return nvme_write_zeroes(n, req);
+    case NVME_CMD_ZONE_APPEND:
+        return nvme_rw(n, req, true);
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
-        return nvme_rw(n, req);
+        return nvme_rw(n, req, false);
+    case NVME_CMD_ZONE_MGMT_SEND:
+        return nvme_zone_mgmt_send(n, req);
+    case NVME_CMD_ZONE_MGMT_RECV:
+        return nvme_zone_mgmt_recv(n, req);
     default:
         trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -957,7 +1731,7 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
                         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
+static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t csi, uint32_t buf_len,
                                  uint64_t off, NvmeRequest *req)
 {
     NvmeCmd *cmd = &req->cmd;
@@ -985,11 +1759,19 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
     acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
     acs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
 
-    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
-                                  NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+    if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
+        iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
+                                      NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+    }
+    if (csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
+        iocs[NVME_CMD_ZONE_APPEND] = NVME_CMD_EFFECTS_CSUPP |
+                                     NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFFECTS_CSUPP;
+        iocs[NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFFECTS_CSUPP;
+    }
 
     trans_len = MIN(sizeof(cmd_eff_log) - off, buf_len);
 
@@ -1008,6 +1790,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
     uint8_t  lid = dw10 & 0xff;
     uint8_t  lsp = (dw10 >> 8) & 0xf;
     uint8_t  rae = (dw10 >> 15) & 0x1;
+    uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
     uint32_t numdl, numdu;
     uint64_t off, lpol, lpou;
     size_t   len;
@@ -1041,7 +1824,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
     case NVME_LOG_FW_SLOT_INFO:
         return nvme_fw_log_info(n, len, off, req);
     case NVME_LOG_CMD_EFFECTS:
-        return nvme_cmd_effects(n, len, off, req);
+        return nvme_cmd_effects(n, csi, len, off, req);
     default:
         trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1162,6 +1945,16 @@ static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, uint64_t prp1,
                         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static inline bool nvme_csi_has_nvm_support(NvmeNamespace *ns)
+{
+    switch (ns->csi) {
+    case NVME_CSI_NVM:
+    case NVME_CSI_ZONED:
+        return true;
+    }
+    return false;
+}
+
 static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1177,13 +1970,22 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    NvmeIdCtrlZoned *id;
     uint64_t prp1 = le64_to_cpu(c->prp1);
     uint64_t prp2 = le64_to_cpu(c->prp2);
+    uint16_t ret;
 
     trace_pci_nvme_identify_ctrl_csi(c->csi);
 
     if (c->csi == NVME_CSI_NVM) {
         return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
+    } else if (c->csi == NVME_CSI_ZONED && n->params.zoned) {
+        id = g_malloc0(sizeof(*id));
+        id->zasl = n->zasl;
+        ret = nvme_dma_prp(n, (uint8_t *)id, sizeof(*id), prp1, prp2,
+                           DMA_DIRECTION_FROM_DEVICE, req);
+        g_free(id);
+        return ret;
     }
 
     return NVME_INVALID_FIELD | NVME_DNR;
@@ -1212,8 +2014,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
         return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
     }
 
-    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns), prp1,
-                        prp2, DMA_DIRECTION_FROM_DEVICE, req);
+    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
+        return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns), prp1,
+                            prp2, DMA_DIRECTION_FROM_DEVICE, req);
+    }
+
+    return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
@@ -1239,8 +2045,12 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
         return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
     }
 
-    if (c->csi == NVME_CSI_NVM) {
+    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
         return nvme_rpt_empty_id_struct(n, prp1, prp2, req);
+    } else if (c->csi == NVME_CSI_ZONED && ns->csi == NVME_CSI_ZONED) {
+        return nvme_dma_prp(n, (uint8_t *)ns->id_ns_zoned,
+                            sizeof(*ns->id_ns_zoned), prp1, prp2,
+                            DMA_DIRECTION_FROM_DEVICE, req);
     }
 
     return NVME_INVALID_FIELD | NVME_DNR;
@@ -1300,7 +2110,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
 
     trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
 
-    if (c->csi != NVME_CSI_NVM) {
+    if (c->csi != NVME_CSI_NVM && c->csi != NVME_CSI_ZONED) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1364,7 +2174,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
     desc->nidt = NVME_NIDT_CSI;
     desc->nidl = NVME_NIDL_CSI;
     buf_ptr += sizeof(*desc);
-    *(uint8_t *)buf_ptr = NVME_CSI_NVM;
+    *(uint8_t *)buf_ptr = ns->csi;
 
     status = nvme_dma_prp(n, buf, data_len, prp1, prp2,
                           DMA_DIRECTION_FROM_DEVICE, req);
@@ -1387,6 +2197,9 @@ static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
     list = g_malloc0(data_len);
     ptr = (uint8_t *)list;
     NVME_SET_CSI(*ptr, NVME_CSI_NVM);
+    if (n->params.zoned) {
+        NVME_SET_CSI(*ptr, NVME_CSI_ZONED);
+    }
     status = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
                           DMA_DIRECTION_FROM_DEVICE, req);
     g_free(list);
@@ -1432,7 +2245,7 @@ static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
 {
     uint16_t sqid = le32_to_cpu(req->cmd.cdw10) & 0xffff;
 
-    req->cqe.result = 1;
+    req->cqe.result32 = 1;
     if (nvme_check_sqid(n, sqid)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
@@ -1613,7 +2426,7 @@ defaults:
     }
 
 out:
-    req->cqe.result = cpu_to_le32(result);
+    req->cqe.result32 = cpu_to_le32(result);
     return NVME_SUCCESS;
 }
 
@@ -1722,8 +2535,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
                                     ((dw11 >> 16) & 0xFFFF) + 1,
                                     n->params.max_ioqpairs,
                                     n->params.max_ioqpairs);
-        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-                                      ((n->params.max_ioqpairs - 1) << 16));
+        req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
+                                        ((n->params.max_ioqpairs - 1) << 16));
         break;
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
@@ -1955,6 +2768,20 @@ static int nvme_start_ctrl(NvmeCtrl *n)
                 n->namespaces[i].attached = true;
             }
             break;
+        case NVME_CSI_ZONED:
+            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
+                n->namespaces[i].attached = true;
+            }
+            break;
+        }
+    }
+
+    if (n->params.zoned) {
+        if (!n->zasl_bs) {
+            assert(n->params.mdts);
+            n->zasl = n->params.mdts;
+        } else {
+            n->zasl = 31 - clz32(n->zasl_bs / n->page_size);
         }
     }
 
@@ -2018,12 +2845,18 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
             } else {
                 switch (NVME_CC_CSS(data)) {
                 case CSS_NVM_ONLY:
-                    trace_pci_nvme_css_nvm_cset_selected_by_host(data &
-                                                                 0xffffffff);
+                    if (n->params.zoned) {
+                        NVME_GUEST_ERR(pci_nvme_err_only_zoned_cmd_set_avail,
+                                       "only NVM+ZONED command set can be selected");
+                        break;
+                    }
+                    trace_pci_nvme_css_nvm_cset_selected_by_host(data &
+                                                                 0xffffffff);
                     break;
                 case CSS_CSI:
                     NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
-                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
+                    trace_pci_nvme_css_all_csets_sel_by_host(data &
+                                                             0xffffffff);
                     break;
                 case CSS_ADMIN_ONLY:
                     break;
@@ -2355,6 +3188,129 @@ static const MemoryRegionOps nvme_cmb_ops = {
     },
 };
 
+static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
+                               uint64_t capacity)
+{
+    NvmeZone *zone;
+    uint64_t start = 0, zone_size = n->zone_size;
+    int i;
+
+    ns->zone_array = g_malloc0(n->zone_array_size);
+    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+    zone = ns->zone_array;
+
+    nvme_init_zone_list(ns->exp_open_zones);
+    nvme_init_zone_list(ns->imp_open_zones);
+    nvme_init_zone_list(ns->closed_zones);
+    nvme_init_zone_list(ns->full_zones);
+
+    for (i = 0; i < n->num_zones; i++, zone++) {
+        if (start + zone_size > capacity) {
+            zone_size = capacity - start;
+        }
+        zone->d.zt = NVME_ZONE_TYPE_SEQ_WRITE;
+        nvme_set_zone_state(zone, NVME_ZONE_STATE_EMPTY);
+        zone->d.za = 0;
+        zone->d.zcap = n->zone_capacity;
+        zone->d.zslba = start;
+        zone->d.wp = start;
+        zone->w_ptr = start;
+        zone->prev = 0;
+        zone->next = 0;
+        start += zone_size;
+    }
+
+    return 0;
+}
+
+static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
+{
+    uint64_t zone_size, zone_cap;
+    uint32_t nz;
+
+    if (n->params.zone_size_mb) {
+        zone_size = n->params.zone_size_mb;
+    } else {
+        zone_size = NVME_DEFAULT_ZONE_SIZE;
+    }
+    if (n->params.zone_capacity_mb) {
+        zone_cap = n->params.zone_capacity_mb;
+    } else {
+        zone_cap = zone_size;
+    }
+    n->zone_size = zone_size * MiB / n->conf.logical_block_size;
+    n->zone_capacity = zone_cap * MiB / n->conf.logical_block_size;
+    if (n->zone_capacity > n->zone_size) {
+        error_setg(errp, "zone capacity exceeds zone size");
+        return;
+    }
+
+    nz = DIV_ROUND_UP(n->ns_size / n->conf.logical_block_size, n->zone_size);
+    n->num_zones = nz;
+    n->zone_array_size = sizeof(NvmeZone) * nz;
+    n->zone_size_log2 = 0;
+    if (is_power_of_2(n->zone_size)) {
+        n->zone_size_log2 = 63 - clz64(n->zone_size);
+    }
+
+    if (!n->params.zasl_kb) {
+        n->zasl_bs = n->params.mdts ? 0 : NVME_DEFAULT_MAX_ZA_SIZE * KiB;
+    } else {
+        n->zasl_bs = n->params.zasl_kb * KiB;
+    }
+
+    return;
+}
+
+static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
+                              Error **errp)
+{
+    int ret;
+
+    ret = nvme_init_zone_meta(n, ns, n->num_zones * n->zone_size);
+    if (ret) {
+        error_setg(errp, "could not init zone metadata");
+        return -1;
+    }
+
+    ns->id_ns_zoned = g_malloc0(sizeof(*ns->id_ns_zoned));
+
+    /* MAR/MOR are 0's based values, 0xffffffff means no limit */
+    ns->id_ns_zoned->mar = 0xffffffff;
+    ns->id_ns_zoned->mor = 0xffffffff;
+    ns->id_ns_zoned->zoc = 0;
+    ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
+
+    ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->zone_size);
+    ns->id_ns_zoned->lbafe[lba_index].zdes = 0;
+
+    if (n->params.fill_pattern == 0) {
+        ns->id_ns.dlfeat = 0x01;
+    } else if (n->params.fill_pattern == 0xff) {
+        ns->id_ns.dlfeat = 0x02;
+    }
+
+    return 0;
+}
+
+static void nvme_zoned_clear(NvmeCtrl *n)
+{
+    int i;
+
+    for (i = 0; i < n->num_namespaces; i++) {
+        NvmeNamespace *ns = &n->namespaces[i];
+        g_free(ns->id_ns_zoned);
+        g_free(ns->zone_array);
+        g_free(ns->exp_open_zones);
+        g_free(ns->imp_open_zones);
+        g_free(ns->closed_zones);
+        g_free(ns->full_zones);
+    }
+}
+
 static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
     NvmeParams *params = &n->params;
@@ -2423,18 +3379,13 @@ static void nvme_init_state(NvmeCtrl *n)
 
 static void nvme_init_blk(NvmeCtrl *n, Error **errp)
 {
+    int64_t bs_size;
+
     if (!blkconf_blocksizes(&n->conf, errp)) {
         return;
     }
     blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
                                   false, errp);
-}
-
-static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
-{
-    int64_t bs_size;
-    NvmeIdNs *id_ns = &ns->id_ns;
-    int lba_index;
 
     bs_size = blk_getlength(n->conf.blk);
     if (bs_size < 0) {
@@ -2443,6 +3394,12 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     }
 
     n->ns_size = bs_size;
+}
+
+static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+{
+    NvmeIdNs *id_ns = &ns->id_ns;
+    int lba_index;
 
     ns->csi = NVME_CSI_NVM;
     qemu_uuid_generate(&ns->uuid); /* TODO make UUIDs persistent */
@@ -2450,8 +3407,18 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     id_ns->lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
 
+    if (n->params.zoned) {
+        ns->csi = NVME_CSI_ZONED;
+        id_ns->ncap = cpu_to_le64(n->zone_capacity * n->num_zones);
+        if (nvme_zoned_init_ns(n, ns, lba_index, errp) != 0) {
+            return;
+        }
+    } else {
+        ns->csi = NVME_CSI_NVM;
+        id_ns->ncap = id_ns->nsze;
+    }
+
     /* no thin provisioning */
-    id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
 }
 
@@ -2611,8 +3578,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_CQR(n->bar.cap, 1);
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
     /*
-     * The device now always supports NS Types, but all commands
-     * that support CSI field will only handle NVM Command Set.
+     * The device now always supports NS Types, even when the "zoned"
+     * property is set to false. In that case, all commands that support
+     * the CSI field only handle the NVM Command Set.
      */
     NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
@@ -2648,6 +3616,13 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
         return;
     }
 
+    if (n->params.zoned) {
+        nvme_zoned_init_ctrl(n, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
     nvme_init_ctrl(n, pci_dev);
 
     ns = n->namespaces;
@@ -2666,6 +3641,9 @@ static void nvme_exit(PCIDevice *pci_dev)
     NvmeCtrl *n = NVME(pci_dev);
 
     nvme_clear_ctrl(n);
+    if (n->params.zoned) {
+        nvme_zoned_clear(n);
+    }
     g_free(n->namespaces);
     g_free(n->cq);
     g_free(n->sq);
@@ -2693,6 +3671,13 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT8("aerl", NvmeCtrl, params.aerl, 3),
     DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64),
     DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
+    DEFINE_PROP_BOOL("zoned", NvmeCtrl, params.zoned, false),
+    DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size_mb,
+                       NVME_DEFAULT_ZONE_SIZE),
+    DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity_mb, 0),
+    DEFINE_PROP_UINT32("zone_append_size_limit", NvmeCtrl, params.zasl_kb, 0),
+    DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
+    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/block/nvme.h b/include/block/nvme.h
index ced66cf4eb..f630dfd163 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -651,8 +651,10 @@ typedef struct QEMU_PACKED NvmeAerResult {
 } NvmeAerResult;
 
 typedef struct QEMU_PACKED NvmeCqe {
-    uint32_t    result;
-    uint32_t    rsvd;
+    union {
+        uint64_t     result64;
+        uint32_t     result32;
+    };
     uint16_t    sq_head;
     uint16_t    sq_id;
     uint16_t    cid;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 10/14] hw/block/nvme: Introduce max active and open zone limits
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (8 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 11/14] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Added two module properties, "max_active" and "max_open", to control
the maximum number of zones that can be active or open. Once these
properties are set to non-default (non-zero) values, the limits are
checked during I/O and the Too Many Active Zones or Too Many Open
Zones status code is returned if a limit would be exceeded.
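
For example (an illustrative command line, not taken from this patch,
assuming a preconfigured "nvme0" drive and an arbitrary serial), a
device limited to 16 active and 8 open zones per namespace could be
started with:

    -device nvme,drive=nvme0,serial=deadbeef,zoned=true,max_active=16,max_open=8

With these limits in place, an explicit Open Zone on a ninth zone
fails with Too Many Open Zones, while an implicitly opening write
first tries to automatically close one of the implicitly open zones.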

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 179 +++++++++++++++++++++++++++++++++++++++++++++++-
 hw/block/nvme.h |   4 ++
 2 files changed, 181 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 287984dd37..3f397a3ea7 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -176,6 +176,87 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
     zone->prev = zone->next = 0;
 }
 
+/*
+ * Take the first zone out from a list, return NULL if the list is empty.
+ */
+static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, NvmeNamespace *ns,
+                                       NvmeZoneList *zl)
+{
+    NvmeZone *zone = nvme_peek_zone_head(ns, zl);
+
+    if (zone) {
+        --zl->size;
+        if (zl->size == 0) {
+            zl->head = NVME_ZONE_LIST_NIL;
+            zl->tail = NVME_ZONE_LIST_NIL;
+        } else {
+            zl->head = zone->next;
+            ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+        }
+        zone->prev = zone->next = 0;
+    }
+
+    return zone;
+}
+
+/*
+ * Check if we can open a zone without exceeding open/active limits.
+ * AOR stands for "Active and Open Resources" (see TP 4053 section 2.5).
+ */
+static int nvme_aor_check(NvmeCtrl *n, NvmeNamespace *ns,
+                          uint32_t act, uint32_t opn)
+{
+    if (n->params.max_active_zones != 0 &&
+        ns->nr_active_zones + act > n->params.max_active_zones) {
+        trace_pci_nvme_err_insuff_active_res(n->params.max_active_zones);
+        return NVME_ZONE_TOO_MANY_ACTIVE | NVME_DNR;
+    }
+    if (n->params.max_open_zones != 0 &&
+        ns->nr_open_zones + opn > n->params.max_open_zones) {
+        trace_pci_nvme_err_insuff_open_res(n->params.max_open_zones);
+        return NVME_ZONE_TOO_MANY_OPEN | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static inline void nvme_aor_inc_open(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    assert(ns->nr_open_zones >= 0);
+    if (n->params.max_open_zones) {
+        ns->nr_open_zones++;
+        assert(ns->nr_open_zones <= n->params.max_open_zones);
+    }
+}
+
+static inline void nvme_aor_dec_open(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    if (n->params.max_open_zones) {
+        assert(ns->nr_open_zones > 0);
+        ns->nr_open_zones--;
+    }
+    assert(ns->nr_open_zones >= 0);
+}
+
+static inline void nvme_aor_inc_active(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    assert(ns->nr_active_zones >= 0);
+    if (n->params.max_active_zones) {
+        ns->nr_active_zones++;
+        assert(ns->nr_active_zones <= n->params.max_active_zones);
+    }
+}
+
+static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    if (n->params.max_active_zones) {
+        assert(ns->nr_active_zones > 0);
+        ns->nr_active_zones--;
+        assert(ns->nr_active_zones >= ns->nr_open_zones);
+    }
+    assert(ns->nr_active_zones >= 0);
+}
+
 static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
                                    NvmeZone *zone, uint8_t state)
 {
@@ -790,6 +871,41 @@ static uint16_t nvme_check_zone_read(NvmeCtrl *n, NvmeZone *zone, uint64_t slba,
     return status;
 }
 
+static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
+                                      bool implicit, bool adding_active)
+{
+    NvmeZone *zone;
+
+    if (implicit && n->params.max_open_zones &&
+        ns->nr_open_zones == n->params.max_open_zones) {
+        zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
+        if (zone) {
+            /*
+             * Automatically close this implicitly open zone.
+             */
+            nvme_aor_dec_open(n, ns);
+            nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
+        }
+    }
+}
+
+static uint16_t nvme_auto_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
+                                    NvmeZone *zone)
+{
+    uint16_t status = NVME_SUCCESS;
+    uint8_t zs = nvme_get_zone_state(zone);
+
+    if (zs == NVME_ZONE_STATE_EMPTY) {
+        nvme_auto_transition_zone(n, ns, true, true);
+        status = nvme_aor_check(n, ns, 1, 1);
+    } else if (zs == NVME_ZONE_STATE_CLOSED) {
+        nvme_auto_transition_zone(n, ns, true, false);
+        status = nvme_aor_check(n, ns, 0, 1);
+    }
+
+    return status;
+}
+
 static inline uint32_t nvme_zone_idx(NvmeCtrl *n, uint64_t slba)
 {
     return n->zone_size_log2 > 0 ? slba >> n->zone_size_log2 :
@@ -837,7 +953,11 @@ static bool nvme_finalize_zoned_write(NvmeCtrl *n, NvmeRequest *req,
         switch (zs) {
         case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_aor_dec_open(n, ns);
+            /* fall through */
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_dec_active(n, ns);
+            /* fall through */
         case NVME_ZONE_STATE_EMPTY:
             nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
             /* fall through */
@@ -866,7 +986,10 @@ static uint64_t nvme_advance_zone_wp(NvmeCtrl *n, NvmeNamespace *ns,
         zs = nvme_get_zone_state(zone);
         switch (zs) {
         case NVME_ZONE_STATE_EMPTY:
+            nvme_aor_inc_active(n, ns);
+            /* fall through */
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_inc_open(n, ns);
             nvme_assign_zone_state(n, ns, zone,
                                    NVME_ZONE_STATE_IMPLICITLY_OPEN);
         }
@@ -958,6 +1081,11 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
                                                zone->w_ptr);
             return NVME_ZONE_INVALID_WRITE | NVME_DNR;
         }
+
+        status = nvme_auto_open_zone(n, ns, zone);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
     }
 
     block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
@@ -1033,6 +1161,12 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
                                                    zone->w_ptr);
                 return NVME_ZONE_INVALID_WRITE | NVME_DNR;
             }
+
+            status = nvme_auto_open_zone(n, ns, zone);
+            if (status != NVME_SUCCESS) {
+                return status;
+            }
+
             req->fill_ofs = -1LL;
         } else {
             status = nvme_check_zone_read(n, zone, slba, nlb,
@@ -1129,9 +1263,27 @@ static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeCtrl *n, NvmeNamespace *ns,
 static uint16_t nvme_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
                                NvmeZone *zone, uint8_t state)
 {
+    uint16_t status;
+
     switch (state) {
     case NVME_ZONE_STATE_EMPTY:
+        nvme_auto_transition_zone(n, ns, false, true);
+        status = nvme_aor_check(n, ns, 1, 0);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+        nvme_aor_inc_active(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        status = nvme_aor_check(n, ns, 0, 1);
+        if (status != NVME_SUCCESS) {
+            if (state == NVME_ZONE_STATE_EMPTY) {
+                nvme_aor_dec_active(n, ns);
+            }
+            return status;
+        }
+        nvme_aor_inc_open(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
         /* fall through */
@@ -1153,6 +1305,7 @@ static uint16_t nvme_close_zone(NvmeCtrl *n,  NvmeNamespace *ns,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(n, ns);
         nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
         /* fall through */
     case NVME_ZONE_STATE_CLOSED:
@@ -1174,7 +1327,11 @@ static uint16_t nvme_finish_zone(NvmeCtrl *n, NvmeNamespace *ns,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        nvme_aor_dec_active(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_EMPTY:
         zone->w_ptr = nvme_zone_wr_boundary(zone);
         nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
@@ -1199,7 +1356,11 @@ static uint16_t nvme_reset_zone(NvmeCtrl *n, NvmeNamespace *ns,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        nvme_aor_dec_active(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_FULL:
         zone->w_ptr = zone->d.zslba;
         zone->d.wp = zone->w_ptr;
@@ -3262,6 +3423,18 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
         n->zasl_bs = n->params.zasl_kb * KiB;
     }
 
+    /* Make sure that the values of all Zoned Command Set properties are sane */
+    if (n->params.max_open_zones > nz) {
+        warn_report("max_open_zones value %u exceeds the number of zones %u,"
+                    " adjusting", n->params.max_open_zones, nz);
+        n->params.max_open_zones = nz;
+    }
+    if (n->params.max_active_zones > nz) {
+        warn_report("max_active_zones value %u exceeds the number of zones %u,"
+                    " adjusting", n->params.max_active_zones, nz);
+        n->params.max_active_zones = nz;
+    }
+
     return;
 }
 
@@ -3279,8 +3452,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     ns->id_ns_zoned = g_malloc0(sizeof(*ns->id_ns_zoned));
 
     /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
-    ns->id_ns_zoned->mar = 0xffffffff;
-    ns->id_ns_zoned->mor = 0xffffffff;
+    ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
+    ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
     ns->id_ns_zoned->zoc = 0;
     ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
 
@@ -3676,6 +3849,8 @@ static Property nvme_props[] = {
                        NVME_DEFAULT_ZONE_SIZE),
     DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity_mb, 0),
     DEFINE_PROP_UINT32("zone_append_size_limit", NvmeCtrl, params.zasl_kb, 0),
+    DEFINE_PROP_UINT32("max_active", NvmeCtrl, params.max_active_zones, 0),
+    DEFINE_PROP_UINT32("max_open", NvmeCtrl, params.max_open_zones, 0),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
     DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
     DEFINE_PROP_END_OF_LIST(),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index d9f307f0ed..6efd566cb2 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -22,6 +22,8 @@ typedef struct NvmeParams {
     uint32_t    zasl_kb;
     uint64_t    zone_size_mb;
     uint64_t    zone_capacity_mb;
+    uint32_t    max_active_zones;
+    uint32_t    max_open_zones;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -103,6 +105,8 @@ typedef struct NvmeNamespace {
     NvmeZoneList    *imp_open_zones;
     NvmeZoneList    *closed_zones;
     NvmeZoneList    *full_zones;
+    int32_t         nr_open_zones;
+    int32_t         nr_active_zones;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 11/14] hw/block/nvme: Support Zone Descriptor Extensions
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (9 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 10/14] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 12/14] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

A Zone Descriptor Extension is a label that can be assigned to a zone.
It can only be set on an Empty zone, which then transitions to Closed,
and it stays assigned until the zone is reset.

This commit adds a new optional module property, "zone_descr_ext_size".
Its value must be a multiple of 64 bytes. If this value is non-zero,
it becomes possible to assign extensions of that size to any Empty
zone. The default value for this property is 0, so setting extensions
is disabled by default.
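
The 64-byte granularity comes from the ZDES field reported in the
zoned LBA format extension: it is eight bits wide and counts 64B
units. A minimal sketch of the constraint enforced by this patch
(simplified stand-alone C, not the patch code itself):

    #include <stdbool.h>
    #include <stdint.h>

    /* ZDES is an 8-bit count of 64-byte units, so a valid extension
     * size is a multiple of 64 no larger than 255 * 64 bytes. */
    static bool zd_ext_size_valid(uint32_t sz)
    {
        return (sz & 0x3f) == 0 && (sz >> 6) <= 0xff;
    }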

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++---
 hw/block/nvme.h |  8 ++++++
 2 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3f397a3ea7..cfa791aa72 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1400,6 +1400,26 @@ static bool nvme_cond_offline_all(uint8_t state)
     return state == NVME_ZONE_STATE_READ_ONLY;
 }
 
+static uint16_t nvme_set_zd_ext(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state)
+{
+    uint16_t status;
+
+    if (state == NVME_ZONE_STATE_EMPTY) {
+        nvme_auto_transition_zone(n, ns, false, true);
+        status = nvme_aor_check(n, ns, 1, 0);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+        nvme_aor_inc_active(n, ns);
+        zone->d.za |= NVME_ZA_ZD_EXT_VALID;
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
 typedef uint16_t (*op_handler_t)(NvmeCtrl *, NvmeNamespace *, NvmeZone *,
                                  uint8_t);
 typedef bool (*need_to_proc_zone_t)(uint8_t);
@@ -1434,12 +1454,14 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
     NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
     uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint64_t prp1, prp2;
     uint64_t slba = 0;
     uint32_t zone_idx = 0;
     uint16_t status;
     uint8_t action, state;
     bool all;
     NvmeZone *zone;
+    uint8_t *zd_ext;
 
     action = dw13 & 0xff;
     all = dw13 & 0x100;
@@ -1494,7 +1516,24 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
 
     case NVME_ZONE_ACTION_SET_ZD_EXT:
         trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
-        return NVME_INVALID_FIELD | NVME_DNR;
+        if (all || !n->params.zd_extension_size) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+        zd_ext = nvme_get_zd_extension(n, ns, zone_idx);
+        prp1 = le64_to_cpu(cmd->dptr.prp1);
+        prp2 = le64_to_cpu(cmd->dptr.prp2);
+        status = nvme_dma_prp(n, zd_ext, n->params.zd_extension_size,
+                              prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
+        if (status) {
+            trace_pci_nvme_err_zd_extension_map_error(zone_idx);
+            return status;
+        }
+
+        status = nvme_set_zd_ext(n, ns, zone, state);
+        if (status == NVME_SUCCESS) {
+            trace_pci_nvme_zd_extension_set(zone_idx);
+            return status;
+        }
         break;
 
     default:
@@ -1574,7 +1613,7 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+    if (zra == NVME_ZONE_REPORT_EXTENDED && !n->params.zd_extension_size) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1586,6 +1625,9 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
     partial = (dw13 >> 16) & 0x01;
 
     zone_entry_sz = sizeof(NvmeZoneDescr);
+    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+        zone_entry_sz += n->params.zd_extension_size;
+    }
 
     max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
     buf = g_malloc0(len);
@@ -1617,6 +1659,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
             z->wp = cpu_to_le64(~0ULL);
         }
 
+        if (zra == NVME_ZONE_REPORT_EXTENDED) {
+            if (zs->d.za & NVME_ZA_ZD_EXT_VALID) {
+                memcpy(buf_p, nvme_get_zd_extension(n, ns, zone_idx),
+                       n->params.zd_extension_size);
+            }
+            buf_p += n->params.zd_extension_size;
+        }
+
         zone_idx++;
     }
 
@@ -2727,7 +2777,6 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req)
 
     n->aer_reqs[n->outstanding_aers] = req;
     n->outstanding_aers++;
-
     if (!QTAILQ_EMPTY(&n->aer_queue)) {
         nvme_process_aers(n);
     }
@@ -3361,6 +3410,7 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
     ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
     ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
     ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->zd_extensions = g_malloc0(n->params.zd_extension_size * n->num_zones);
     zone = ns->zone_array;
 
     nvme_init_zone_list(ns->exp_open_zones);
@@ -3434,6 +3484,17 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
                     " adjusting", n->params.max_active_zones, nz);
         n->params.max_active_zones = nz;
     }
+    if (n->params.zd_extension_size) {
+        if (n->params.zd_extension_size & 0x3f) {
+            error_setg(errp,
+                "zone descriptor extension size must be a multiple of 64B");
+            return;
+        }
+        if ((n->params.zd_extension_size >> 6) > 0xff) {
+            error_setg(errp, "zone descriptor extension size is too large");
+            return;
+        }
+    }
 
     return;
 }
@@ -3458,7 +3519,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
 
     ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->zone_size);
-    ns->id_ns_zoned->lbafe[lba_index].zdes = 0;
+    ns->id_ns_zoned->lbafe[lba_index].zdes =
+        n->params.zd_extension_size >> 6; /* Units of 64B */
 
     if (n->params.fill_pattern == 0) {
         ns->id_ns.dlfeat = 0x01;
@@ -3481,6 +3543,7 @@ static void nvme_zoned_clear(NvmeCtrl *n)
         g_free(ns->imp_open_zones);
         g_free(ns->closed_zones);
         g_free(ns->full_zones);
+        g_free(ns->zd_extensions);
     }
 }
 
@@ -3849,6 +3912,8 @@ static Property nvme_props[] = {
                        NVME_DEFAULT_ZONE_SIZE),
     DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity_mb, 0),
     DEFINE_PROP_UINT32("zone_append_size_limit", NvmeCtrl, params.zasl_kb, 0),
+    DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeCtrl,
+                       params.zd_extension_size, 0),
     DEFINE_PROP_UINT32("max_active", NvmeCtrl, params.max_active_zones, 0),
     DEFINE_PROP_UINT32("max_open", NvmeCtrl, params.max_open_zones, 0),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 6efd566cb2..0e82fca815 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -24,6 +24,7 @@ typedef struct NvmeParams {
     uint64_t    zone_capacity_mb;
     uint32_t    max_active_zones;
     uint32_t    max_open_zones;
+    uint32_t    zd_extension_size;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -105,6 +106,7 @@ typedef struct NvmeNamespace {
     NvmeZoneList    *imp_open_zones;
     NvmeZoneList    *closed_zones;
     NvmeZoneList    *full_zones;
+    uint8_t         *zd_extensions;
     int32_t         nr_open_zones;
     int32_t         nr_active_zones;
 } NvmeNamespace;
@@ -218,6 +220,12 @@ static inline bool nvme_wp_is_valid(NvmeZone *zone)
            st != NVME_ZONE_STATE_OFFLINE;
 }
 
+static inline uint8_t *nvme_get_zd_extension(NvmeCtrl *n, NvmeNamespace *ns,
+                                             uint32_t zone_idx)
+{
+    return &ns->zd_extensions[zone_idx * n->params.zd_extension_size];
+}
+
 /*
  * Initialize a zone list head.
  */
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 12/14] hw/block/nvme: Add injection of Offline/Read-Only zones
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (10 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 11/14] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 13/14] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

The ZNS specification defines two zone conditions, Offline and
Read-Only, for zones that can no longer function properly, possibly
because of flash wear or another internal fault. It is useful to be
able to "inject" a small number of such zones for testing purposes.

This commit defines two optional device properties, "offline_zones"
and "rdonly_zones". Users can assign non-zero values to these
properties to specify the number of zones to be initialized as
Offline or Read-Only. The actual number of injected zones may be
smaller than the requested amount; Read-Only and Offline counts are
expected to be much smaller than the total number of zones on a drive.
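
For example (an illustrative command line, not taken from this patch),
a device with two Offline and three Read-Only zones per namespace
could be started with:

    -device nvme,drive=nvme0,serial=deadbeef,zoned=true,offline_zones=2,rdonly_zones=3

The zones to mark are picked pseudo-randomly among zone indices at or
above max_open, so the first zones of each namespace always remain
fully usable.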

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme.h |  2 ++
 2 files changed, 48 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index cfa791aa72..4630be38d7 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -3402,8 +3402,11 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
                                uint64_t capacity)
 {
     NvmeZone *zone;
+    Error *err = NULL;
     uint64_t start = 0, zone_size = n->zone_size;
+    uint32_t rnd;
     int i;
+    uint16_t zs;
 
     ns->zone_array = g_malloc0(n->zone_array_size);
     ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
@@ -3434,6 +3437,37 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
         start += zone_size;
     }
 
+    /* If required, make some zones Offline or Read Only */
+
+    for (i = 0; i < n->params.nr_offline_zones; i++) {
+        do {
+            qcrypto_random_bytes(&rnd, sizeof(rnd), &err);
+            rnd %= n->num_zones;
+        } while (rnd < n->params.max_open_zones);
+        zone = &ns->zone_array[rnd];
+        zs = nvme_get_zone_state(zone);
+        if (zs != NVME_ZONE_STATE_OFFLINE) {
+            nvme_set_zone_state(zone, NVME_ZONE_STATE_OFFLINE);
+        } else {
+            i--;
+        }
+    }
+
+    for (i = 0; i < n->params.nr_rdonly_zones; i++) {
+        do {
+            qcrypto_random_bytes(&rnd, sizeof(rnd), &err);
+            rnd %= n->num_zones;
+        } while (rnd < n->params.max_open_zones);
+        zone = &ns->zone_array[rnd];
+        zs = nvme_get_zone_state(zone);
+        if (zs != NVME_ZONE_STATE_OFFLINE &&
+            zs != NVME_ZONE_STATE_READ_ONLY) {
+            nvme_set_zone_state(zone, NVME_ZONE_STATE_READ_ONLY);
+        } else {
+            i--;
+        }
+    }
+
     return 0;
 }
 
@@ -3484,6 +3518,16 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
                     " adjusting", n->params.max_active_zones, nz);
         n->params.max_active_zones = nz;
     }
+    if (n->params.max_open_zones < nz) {
+        if (n->params.nr_offline_zones > nz - n->params.max_open_zones) {
+            n->params.nr_offline_zones = nz - n->params.max_open_zones;
+        }
+        if (n->params.nr_rdonly_zones >
+            nz - n->params.max_open_zones - n->params.nr_offline_zones) {
+            n->params.nr_rdonly_zones =
+                nz - n->params.max_open_zones - n->params.nr_offline_zones;
+        }
+    }
     if (n->params.zd_extension_size) {
         if (n->params.zd_extension_size & 0x3f) {
             error_setg(errp,
@@ -3916,6 +3960,8 @@ static Property nvme_props[] = {
                        params.zd_extension_size, 0),
     DEFINE_PROP_UINT32("max_active", NvmeCtrl, params.max_active_zones, 0),
     DEFINE_PROP_UINT32("max_open", NvmeCtrl, params.max_open_zones, 0),
+    DEFINE_PROP_UINT32("offline_zones", NvmeCtrl, params.nr_offline_zones, 0),
+    DEFINE_PROP_UINT32("rdonly_zones", NvmeCtrl, params.nr_rdonly_zones, 0),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
     DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
     DEFINE_PROP_END_OF_LIST(),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 0e82fca815..dea0c12792 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -25,6 +25,8 @@ typedef struct NvmeParams {
     uint32_t    max_active_zones;
     uint32_t    max_open_zones;
     uint32_t    zd_extension_size;
+    uint32_t    nr_offline_zones;
+    uint32_t    nr_rdonly_zones;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 13/14] hw/block/nvme: Use zone metadata file for persistence
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (11 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 12/14] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-23 18:20 ` [PATCH v4 14/14] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
  2020-09-24 21:07 ` [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Klaus Jensen
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

A ZNS drive that is emulated by this module is currently initialized
with all zones Empty upon startup. However, actual ZNS SSDs save the
state and condition of all zones in their internal NVRAM in the event
of power loss. When such a drive is powered up again, it closes or
finishes all zones that were open at the moment of shutdown. Besides
that, the write pointer position and the state and condition of all
zones are preserved across power-downs.

This commit adds support for persistent zone metadata to the device.
A new optional module property, "zone_file", is introduced. If added
to the command line, this property specifies the name of the file
that stores the zone metadata. If "zone_file" is omitted, the device
is initialized with all zones empty, the same as before.

If zone metadata is configured to be persistent, then zone descriptor
extensions also persist across controller shutdowns.
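
As a rough sketch of the layout introduced here (derived from the
size calculation in nvme_zoned_init_ctrl() below, names abridged),
each namespace owns one page-aligned slice of the zone file:

    offset 0                 NvmeZoneMeta header (magic, version,
                             geometry, zone lists, dirty flag)
    + sizeof(NvmeZoneMeta)   NvmeZone array, num_zones entries
    + zone_array_size        zone descriptor extensions,
                             num_zones * zd_extension_size bytes

Each slice is rounded up to the host page size; the whole file is
mapped with memory_region_init_ram_from_fd() and flushed with
memory_region_msync() whenever zone state changes.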

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c       | 377 +++++++++++++++++++++++++++++++++++++++---
 hw/block/nvme.h       |  38 +++++
 hw/block/trace-events |   1 +
 3 files changed, 395 insertions(+), 21 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4630be38d7..5f55e86a9a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -111,6 +111,8 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
 };
 
 static void nvme_process_sq(void *opaque);
+static void nvme_sync_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
+                                NvmeZone *zone, int len);
 
 static uint16_t nvme_cid(NvmeRequest *req)
 {
@@ -146,6 +148,7 @@ static void nvme_add_zone_tail(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
         zl->tail = idx;
     }
     zl->size++;
+    nvme_set_zone_meta_dirty(n, ns, true);
 }
 
 /*
@@ -162,12 +165,15 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
     if (zl->size == 0) {
         zl->head = NVME_ZONE_LIST_NIL;
         zl->tail = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(n, ns, true);
     } else if (idx == zl->head) {
         zl->head = zone->next;
         ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(n, ns, true);
     } else if (idx == zl->tail) {
         zl->tail = zone->prev;
         ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(n, ns, true);
     } else {
         ns->zone_array[zone->next].prev = zone->prev;
         ns->zone_array[zone->prev].next = zone->next;
@@ -194,6 +200,7 @@ static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, NvmeNamespace *ns,
             ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
         }
         zone->prev = zone->next = 0;
+        nvme_set_zone_meta_dirty(n, ns, true);
     }
 
     return zone;
@@ -290,11 +297,13 @@ static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
         break;
     case NVME_ZONE_STATE_FULL:
         nvme_add_zone_tail(n, ns, ns->full_zones, zone);
+        /* fall through */
     case NVME_ZONE_STATE_READ_ONLY:
         break;
     default:
         zone->d.za = 0;
     }
+    nvme_sync_zone_file(n, ns, zone, sizeof(NvmeZone));
 }
 
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
@@ -3398,9 +3407,114 @@ static const MemoryRegionOps nvme_cmb_ops = {
     },
 };
 
-static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
+static int nvme_validate_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
+                                   uint64_t capacity)
+{
+    NvmeZoneMeta *meta = ns->zone_meta;
+    NvmeZone *zone = ns->zone_array;
+    uint64_t start = 0, zone_size = n->zone_size;
+    int i, n_imp_open = 0, n_exp_open = 0, n_closed = 0, n_full = 0;
+
+    if (meta->magic != NVME_ZONE_META_MAGIC) {
+        return 1;
+    }
+    if (meta->version != NVME_ZONE_META_VER) {
+        return 2;
+    }
+    if (meta->zone_size != zone_size) {
+        return 3;
+    }
+    if (meta->zone_capacity != n->zone_capacity) {
+        return 4;
+    }
+    if (meta->nr_offline_zones != n->params.nr_offline_zones) {
+        return 5;
+    }
+    if (meta->nr_rdonly_zones != n->params.nr_rdonly_zones) {
+        return 6;
+    }
+    if (meta->lba_size != n->conf.logical_block_size) {
+        return 7;
+    }
+    if (meta->zd_extension_size != n->params.zd_extension_size) {
+        return 8;
+    }
+
+    for (i = 0; i < n->num_zones; i++, zone++) {
+        if (start + zone_size > capacity) {
+            zone_size = capacity - start;
+        }
+        if (zone->d.zt != NVME_ZONE_TYPE_SEQ_WRITE) {
+            return 9;
+        }
+        if (zone->d.zcap != n->zone_capacity) {
+            return 10;
+        }
+        if (zone->d.zslba != start) {
+            return 11;
+        }
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_OFFLINE:
+        case NVME_ZONE_STATE_READ_ONLY:
+            if (zone->d.wp != start) {
+                return 12;
+            }
+            break;
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_imp_open++;
+            break;
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_exp_open++;
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_closed++;
+            break;
+        case NVME_ZONE_STATE_FULL:
+            if (zone->d.wp != zone->d.zslba + zone->d.zcap) {
+                return 14;
+            }
+            n_full++;
+            break;
+        default:
+            return 15;
+        }
+
+        start += zone_size;
+    }
+
+    if (n_exp_open != nvme_zone_list_size(ns->exp_open_zones)) {
+        return 16;
+    }
+    if (n_imp_open != nvme_zone_list_size(ns->imp_open_zones)) {
+        return 17;
+    }
+    if (n_closed != nvme_zone_list_size(ns->closed_zones)) {
+        return 18;
+    }
+    if (n_full != nvme_zone_list_size(ns->full_zones)) {
+        return 19;
+    }
+
+    return 0;
+}
+
+static int nvme_init_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
                                uint64_t capacity)
 {
+    NvmeZoneMeta *meta = ns->zone_meta;
     NvmeZone *zone;
     Error *err = NULL;
     uint64_t start = 0, zone_size = n->zone_size;
@@ -3408,18 +3522,33 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
     int i;
     uint16_t zs;
 
-    ns->zone_array = g_malloc0(n->zone_array_size);
-    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->zd_extensions = g_malloc0(n->params.zd_extension_size * n->num_zones);
+    if (n->params.zone_file) {
+        meta->magic = NVME_ZONE_META_MAGIC;
+        meta->version = NVME_ZONE_META_VER;
+        meta->zone_size = zone_size;
+        meta->zone_capacity = n->zone_capacity;
+        meta->lba_size = n->conf.logical_block_size;
+        meta->nr_offline_zones = n->params.nr_offline_zones;
+        meta->nr_rdonly_zones = n->params.nr_rdonly_zones;
+        meta->zd_extension_size = n->params.zd_extension_size;
+    } else {
+        ns->zone_array = g_malloc0(n->zone_array_size);
+        ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->zd_extensions =
+            g_malloc0(n->params.zd_extension_size * n->num_zones);
+    }
     zone = ns->zone_array;
 
     nvme_init_zone_list(ns->exp_open_zones);
     nvme_init_zone_list(ns->imp_open_zones);
     nvme_init_zone_list(ns->closed_zones);
     nvme_init_zone_list(ns->full_zones);
+    if (n->params.zone_file) {
+        nvme_set_zone_meta_dirty(n, ns, true);
+    }
 
     for (i = 0; i < n->num_zones; i++, zone++) {
         if (start + zone_size > capacity) {
@@ -3471,7 +3600,196 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
     return 0;
 }
 
-static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
+static void nvme_open_zone_file(NvmeCtrl *n, bool *init_meta, Error **errp)
+{
+#ifdef CONFIG_POSIX
+
+    struct stat statbuf;
+    Error *local_err = NULL;
+    size_t fsize;
+    int ret;
+
+    ret = stat(n->params.zone_file, &statbuf);
+    if (ret && errno == ENOENT) {
+        *init_meta = true;
+    } else if (!S_ISREG(statbuf.st_mode)) {
+        error_setg(errp, "\"%s\" is not a regular file",
+                   n->params.zone_file);
+        return;
+    }
+
+    n->zone_file_fd = open(n->params.zone_file,
+                           O_RDWR | O_LARGEFILE | O_BINARY | O_CREAT, 0644);
+    if (n->zone_file_fd < 0) {
+        error_setg_errno(errp, errno, "failed to create zone file \"%s\"",
+                         n->params.zone_file);
+        return;
+    }
+
+    fsize = n->meta_size * n->num_namespaces;
+
+    if (stat(n->params.zone_file, &statbuf)) {
+        error_setg_errno(errp, errno, "can't stat zone file \"%s\"",
+                         n->params.zone_file);
+        return;
+    }
+    if (statbuf.st_size != fsize) {
+        ret = ftruncate(n->zone_file_fd, fsize);
+        if (ret < 0) {
+            error_setg_errno(errp, errno, "can't truncate zone file \"%s\"",
+                             n->params.zone_file);
+            return;
+        }
+        *init_meta = true;
+    }
+
+    memory_region_init_ram_from_fd(&n->zone_mem, OBJECT(n), "nvme-zone-meta",
+                                   fsize, true, n->zone_file_fd, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+
+    if (*init_meta) {
+        trace_pci_nvme_initialized_zone_file(n->params.zone_file);
+    }
+    return;
+
+#else
+
+    error_setg(errp, "zone metadata file not supported on this host");
+
+#endif /* CONFIG_POSIX */
+}
+
+static int nvme_map_zone_file(NvmeCtrl *n, NvmeNamespace *ns, bool *init_meta)
+{
+    ns->zone_meta = (void *)memory_region_get_ram_ptr(&n->zone_mem) +
+                    n->meta_size * (ns->nsid - 1);
+
+    ns->zone_array = (NvmeZone *)(ns->zone_meta + 1);
+    ns->exp_open_zones = &ns->zone_meta->exp_open_zones;
+    ns->imp_open_zones = &ns->zone_meta->imp_open_zones;
+    ns->closed_zones = &ns->zone_meta->closed_zones;
+    ns->full_zones = &ns->zone_meta->full_zones;
+
+    if (n->params.zd_extension_size) {
+        ns->zd_extensions = (uint8_t *)(ns->zone_meta + 1);
+        ns->zd_extensions += n->zone_array_size;
+    }
+
+    return 0;
+}
+
+static void nvme_sync_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
+                                NvmeZone *zone, int len)
+{
+    uintptr_t z = (uintptr_t)zone, off = z - (uintptr_t)ns->zone_meta;
+
+    memory_region_msync(&n->zone_mem, off, len);
+
+    if (nvme_zone_meta_dirty(n, ns)) {
+        nvme_set_zone_meta_dirty(n, ns, false);
+        memory_region_msync(&n->zone_mem, 0, sizeof(NvmeZoneMeta));
+    }
+}
+
+/*
+ * Close or finish all the zones that might be still open after power-down.
+ */
+static void nvme_prepare_zones(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    NvmeZone *zone;
+    uint32_t set_state;
+    int i;
+
+    assert(!ns->nr_active_zones);
+    assert(!ns->nr_open_zones);
+
+    zone = ns->zone_array;
+    for (i = 0; i < n->num_zones; i++, zone++) {
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            nvme_remove_zone(n, ns, ns->imp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_remove_zone(n, ns, ns->exp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_inc_active(n, ns);
+            /* fall through */
+        default:
+            continue;
+        }
+
+        if (zone->d.za & NVME_ZA_ZD_EXT_VALID) {
+            set_state = NVME_ZONE_STATE_CLOSED;
+        } else if (zone->d.wp == zone->d.zslba) {
+            set_state = NVME_ZONE_STATE_EMPTY;
+        } else if (n->params.max_active_zones == 0 ||
+                   ns->nr_active_zones < n->params.max_active_zones) {
+            set_state = NVME_ZONE_STATE_CLOSED;
+        } else {
+            set_state = NVME_ZONE_STATE_FULL;
+        }
+
+        switch (set_state) {
+        case NVME_ZONE_STATE_CLOSED:
+            trace_pci_nvme_power_on_close(nvme_get_zone_state(zone),
+                                          zone->d.zslba);
+            nvme_aor_inc_active(n, ns);
+            nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
+            break;
+        case NVME_ZONE_STATE_EMPTY:
+            trace_pci_nvme_power_on_reset(nvme_get_zone_state(zone),
+                                          zone->d.zslba);
+            break;
+        case NVME_ZONE_STATE_FULL:
+            trace_pci_nvme_power_on_full(nvme_get_zone_state(zone),
+                                         zone->d.zslba);
+            zone->d.wp = nvme_zone_wr_boundary(zone);
+        }
+
+        zone->w_ptr = zone->d.wp;
+        nvme_set_zone_state(zone, set_state);
+    }
+}
+
+static int nvme_load_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
+                               uint64_t capacity, bool init_meta)
+{
+    int ret = 0;
+
+    if (n->params.zone_file) {
+        ret = nvme_map_zone_file(n, ns, &init_meta);
+        trace_pci_nvme_mapped_zone_file(n->params.zone_file, ret);
+        if (ret < 0) {
+            return ret;
+        }
+
+        if (!init_meta) {
+            ret = nvme_validate_zone_file(n, ns, capacity);
+            if (ret) {
+                trace_pci_nvme_err_zone_file_invalid(ret);
+                init_meta = true;
+            }
+        }
+    } else {
+        init_meta = true;
+    }
+
+    if (init_meta) {
+        ret = nvme_init_zone_file(n, ns, capacity);
+    } else {
+        nvme_prepare_zones(n, ns);
+    }
+    if (!ret && n->params.zone_file) {
+        nvme_sync_zone_file(n, ns, ns->zone_array, n->zone_array_size);
+    }
+
+    return ret;
+}
+
+static void nvme_zoned_init_ctrl(NvmeCtrl *n, bool *init_meta, Error **errp)
 {
     uint64_t zone_size, zone_cap;
     uint32_t nz;
@@ -3501,6 +3819,10 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
         n->zone_size_log2 = 63 - clz64(n->zone_size);
     }
 
+    n->meta_size = sizeof(NvmeZoneMeta) + n->zone_array_size +
+                          nz * n->params.zd_extension_size;
+    n->meta_size = ROUND_UP(n->meta_size, qemu_real_host_page_size);
+
     if (!n->params.zasl_kb) {
         n->zasl_bs = n->params.mdts ? 0 : NVME_DEFAULT_MAX_ZA_SIZE * KiB;
     } else {
@@ -3540,17 +3862,22 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
         }
     }
 
+    if (n->params.zone_file) {
+        nvme_open_zone_file(n, init_meta, errp);
+    }
+
     return;
 }
 
 static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
-                              Error **errp)
+                              bool init_meta, Error **errp)
 {
     int ret;
 
-    ret = nvme_init_zone_meta(n, ns, n->num_zones * n->zone_size);
+    ret = nvme_load_zone_meta(n, ns, n->num_zones * n->zone_size,
+                              init_meta);
     if (ret) {
-        error_setg(errp, "could not init zone metadata");
+        error_setg(errp, "could not load/init zone metadata");
         return -1;
     }
 
@@ -3579,15 +3906,20 @@ static void nvme_zoned_clear(NvmeCtrl *n)
 {
     int i;
 
+    if (n->params.zone_file)  {
+        close(n->zone_file_fd);
+    }
     for (i = 0; i < n->num_namespaces; i++) {
         NvmeNamespace *ns = &n->namespaces[i];
         g_free(ns->id_ns_zoned);
-        g_free(ns->zone_array);
-        g_free(ns->exp_open_zones);
-        g_free(ns->imp_open_zones);
-        g_free(ns->closed_zones);
-        g_free(ns->full_zones);
-        g_free(ns->zd_extensions);
+        if (!n->params.zone_file) {
+            g_free(ns->zone_array);
+            g_free(ns->exp_open_zones);
+            g_free(ns->imp_open_zones);
+            g_free(ns->closed_zones);
+            g_free(ns->full_zones);
+            g_free(ns->zd_extensions);
+        }
     }
 }
 
@@ -3676,7 +4008,8 @@ static void nvme_init_blk(NvmeCtrl *n, Error **errp)
     n->ns_size = bs_size;
 }
 
-static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, bool init_meta,
+                                Error **errp)
 {
     NvmeIdNs *id_ns = &ns->id_ns;
     int lba_index;
@@ -3690,7 +4023,7 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     if (n->params.zoned) {
         ns->csi = NVME_CSI_ZONED;
         id_ns->ncap = cpu_to_le64(n->zone_capacity * n->num_zones);
-        if (nvme_zoned_init_ns(n, ns, lba_index, errp) != 0) {
+        if (nvme_zoned_init_ns(n, ns, lba_index, init_meta, errp) != 0) {
             return;
         }
     } else {
@@ -3874,6 +4207,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     NvmeCtrl *n = NVME(pci_dev);
     NvmeNamespace *ns;
     Error *local_err = NULL;
+    bool init_meta = false;
 
     int i;
 
@@ -3897,7 +4231,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     }
 
     if (n->params.zoned) {
-        nvme_zoned_init_ctrl(n, &local_err);
+        nvme_zoned_init_ctrl(n, &init_meta, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -3908,7 +4242,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     ns = n->namespaces;
     for (i = 0; i < n->num_namespaces; i++, ns++) {
         ns->nsid = i + 1;
-        nvme_init_namespace(n, ns, &local_err);
+        nvme_init_namespace(n, ns, init_meta, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -3956,6 +4290,7 @@ static Property nvme_props[] = {
                        NVME_DEFAULT_ZONE_SIZE),
     DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity_mb, 0),
     DEFINE_PROP_UINT32("zone_append_size_limit", NvmeCtrl, params.zasl_kb, 0),
+    DEFINE_PROP_STRING("zone_file", NvmeCtrl, params.zone_file),
     DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeCtrl,
                        params.zd_extension_size, 0),
     DEFINE_PROP_UINT32("max_active", NvmeCtrl, params.max_active_zones, 0),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index dea0c12792..021c049e1f 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -18,6 +18,7 @@ typedef struct NvmeParams {
 
     bool        zoned;
     bool        cross_zone_read;
+    char        *zone_file;
     uint8_t     fill_pattern;
     uint32_t    zasl_kb;
     uint64_t    zone_size_mb;
@@ -95,6 +96,27 @@ typedef struct NvmeZoneList {
     uint8_t         rsvd12[4];
 } NvmeZoneList;
 
+#define NVME_ZONE_META_MAGIC 0x3aebaa70
+#define NVME_ZONE_META_VER  1
+
+typedef struct NvmeZoneMeta {
+    uint32_t        magic;
+    uint32_t        version;
+    uint64_t        zone_size;
+    uint64_t        zone_capacity;
+    uint32_t        nr_offline_zones;
+    uint32_t        nr_rdonly_zones;
+    uint32_t        lba_size;
+    uint32_t        rsvd40;
+    NvmeZoneList    exp_open_zones;
+    NvmeZoneList    imp_open_zones;
+    NvmeZoneList    closed_zones;
+    NvmeZoneList    full_zones;
+    uint8_t         zd_extension_size;
+    uint8_t         dirty;
+    uint8_t         rsvd594[3990];
+} NvmeZoneMeta;
+
 typedef struct NvmeNamespace {
     NvmeIdNs        id_ns;
     uint32_t        nsid;
@@ -104,6 +126,7 @@ typedef struct NvmeNamespace {
 
     NvmeIdNsZoned   *id_ns_zoned;
     NvmeZone        *zone_array;
+    NvmeZoneMeta    *zone_meta;
     NvmeZoneList    *exp_open_zones;
     NvmeZoneList    *imp_open_zones;
     NvmeZoneList    *closed_zones;
@@ -169,8 +192,10 @@ typedef struct NvmeCtrl {
     QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
     int         aer_queued;
 
+    MemoryRegion    zone_mem;
     int             zone_file_fd;
     uint32_t        num_zones;
+    size_t          meta_size;
     uint64_t        zone_size;
     uint64_t        zone_capacity;
     uint64_t        zone_array_size;
@@ -279,4 +304,17 @@ static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
     return &ns->zone_array[z->next];
 }
 
+static inline bool nvme_zone_meta_dirty(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    return n->params.zone_file ? ns->zone_meta->dirty : false;
+}
+
+static inline void nvme_set_zone_meta_dirty(NvmeCtrl *n, NvmeNamespace *ns,
+                                            bool yesno)
+{
+    if (n->params.zone_file) {
+        ns->zone_meta->dirty = yesno;
+    }
+}
+
 #endif /* HW_NVME_H */
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 53c7a2fd1f..7ee983b9df 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -100,6 +100,7 @@ pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_
 pci_nvme_power_on_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
 pci_nvme_power_on_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
 pci_nvme_power_on_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Full state"
+pci_nvme_initialized_zone_file(char *zfile_name) "initialized zone file %s"
 pci_nvme_mapped_zone_file(char *zfile_name, int ret) "mapped zone file %s, error %d"
 
 # nvme traces for error conditions
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 14/14] hw/block/nvme: Document zoned parameters in usage text
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (12 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 13/14] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
@ 2020-09-23 18:20 ` Dmitry Fomichev
  2020-09-24 21:07 ` [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Klaus Jensen
  14 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-23 18:20 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Added brief descriptions of the new device properties that are now
available to users to configure the features of the Zoned Namespace
Command Set in the emulator.

This patch is for documentation only, no functionality change.
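
For quick reference, an illustrative invocation that exercises most
of the documented zoned parameters (the drive name, serial and all
values below are examples only, not defaults):

    -device nvme,drive=nvme0,serial=deadbeef,zoned=true,zone_size=128,zone_capacity=96,max_active=32,max_open=16,zone_descr_ext_size=64,cross_zone_read=false,fill_pattern=0xff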

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 43 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5f55e86a9a..f1cadc76fb 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.4, 1.3, 1.2, 1.1, 1.0e
  *
  *  https://nvmexpress.org/developers/nvme-specification/
  */
@@ -22,7 +22,7 @@
  *              [pmrdev=<mem_backend_file_id>,] \
  *              max_ioqpairs=<N[optional]>, \
  *              aerl=<N[optional]>, aer_max_queued=<N[optional]>, \
- *              mdts=<N[optional]>
+ *              mdts=<N[optional]>, zoned=<true|false[optional]>
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -48,6 +48,45 @@
 *   completion when there are no outstanding AERs. When the maximum number of
  *   enqueued events are reached, subsequent events will be dropped.
  *
+ * Setting `zoned` to true makes the device support zoned namespaces.
+ * In this case, the following options are available to configure zoned
+ * operation:
+ *     zone_size=<zone size in MiB, default: 128MiB>
+ *
+ *     zone_capacity=<zone capacity in MiB, default: zone_size>
+ *         The value 0 (default) forces zone capacity to be the same as zone
+ *         size. The value of this property may not exceed zone size.
+ *
+ *     zone_file=<zone metadata file name, default: none>
+ *         Zone metadata file, if specified, allows zone information
+ *         to be persistent across shutdowns and restarts.
+ *
+ *     zone_descr_ext_size=<zone descriptor extension size, default 0>
+ *         This value needs to be specified in 64B units. If it is zero,
+ *         namespace(s) will not support zone descriptor extensions.
+ *
+ *     max_active=<Maximum Active Resources (zones), default: 0 - no limit>
+ *
+ *     max_open=<Maximum Open Resources (zones), default: 0 - no limit>
+ *
+ *     zone_append_size_limit=<zone append size limit, in KiB, default: MDTS>
+ *         The maximum I/O size that can be supported by the Zone Append
+ *         command. Since internally this value is maintained as
+ *         ZASL = log2(<maximum append size> / <page size>), some
+ *         values assigned to this property may be rounded down and
+ *         result in a lower maximum ZA data size being in effect.
+ *         If the MDTS property is not assigned, the default value of
+ *         128KiB is used.
+ *
+ *     offline_zones=<the number of offline zones to inject, default: 0>
+ *
+ *     rdonly_zones=<the number of read-only zones to inject, default: 0>
+ *
+ *     cross_zone_read=<enables Read Across Zone Boundaries, default: true>
+ *
+ *     fill_pattern=<data fill pattern, default: 0x00>
+ *         The byte pattern to return for any portions of unwritten data
+ *         during read.
  */
 
 #include "qemu/osdep.h"
-- 
2.21.0
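
For reference, a worked example of the ZASL rounding described in the
usage text above (illustrative numbers, assuming the controller's
default 4 KiB page size): zone_append_size_limit=128 yields
ZASL = log2(128 KiB / 4 KiB) = log2(32) = 5, i.e. Zone Append payloads
of up to 2^5 pages. A hypothetical invocation combining several of the
new knobs, in the same form as the usage text, could look like:

     -device nvme,drive=<drive_id>,serial=<serial>,zoned=true, \
             zone_size=128,zone_capacity=96,max_open=16,max_active=32, \
             zone_append_size_limit=128,cross_zone_read=false

(All values above are made up for illustration; defaults are as
documented in the usage text.)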




* Re: [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-09-23 18:20 ` [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces Dmitry Fomichev
@ 2020-09-24 12:12   ` Klaus Jensen
  2020-09-24 18:17     ` Niklas Cassel
  0 siblings, 1 reply; 42+ messages in thread
From: Klaus Jensen @ 2020-09-24 12:12 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 24 03:20, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> In NVMe, a namespace is active if it exists and is attached to the
> controller.
> 
> CAP.CSS (together with the I/O Command Set data structure) defines what
> command sets are supported by the controller.
> 
> CC.CSS (together with Set Profile) can be set to enable a subset of the
> available command sets. The namespaces belonging to a disabled command set
> will not be able to attach to the controller, and will thus be inactive.
> 
> E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> marked as inactive.
> 

Hmm. I'm not convinced that this is correct. Can you reference the spec?

On the specific case you mention the spec is actually pretty clear:

  "When only the Admin Command Set is supported, any command submitted
  on an I/O Submission Queue and any I/O Command Set Specific Admin
  command submitted on the Admin Submission Queue is completed with
  status Invalid Command Opcode."

My /interpretation/ (because the spec is vague on this point) is that
with TP 4056, if the host writes 0x0 to CC.CSS, you will (should) just
see Invalid Command Opcode for namespaces not supporting the NVM command
set since we are operating in a backward compatible way.

Now, if the host sets CC.CSS to 0x6, then it is obviously aware of
namespaces and other rules apply. For instance, it may set the I/O
Command Set Combination Index through a Set Features command, but TP
4056 is clear that the host will not be allowed to choose a combination
that leaves an attached namespace unsupported.

For this device, that does not implement namespace management and thus
has no notion of attaching/detaching namespaces, the controller should
by default choose an I/O Command Set Combination that indicates support
for all I/O command sets that are required to support the namespaces
configured.
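
Something along these lines (a purely illustrative sketch, not actual
QEMU code; the function and field names are invented):

    /*
     * Build the default I/O Command Set Combination vector from the
     * CSIs of the configured namespaces, so that no namespace is left
     * without a supported command set.
     */
    static uint32_t nvme_default_iocs_vector(NvmeCtrl *n)
    {
        uint32_t vector = 0;
        int i;

        for (i = 0; i < n->num_namespaces; i++) {
            /* bit 0 = NVM Command Set, bit 2 = Zoned Namespace */
            vector |= 1u << n->namespaces[i].csi;
        }

        return vector;
    }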



* Re: [PATCH v4 01/14] hw/block/nvme: Report actual LBA data shift in LBAF
  2020-09-23 18:20 ` [PATCH v4 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
@ 2020-09-24 12:12   ` Klaus Jensen
  0 siblings, 0 replies; 42+ messages in thread
From: Klaus Jensen @ 2020-09-24 12:12 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 24 03:20, Dmitry Fomichev wrote:
> Calculate the data shift value to report based on the set value of
> logical_block_size device property.
> 
> In the process, use a local variable to calculate the LBA format
> index instead of the hardcoded value 0. This makes the code more
> readable and it will make it easier to add support for multiple LBA
> formats in the future.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Yeah, using the standard approach of the logical_block_size parameter is
probably preferable to an 'lbads' parameter as I've been doing.
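
I.e., something like this (a sketch; exact variable names aside):

    /* derive the reported LBA data shift from the configured block
     * size instead of hardcoding LBADS */
    id_ns->lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);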

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>



* Re: [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-09-24 12:12   ` Klaus Jensen
@ 2020-09-24 18:17     ` Niklas Cassel
  2020-09-24 18:55       ` Klaus Jensen
  0 siblings, 1 reply; 42+ messages in thread
From: Niklas Cassel @ 2020-09-24 18:17 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

On Thu, Sep 24, 2020 at 02:12:03PM +0200, Klaus Jensen wrote:
> On Sep 24 03:20, Dmitry Fomichev wrote:
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> > 
> > E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> > marked as inactive.
> > 
> 
> Hmm. I'm not convinced that this is correct. Can you reference the spec?
> 

"If the user sets CC.CSS to Admin Only, NVM namespaces should be marked as inactive."

After reading the spec several times, I agree that this statement is false,
although I really wish it wasn't :)



My interpretation was based on:

From CC.CSS:
"If bit 44 is set to ‘1’ in the Command Sets Supported (CSS) field, then the
value 111b indicates that only the Admin Command Set is supported and that no
I/O Command Set or I/O Command Set Specific Admin commands are supported."

The NVM Command Set is a Command Set, so I assumed that since the Command
Set was not supported, trying to do something like CNS 00h (Identify Namespace),
would return a zero-filled struct. (Since the namespace belongs to a
Command Set that is not supported.)

To me it seems silly that a namespace can be "active" while the
Command Set that the namespace belongs to is not enabled,
but that seems to be how the spec is written...


Additionally in TP4056 section 5.19 it says:
"If an attempt is made to attach a namespace to a controller that supports the
corresponding I/O Command Set and the corresponding I/O Command Set is not
enabled by the I/O Command Set profile feature, then the command shall be
aborted with a status of I/O Command Set Not Enabled."

But if you already have e.g. a Key Value namespace attached, and boot up with
e.g. CC.CSS = NVM, or CC.CSS = Admin, you will have a namespace belonging to a
disabled command set attached (and "active" :p).
But if you have no namespaces attached and boot up with CC.CSS = NVM or
CC.CSS = Admin, you will not be allowed to attach your Key Value namespace.
(5.19 says nothing about CC.CSS, but let's assume it isn't supposed to be
allowed.)
Will you be allowed to attach a ZNS namespace? (Since CC.CSS != ALL Selected,
I would assume that it is not supposed to be allowed.)
Now, will you be allowed to attach an NVM namespace when CC.CSS = Admin?
(I assume that the intention is that you should.)


CC.CSS can only be changed when the controller is disabled.
I assume that you can attach NVM namespaces even when CC.CSS = Admin.
(Attaching namespaces != NVM is probably not allowed when
CC.CSS != ALL Selected.)
Namespace attachments are persistent. Even if you restart with CC.CSS = Admin,
they will still be attached, and active.
(Although you might not be able to do anything with them... see the next
section.)
You cannot change to a profile (using Set Profile) that lacks a command set
that you are currently using. This change is while the controller is enabled,
so I guess it is ok that this check is stricter than CC.CSS.

Things would have been way simpler if the controller just detached
non-supported namespaces internally at power on...


> My /interpretation/ (because the spec is vague on this point) is that
> with TP 4056, if the host writes 0x0 to CC.CSS, you will (should) just
> see Invalid Command Opcode for namespaces not supporting the NVM command
> set since we are operating in a backward compatible way.

If a controller is booted with CC.CSS = NVM Command Set:

We have an attached Key Value namespace. (It will be set as active.)
I think that we can agree that any IOCS Key Value specific command
will fail.

We have an attached Zoned Namespace. (It will be set as active.)
Sure, any IOCS Zoned Namespace specific command will fail.
Your interpretation is that it will allow NVM Commands, since a Zoned Namespace
implements the opcodes of the NVM Command Set.

Each Command Set has a table with the opcodes that it supports.
In e.g. Key Value Command Set, opcode 01h means "Store command"
but in NVM Command Set, opcode 01h means "Write command".

These commands are totally different, and use completely different dwords.
So I don't think that it is obvious that CC.CSS = NVM, should allow
"NVM reserved opcodes" for certain Command Sets, but not in others.

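To make the collision concrete (opcode values as described above; the
identifier names are invented for illustration):

    /* the same opcode encodes a different command depending on which
     * I/O Command Set the namespace uses */
    enum {
        NVME_NVM_CMD_WRITE = 0x01,  /* NVM Command Set: Write       */
        NVME_KV_CMD_STORE  = 0x01,  /* Key Value Command Set: Store */
    };
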
Like so many other things in NVMe, this will probably be vendor specific.
For devices only supporting NVM + ZNS, it will probably support NVM
Command Set commands even when CC.CSS = NVM, but it's probably not
something that can be guarantee to be true for all devices supporting
a Command Set that extends the NVM Command Set.


Kind regards,
Niklas


* Re: [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-09-24 18:17     ` Niklas Cassel
@ 2020-09-24 18:55       ` Klaus Jensen
  2020-09-24 19:40         ` Niklas Cassel
  0 siblings, 1 reply; 42+ messages in thread
From: Klaus Jensen @ 2020-09-24 18:55 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Klaus Jensen,
	qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 24 18:17, Niklas Cassel wrote:
> On Thu, Sep 24, 2020 at 02:12:03PM +0200, Klaus Jensen wrote:
> > On Sep 24 03:20, Dmitry Fomichev wrote:
> > > From: Niklas Cassel <niklas.cassel@wdc.com>
> > > 
> > > E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> > > marked as inactive.
> > > 
> > 
> > Hmm. I'm not convinced that this is correct. Can you reference the spec?
> > 
> 
> CC.CSS can only be changed when the controller is disabled.

Right. I think I see your point. While the controller is disabled, the
host obviously cannot even see what namespaces are available, so the
controller is free to only expose (aka, attach) the namespaces that
make sense for the value of CC.CSS.

OK then, the patch is good :)



* Re: [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-09-24 18:55       ` Klaus Jensen
@ 2020-09-24 19:40         ` Niklas Cassel
  0 siblings, 0 replies; 42+ messages in thread
From: Niklas Cassel @ 2020-09-24 19:40 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Klaus Jensen,
	qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Thu, Sep 24, 2020 at 08:55:24PM +0200, Klaus Jensen wrote:
> On Sep 24 18:17, Niklas Cassel wrote:
> > On Thu, Sep 24, 2020 at 02:12:03PM +0200, Klaus Jensen wrote:
> > > On Sep 24 03:20, Dmitry Fomichev wrote:
> > > > From: Niklas Cassel <niklas.cassel@wdc.com>
> > > > 
> > > > E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> > > > marked as inactive.
> > > > 
> > > 
> > > Hmm. I'm not convinced that this is correct. Can you reference the spec?
> > > 
> > 
> > CC.CSS can only be changed when the controller is disabled.
> 
> Right. I think I see your point. While the controller is disabled, the
> host obviously cannot even see what namespaces are available, so the
> controller is free to only expose (aka, attach) the namespaces that
> make sense for the value of CC.CSS.
> 
> OK then, the patch is good :)

That was my thought, that the controller internally would
detach unsupported namespaces (even if the controller didn't
expose namespace management capabilities to the user).

This was how I assumed that things worked, but if we are to
follow the spec strictly, we should do as you suggested
and keep them attached, returning the proper error code
on non-admin commands.

Thank you for improving my understanding.
Considering that CC.CSS can only be changed when the controller
is disabled, I still kind of wished that the spec said that
unsupported namespaces would be automatically detached.


Kind regards,
Niklas


* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (13 preceding siblings ...)
  2020-09-23 18:20 ` [PATCH v4 14/14] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
@ 2020-09-24 21:07 ` Klaus Jensen
  2020-09-28  2:33   ` Dmitry Fomichev
  14 siblings, 1 reply; 42+ messages in thread
From: Klaus Jensen @ 2020-09-24 21:07 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 24 03:20, Dmitry Fomichev wrote:
> v3 -> v4
> 
>  - Fix bugs introduced in v2/v3 for QD > 1 operation. Now, all writes
>    to a zone happen at the new write pointer variable, zone->w_ptr,
>    that is advanced right after submitting the backend i/o. The existing
>    zone->d.wp variable is updated upon the successful write completion
>    and it is used for zone reporting. Some code has been split from
>    nvme_finalize_zoned_write() function to a new function,
>    nvme_advance_zone_wp().
> 

Same approach that I've used, +1.
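
I.e., roughly (illustrative):

    /* at submission time, in the i/o path: */
    slba = zone->w_ptr;
    zone->w_ptr += nlb;    /* the next queued write (QD > 1) lands here,
                            * before this one has completed */

    /* at successful completion, in nvme_advance_zone_wp(): */
    zone->d.wp += nlb;     /* the write pointer reported to the host */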

>  - Make the code compile under mingw. Switch to using QEMU API for
>    mmap/msync, i.e. memory_region...(). Since mmap is not available in
>    mingw (even though there is mman-win32 library available on Github),
>    conditional compilation is added around these calls to avoid
>    undefined symbols under mingw. A better fix would be to add stub
>    functions to softmmu/memory.c for the case when CONFIG_POSIX is not
>    defined, but such change is beyond the scope of this patchset and it
>    can be made in a separate patch.
> 

Ewwww.

This feels like a hack or at the very least an abuse of the physical
memory management API.

If it really needs to be memory mapped, then I think a hostmem-based
approach similar to what Andrzej did for PMR is needed (I think that
will get rid of the CONFIG_POSIX ifdef at least, but still leave it
slightly tricky to get it to work on all platforms AFAIK). But really,
since we do not require memory semantics for this, then I think the
abstraction is fundamentally wrong.

I am, of course, blowing my own horn, since my implementation uses a
portable blockdev for this.

Another issue is the complete lack of endian conversions. Does it
matter? It depends. Will anyone ever use this on a big endian host and
move the meta data backing file to a little endian host? Probably not.
So does it really matter? Probably not, but it is cutting corners.

>  - Make the list of review comments addressed in v2 of the series
>    (see below).
> 

Very detailed! Thanks!



* Re: [PATCH v4 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-23 18:20 ` [PATCH v4 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
@ 2020-09-25 18:24   ` Klaus Jensen
  0 siblings, 0 replies; 42+ messages in thread
From: Klaus Jensen @ 2020-09-25 18:24 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 24 03:20, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c         |    2 +-
>  hw/block/nvme.c      | 1057 ++++++++++++++++++++++++++++++++++++++++--
>  include/block/nvme.h |    6 +-
>  3 files changed, 1026 insertions(+), 39 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -682,11 +1005,77 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
>          return status;
>      }
>  
> +    if (n->params.zoned) {
> +        zone_idx = nvme_zone_idx(n, slba);
> +        assert(zone_idx < n->num_zones);
> +        zone = &ns->zone_array[zone_idx];
> +
> +        if (is_write) {
> +            status = nvme_check_zone_write(zone, slba, nlb);
> +            if (status != NVME_SUCCESS) {
> +                trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
> +                return status | NVME_DNR;
> +            }
> +
> +            assert(nvme_wp_is_valid(zone));
> +            if (append) {
> +                if (unlikely(slba != zone->d.zslba)) {
> +                    trace_pci_nvme_err_append_not_at_start(slba, zone->d.zslba);
> +                    return NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +                }
> +                if (data_size > (n->page_size << n->zasl)) {
> +                    trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
> +                    return NVME_INVALID_FIELD | NVME_DNR;
> +                }
> +                slba = zone->w_ptr;
> +            } else if (unlikely(slba != zone->w_ptr)) {
> +                trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
> +                                                   zone->w_ptr);
> +                return NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +            }
> +            req->fill_ofs = -1LL;
> +        } else {
> +            status = nvme_check_zone_read(n, zone, slba, nlb,
> +                                          n->params.cross_zone_read);
> +            if (status != NVME_SUCCESS) {
> +                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
> +                return status | NVME_DNR;
> +            }
> +
> +            if (slba + nlb > zone->w_ptr) {
> +                /*
> +                 * All or some data is read above the WP. Need to
> +                 * fill out the buffer area that has no backing data
> +                 * with a predefined data pattern (zeros by default)
> +                 */
> +                if (slba >= zone->w_ptr) {
> +                    req->fill_ofs = 0;
> +                } else {
> +                    req->fill_ofs = ((zone->w_ptr - slba) << data_shift);
> +                }

If Read Across Zone Boundaries is enabled and the read in zone A
includes LBAs above the write pointer but crosses into a full zone
(zone B), then you are going to overwrite the valid data in zone B with
the fill pattern.
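
One way to express the point (a rough sketch only, not a tested fix;
fill_len is a hypothetical companion to fill_ofs, and this ignores
anything past the first zone boundary):

    /* the fill region must stop at the boundary of the zone that
     * contains slba, not run to the end of the request */
    uint64_t fill_start = MAX(slba, zone->w_ptr);
    uint64_t zone_end = zone->d.zslba + zone->d.zcap;
    uint64_t fill_end = MIN(slba + nlb, zone_end);

    req->fill_ofs = (fill_start - slba) << data_shift;
    req->fill_len = (fill_end - fill_start) << data_shift;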



* RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-24 21:07 ` [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Klaus Jensen
@ 2020-09-28  2:33   ` Dmitry Fomichev
  2020-09-28  6:36     ` Klaus Jensen
  0 siblings, 1 reply; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:33 UTC (permalink / raw)
  To: Klaus Jensen, Keith Busch, Damien Le Moal
  Cc: Kevin Wolf, Fam Zheng, qemu-block, Niklas Cassel, Klaus Jensen,
	qemu-devel, Maxim Levitsky, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Thursday, September 24, 2020 5:08 PM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 24 03:20, Dmitry Fomichev wrote:
> > v3 -> v4
> >
> >  - Fix bugs introduced in v2/v3 for QD > 1 operation. Now, all writes
> >    to a zone happen at the new write pointer variable, zone->w_ptr,
> >    that is advanced right after submitting the backend i/o. The existing
> >    zone->d.wp variable is updated upon the successful write completion
> >    and it is used for zone reporting. Some code has been split from
> >    nvme_finalize_zoned_write() function to a new function,
> >    nvme_advance_zone_wp().
> >
> 
> Same approach that I've used, +1.
> 
> >  - Make the code compile under mingw. Switch to using QEMU API for
> >    mmap/msync, i.e. memory_region...(). Since mmap is not available in
> >    mingw (even though there is mman-win32 library available on Github),
> >    conditional compilation is added around these calls to avoid
> >    undefined symbols under mingw. A better fix would be to add stub
> >    functions to softmmu/memory.c for the case when CONFIG_POSIX is not
> >    defined, but such change is beyond the scope of this patchset and it
> >    can be made in a separate patch.
> >
> 
> Ewwww.
> 
> This feels like a hack or at the very least an abuse of the physical
> memory management API.
> 
> If it really needs to be memory mapped, then I think a hostmem-based
> approach similar to what Andrzej did for PMR is needed (I think that
> will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> slightly tricky to get it to work on all platforms AFAIK).

Ok, it looks like using the HostMemoryBackendFile backend will be
more appropriate. This will remove the need for conditional compilation.
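
Roughly along these lines (an illustrative sketch; the pstate_hostmem
link property is made up):

    /* obtain a pointer into the file-backed region instead of calling
     * mmap()/msync() directly */
    MemoryRegion *mr = host_memory_backend_get_memory(n->pstate_hostmem);

    ns->zone_meta = memory_region_get_ram_ptr(mr);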

The mmap() portability is pretty decent across software platforms.
Any poor Windows user who is forced to emulate ZNS on mingw will be
able to do so, just without having zone state persistency. Considering
how specialized this stuff is in the first place, I estimate the number of users
affected by this "limitation" to be exactly zero.

> But really,
> since we do not require memory semantics for this, then I think the
> abstraction is fundamentally wrong.
> 

Seriously, what is wrong with using mmap :) ? It is used successfully for
similar applications, for example -
https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c

> I am, of course, blowing my own horn, since my implementation uses a
> portable blockdev for this.
> 

You are making it sound like the entire WDC series relies on this approach.
Actually, the persistency is introduced in the second to last patch in the
series and it only adds a couple of lines of code in the i/o path to mark
zones dirty. This is possible because of using mmap() and I find the way
it is done to be quite elegant, not ugly :)

> Another issue is the complete lack of endian conversions. Does it
> matter? It depends. Will anyone ever use this on a big endian host and
> move the meta data backing file to a little endian host? Probably not.
> So does it really matter? Probably not, but it is cutting corners.
> 

Great point on endianness! Naturally, all file backed values are stored in
their native endianness. This way, there is no extra overhead on big endian
hardware architectures. Portability concerns can be easily addressed by
storing metadata endianness as a byte flag in its header. Then, during
initialization, the metadata validation code can detect the possible
discrepancy in endianness and automatically convert the metadata to the
endianness of the host. This part is out of scope of this series, but I would
be able to contribute such a solution as an enhancement in the future.
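
For instance (a sketch of the idea; the names are invented):

    /* one byte in the metadata header records the endianness the
     * file was written with */
    #define NVME_ZONE_META_LE 0
    #define NVME_ZONE_META_BE 1

    static bool nvme_zone_meta_needs_swap(uint8_t endian_flag)
    {
    #ifdef HOST_WORDS_BIGENDIAN
        return endian_flag == NVME_ZONE_META_LE;
    #else
        return endian_flag == NVME_ZONE_META_BE;
    #endif
    }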

> >  - Make the list of review comments addressed in v2 of the series
> >    (see below).
> >
> 
> Very detailed! Thanks!


* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-28  2:33   ` Dmitry Fomichev
@ 2020-09-28  6:36     ` Klaus Jensen
  2020-09-28 21:25       ` Keith Busch
  2020-09-29 15:42       ` Dmitry Fomichev
  0 siblings, 2 replies; 42+ messages in thread
From: Klaus Jensen @ 2020-09-28  6:36 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 28 02:33, Dmitry Fomichev wrote:
> > -----Original Message-----
> > From: Klaus Jensen <its@irrelevant.dk>
> >
> > If it really needs to be memory mapped, then I think a hostmem-based
> > approach similar to what Andrzej did for PMR is needed (I think that
> > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > slightly tricky to get it to work on all platforms AFAIK).
> 
> Ok, it looks like using the HostMemoryBackendFile backend will be
> more appropriate. This will remove the need for conditional compilation.
> 
> The mmap() portability is pretty decent across software platforms.
> Any poor Windows user who is forced to emulate ZNS on mingw will be
> able to do so, just without having zone state persistency. Considering
> how specialized this stuff is in the first place, I estimate the number of users
> affected by this "limitation" to be exactly zero.
> 

QEMU is a cross platform project - we should strive for portability.

Alienating developers that use a Windows platform and calling them out
as "poor" is not exactly good for the zoned ecosystem.

> > But really,
> > since we do not require memory semantics for this, then I think the
> > abstraction is fundamentally wrong.
> > 
> 
> Seriously, what is wrong with using mmap :) ? It is used successfully for
> similar applications, for example -
> https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> 

There is nothing fundamentally wrong with mmap. I just think it is the
wrong abstraction here (and it limits portability for no good reason).
For PMR there is a good reason - it requires memory semantics.

> > I am, of course, blowing my own horn, since my implementation uses a
> > portable blockdev for this.
> > 
> 
> You are making it sound like the entire WDC series relies on this approach.
> Actually, the persistency is introduced in the second to last patch in the
> series and it only adds a couple of lines of code in the i/o path to mark
> zones dirty. This is possible because of using mmap() and I find the way
> it is done to be quite elegant, not ugly :)
> 

No, I understand that your implementation works fine without
persistence, but persistence is key. That is why my series adds it in
the first patch. Without persistence it is just a toy. And the QEMU
device is not just an "NVMe-version" of null_blk.

And I don't think I ever called the use of mmap ugly. I called out the
physical memory API shenanigans as a hack.

> > Another issue is the complete lack of endian conversions. Does it
> > matter? It depends. Will anyone ever use this on a big endian host and
> > move the meta data backing file to a little endian host? Probably not.
> > So does it really matter? Probably not, but it is cutting corners.
> > 

After I had sent that reply, I considered a follow-up, because there are
probably QEMU developers that would call me out on this.

This definitely DOES matter to QEMU.

> 
> Great point on endianness! Naturally, all file backed values are stored in
> their native endianness. This way, there is no extra overhead on big endian
> hardware architectures. Portability concerns can be easily addressed by
> storing metadata endianness as a byte flag in its header. Then, during
> initialization, the metadata validation code can detect the possible
> discrepancy in endianness and automatically convert the metadata to the
> endianness of the host. This part is out of scope of this series, but I would
> be able to contribute such a solution as an enhancement in the future.
> 

It is not out of scope. I don't see why we should merge something that
is arguably buggy.

The bottom line is that I just don't see why we should accept an
implementation that

  a) excludes some platforms (Windows) from using persistence; and
  b) contains endianness conversion issues

when there is a portable implementation posted that at least tries to
convert endianness as needed.



* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-28  6:36     ` Klaus Jensen
@ 2020-09-28 21:25       ` Keith Busch
  2020-09-28 22:54         ` Damien Le Moal
  2020-09-29 15:42       ` Dmitry Fomichev
  1 sibling, 1 reply; 42+ messages in thread
From: Keith Busch @ 2020-09-28 21:25 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> On Sep 28 02:33, Dmitry Fomichev wrote:
> > You are making it sound like the entire WDC series relies on this approach.
> > Actually, the persistency is introduced in the second to last patch in the
> > series and it only adds a couple of lines of code in the i/o path to mark
> > zones dirty. This is possible because of using mmap() and I find the way
> > it is done to be quite elegant, not ugly :)
> > 
> 
> No, I understand that your implementation works fine without
> persistence, but persistence is key. That is why my series adds it in
> the first patch. Without persistence it is just a toy. And the QEMU
> device is not just an "NVMe-version" of null_blk.

I really think we should be a bit more cautious of committing to an
on-disk format for the persistent state. Both this and Klaus' persistent
state feel a bit ad-hoc, and with all the other knobs provided, it
looks too easy to have out-of-sync states, or just not being able to
boot at all if qemu versions have different on-disk formats.

Is anyone really considering zone emulation for production level stuff
anyway? I can't imagine a real scenario where you'd want to put yourself
through that: you are just giving yourself all the downsides of a zoned
block device and none of the benefits. AFAIK, this is provided as a
development vehicle, closer to a "toy".

I think we should consider trimming this down to a more minimal set that
we *do* agree on and commit for inclusion ASAP. We can iterate on all the
bells & whistles and flesh out the metadata's marshalling scheme
for persistence later.



* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-28 21:25       ` Keith Busch
@ 2020-09-28 22:54         ` Damien Le Moal
  2020-09-29 10:46           ` Klaus Jensen
  0 siblings, 1 reply; 42+ messages in thread
From: Damien Le Moal @ 2020-09-28 22:54 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Niklas Cassel,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Philippe Mathieu-Daudé,
	Matias Bjorling

On 2020/09/29 6:25, Keith Busch wrote:
> On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
>> On Sep 28 02:33, Dmitry Fomichev wrote:
>>> You are making it sound like the entire WDC series relies on this approach.
>>> Actually, the persistency is introduced in the second to last patch in the
>>> series and it only adds a couple of lines of code in the i/o path to mark
>>> zones dirty. This is possible because of using mmap() and I find the way
>>> it is done to be quite elegant, not ugly :)
>>>
>>
>> No, I understand that your implementation works fine without
>> persistence, but persistence is key. That is why my series adds it in
>> the first patch. Without persistence it is just a toy. And the QEMU
>> device is not just an "NVMe-version" of null_blk.
> 
> I really think we should be a bit more cautious of committing to an
> on-disk format for the persistent state. Both this and Klaus' persistent
> state feel a bit ad-hoc, and with all the other knobs provided, it
> looks too easy to have out-of-sync states, or just not being able to
> boot at all if qemu versions have different on-disk formats.
> 
> Is anyone really considering zone emulation for production level stuff
> anyway? I can't imagine a real scenario where you'd want to put yourself
> through that: you are just giving yourself all the downsides of a zoned
> block device and none of the benefits. AFAIK, this is provided as a
> development vehicle, closer to a "toy".
> 
> I think we should consider trimming this down to a more minimal set that
> we *do* agree on and commit for inclusion ASAP. We can iterate on all the
> bells & whistles and flesh out the metadata's marshalling scheme
> for persistence later.

+1 on this. Removing the persistence also removes the debate on endianness. With
that out of the way, it should be straightforward to get agreement on a series
that can be merged quickly to get developers started with testing ZNS software
with QEMU. That is the most important goal here. 5.9 is around the corner, we
need something for people to get started with ZNS quickly.


-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-28 22:54         ` Damien Le Moal
@ 2020-09-29 10:46           ` Klaus Jensen
  2020-09-29 11:13             ` Damien Le Moal
                               ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Klaus Jensen @ 2020-09-29 10:46 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Niklas Cassel, Klaus Jensen,
	qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 28 22:54, Damien Le Moal wrote:
> On 2020/09/29 6:25, Keith Busch wrote:
> > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> >> On Sep 28 02:33, Dmitry Fomichev wrote:
> >>> You are making it sound like the entire WDC series relies on this approach.
> >>> Actually, the persistency is introduced in the second to last patch in the
> >>> series and it only adds a couple of lines of code in the i/o path to mark
> >>> zones dirty. This is possible because of using mmap() and I find the way
> >>> it is done to be quite elegant, not ugly :)
> >>>
> >>
> >> No, I understand that your implementation works fine without
> >> persistence, but persistence is key. That is why my series adds it in
> >> the first patch. Without persistence it is just a toy. And the QEMU
> >> device is not just an "NVMe-version" of null_blk.
> > 
> > I really think we should be a bit more cautious of committing to an
> > on-disk format for the persistent state. Both this and Klaus' persistent
> > state feel a bit ad-hoc, and with all the other knobs provided, it
> > looks too easy to have out-of-sync states, or just not being able to
> > boot at all if qemu versions have different on-disk formats.
> > 
> > Is anyone really considering zone emulation for production level stuff
> > anyway? I can't imagine a real scenario where you'd want to put yourself
> > through that: you are just giving yourself all the downsides of a zoned
> > block device and none of the benefits. AFAIK, this is provided as a
> > development vehicle, closer to a "toy".
> > 
> > I think we should consider trimming this down to a more minimal set that
> > we *do* agree on and commit for inclusion ASAP. We can iterate on all the
> > bells & whistles and flesh out the metadata's marshalling scheme
> > for persistence later.
> 
> > +1 on this. Removing the persistence also removes the debate on endianness. With
> that out of the way, it should be straightforward to get agreement on a series
> that can be merged quickly to get developers started with testing ZNS software
> with QEMU. That is the most important goal here. 5.9 is around the corner, we
> need something for people to get started with ZNS quickly.
> 

Wait. What. No. Stop!

It is unmistakably clear that you are invalidating my arguments about
portability and endianness issues by suggesting that we just remove
persistent state and deal with it later, but persistence is the killer
feature that sets the QEMU emulated device apart from other emulation
options. It is not about using emulation in production (because yeah,
why would you?), but persistence is what makes it possible to develop
and test "zoned FTLs" or something that requires recovery at power up.
This is what allows testing of how your host software deals with opened
zones being transitioned to FULL on power up and the persistent tracking
of LBA allocation (in my series) can be used to properly test error
recovery if you lost state in the app.

Please, work with me on this instead of just removing such an essential
feature. Since persistence seems to be the only thing we are really
discussing, we should have plenty of time until the soft-freeze to come
up with a proper solution on that.

I agree that my version had a format that was pretty ad-hoc and that
won't fly - it needs magic and version capabilities like in Dmitry's
series, which incidentally looks a lot like what we did in the
OpenChannel implementation, so I agree with the strategy.

ZNS-wise, the only thing my implementation stores is the zone
descriptors (in spec-native little-endian format) and the zone
descriptor extensions. So there are no endian issues with those. The
allocation tracking bitmap is always stored in little endian, but
converted to big-endian if running on a big-endian host.
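
On load that amounts to something like (give or take variable names):

    /* the bitmap is little endian on disk; le64_to_cpu() is a no-op
     * on little-endian hosts */
    for (i = 0; i < bitmap_words; i++) {
        bitmap[i] = le64_to_cpu(bitmap[i]);
    }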

Let me just conjure something up.

    #define NVME_PSTATE_MAGIC ...
    #define NVME_PSTATE_V1    1

    typedef struct NvmePstateHeader {
        uint32_t magic;
        uint32_t version;

        uint64_t blk_len;           /* length of the backing device */

        uint8_t  lbads;             /* LBA data size (shift) */
        uint8_t  iocs;              /* I/O command set */

        uint8_t  rsvd18[3054];

        struct {
            uint64_t zsze;          /* zone size */
            uint8_t  zdes;          /* zone descriptor extension size */
        } QEMU_PACKED zns;

        uint8_t  rsvd3081[1015];    /* pad the header to 4096 bytes */
    } QEMU_PACKED NvmePstateHeader;

With such a header we have all we need. We can bail out if any
parameters do not match and, similar to nvme data structures, it contains
reserved areas for future use. I'll be posting a v2 with this. If this
still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
feature.
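
The load-time check would then bail out along these lines (a sketch;
variable names illustrative):

    /* refuse to restore a pstate that does not match the configured
     * device */
    if (le32_to_cpu(header.magic) != NVME_PSTATE_MAGIC ||
        le32_to_cpu(header.version) != NVME_PSTATE_V1) {
        error_setg(errp, "invalid or unsupported pstate header");
        return -1;
    }

    if (header.lbads != lbads || header.zns.zsze != zone_size) {
        error_setg(errp, "pstate does not match device parameters");
        return -1;
    }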

I can agree that we drop other optional features like zone excursions
and the reset/finish recommended limit simulation, but PLEASE DO NOT
remove persistence and upstream a half-baked version when we are so
close and have time to get it right.



* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 10:46           ` Klaus Jensen
@ 2020-09-29 11:13             ` Damien Le Moal
  2020-09-29 17:44               ` Keith Busch
  2020-09-29 15:43             ` Dmitry Fomichev
  2020-09-29 17:29             ` Keith Busch
  2 siblings, 1 reply; 42+ messages in thread
From: Damien Le Moal @ 2020-09-29 11:13 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Niklas Cassel, Klaus Jensen,
	qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On 2020/09/29 19:46, Klaus Jensen wrote:
> On Sep 28 22:54, Damien Le Moal wrote:
>> On 2020/09/29 6:25, Keith Busch wrote:
>>> On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
>>>> On Sep 28 02:33, Dmitry Fomichev wrote:
>>>>> You are making it sound like the entire WDC series relies on this approach.
>>>>> Actually, the persistency is introduced in the second to last patch in the
>>>>> series and it only adds a couple of lines of code in the i/o path to mark
>>>>> zones dirty. This is possible because of using mmap() and I find the way
>>>>> it is done to be quite elegant, not ugly :)
>>>>>
>>>>
>>>> No, I understand that your implementation works fine without
>>>> persistence, but persistence is key. That is why my series adds it in
>>>> the first patch. Without persistence it is just a toy. And the QEMU
>>>> device is not just an "NVMe-version" of null_blk.
>>>
>>> I really think we should be a bit more cautious of committing to an
>>> on-disk format for the persistent state. Both this and Klaus' persistent
>>> state feel a bit ad-hoc, and with all the other knobs provided, it
>>> looks too easy to have out-of-sync states, or just not being able to
>>> boot at all if qemu versions have different on-disk formats.
>>>
>>> Is anyone really considering zone emulation for production level stuff
>>> anyway? I can't imagine a real scenario where you'd want to put yourself
>>> through that: you are just giving yourself all the downsides of a zoned
>>> block device and none of the benefits. AFAIK, this is provided as a
>>> development vehicle, closer to a "toy".
>>>
>>> I think we should consider trimming this down to a more minimal set that
>>> we *do* agree on and commit for inclusion ASAP. We can iterate on all the
>>> bells & whistles and flesh out the metadata's marshalling scheme
>>> for persistence later.
>>
>> +1 on this. Removing the persistence also removes the debate on endianness. With
>> that out of the way, it should be straightforward to get agreement on a series
>> that can be merged quickly to get developers started with testing ZNS software
>> with QEMU. That is the most important goal here. 5.9 is around the corner, we
>> need something for people to get started with ZNS quickly.
>>
> 
> Wait. What. No. Stop!
> 
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.

I am not invalidating anything. I am in violent agreement with you about the
usefulness of persistence. My point was that I agree with Keith: let's first get
the base emulation in and improve on top of it. And the base emulation does not
need to include persistence and endianness of the saved zone metadata for now. The
result of this would still be super useful to have in stable.

Then let's add persistence and other bells and whistles on top (see below).

> Please, work with me on this instead of just removing such an essential
> feature. Since persistence seems to be the only thing we are really
> discussing, we should have plenty of time until the soft-freeze to come
> up with a proper solution on that.
> 
> I agree that my version had a format that was pretty ad-hoc and that
> won't fly - it needs magic and version capabilities like in Dmitry's
> series, which incidentally looks a lot like what we did in the
> OpenChannel implementation, so I agree with the strategy.
> 
> ZNS-wise, the only thing my implementation stores is the zone
> descriptors (in spec-native little-endian format) and the zone
> descriptor extensions. So there are no endian issues with those. The
> allocation tracking bitmap is always stored in little endian, but
> converted to big-endian if running on a big-endian host.
> 
> Let me just conjure something up.
> 
>     #define NVME_PSTATE_MAGIC ...
>     #define NVME_PSTATE_V1    1
> 
>     typedef struct NvmePstateHeader {
>         uint32_t magic;
>         uint32_t version;
> 
>         uint64_t blk_len;
> 
>         uint8_t  lbads;
>         uint8_t  iocs;
> 
>         uint8_t  rsvd18[3054];
> 
>         struct {
>             uint64_t zsze;
>             uint8_t  zdes;
>         } QEMU_PACKED zns;
> 
>         uint8_t  rsvd3081[1015];
>     } QEMU_PACKED NvmePstateHeader;
> 
> With such a header we have all we need. We can bail out if any
> parameters do not match and, similar to nvme data structures, it contains
> reserved areas for future use. I'll be posting a v2 with this. If this
> still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
> feature.
> 
> I can agree that we drop other optional features like zone excursions
> and the reset/finish recommended limit simulation, but PLEASE DO NOT
> remove persistence and upstream a half-baked version when we are so
> close and have time to get it right.

OK. Then let's move the persistence implementation to the last patch in the
series. This way, if it is still controversial, it will not block the rest.

Here is what I propose:
Dmitry: remove persistence stuff from your patches, address comments and resend.
Klaus: Rebase your persistence patch(es) with the reworked format on top of
Dmitry's series and send.

That creates a pipeline for reviews and persistence is not a blocker. And I
agree that other ZNS features can come after we get all of that done first.

Thoughts? Keith? Would that work for you?

-- 
Damien Le Moal
Western Digital Research



* RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-28  6:36     ` Klaus Jensen
  2020-09-28 21:25       ` Keith Busch
@ 2020-09-29 15:42       ` Dmitry Fomichev
  2020-09-29 18:39         ` Klaus Jensen
  1 sibling, 1 reply; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-29 15:42 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Monday, September 28, 2020 2:37 AM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Damien Le Moal
> <Damien.LeMoal@wdc.com>; Klaus Jensen <k.jensen@samsung.com>; Kevin
> Wolf <kwolf@redhat.com>; Philippe Mathieu-Daudé <philmd@redhat.com>;
> Maxim Levitsky <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>;
> Niklas Cassel <Niklas.Cassel@wdc.com>; qemu-block@nongnu.org; qemu-
> devel@nongnu.org; Alistair Francis <Alistair.Francis@wdc.com>; Matias
> Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 28 02:33, Dmitry Fomichev wrote:
> > > -----Original Message-----
> > > From: Klaus Jensen <its@irrelevant.dk>
> > >
> > > If it really needs to be memory mapped, then I think a hostmem-based
> > > approach similar to what Andrzej did for PMR is needed (I think that
> > > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > > slightly tricky to get it to work on all platforms AFAIK).
> >
> > Ok, it looks like using the HostMemoryBackendFile backend will be
> > more appropriate. This will remove the need for conditional compilation.
> >
> > The mmap() portability is pretty decent across software platforms.
> > Any poor Windows user who is forced to emulate ZNS on mingw will be
> > able to do so, just without having zone state persistency. Considering
> > how specialized this stuff is in the first place, I estimate the number of users
> > affected by this "limitation" to be exactly zero.
> >
> 
> QEMU is a cross platform project - we should strive for portability.
> 
> Alienating developers that use a Windows platform and calling them out
> as "poor" is not exactly good for the zoned ecosystem.
> 

Wow. By bringing up political correctness here you are basically admitting
that you have no real technical argument. The whole Windows
issue is a red herring that you are using to attack code that is absolutely
legit, but comes from a competitor. Your initial complaint was that it
doesn't compile in mingw and that it uses the "wrong" API. You have even
suggested the API to use. Now the code uses that API and builds fine, but
it's still not good simply because you "do not like it". It's a disgrace.

> > > But really,
> > > since we do not require memory semantics for this, then I think the
> > > abstraction is fundamentally wrong.
> > >
> >
> > Seriously, what is wrong with using mmap :) ? It is used successfully for
> > similar applications, for example -
> > https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> >
> 
> There is nothing fundamentally wrong with mmap. I just think it is the
> wrong abstraction here (and it limits portability for no good reason).
> For PMR there is a good reason - it requires memory semantics.
> 

We are trying to emulate NVMe controller NVRAM. The best abstraction
for emulating NVRAM would be... NVRAM!

> > > I am, of course, blowing my own horn, since my implementation uses a
> > > portable blockdev for this.
> > >
> >
> > You are making it sound like the entire WDC series relies on this approach.
> > Actually, the persistency is introduced in the second to last patch in the
> > series and it only adds a couple of lines of code in the i/o path to mark
> > zones dirty. This is possible because of using mmap() and I find the way
> > it is done to be quite elegant, not ugly :)
> >
> 
> No, I understand that your implementation works fine without
> persistence, but persistence is key. That is why my series adds it in
> the first patch. Without persistence it is just a toy. And the QEMU
> device is not just an "NVMe-version" of null_blk.
> 
> And I don't think I ever called the use of mmap ugly. I called out the
> physical memory API shenanigans as a hack.
> 
> > > Another issue is the complete lack of endian conversions. Does it
> > > matter? It depends. Will anyone ever use this on a big endian host and
> > > move the meta data backing file to a little endian host? Probably not.
> > > So does it really matter? Probably not, but it is cutting corners.
> > >
> 
> After I had replied this, I considered a follow-up, because there are
> probably QEMU developers that would call me out on this.
> 
> This definitely DOES matter to QEMU.
> 
> >
> > Great point on endianness! Naturally, all file backed values are stored in
> > their native endianness. This way, there is no extra overhead on big endian
> > hardware architectures. Portability concerns can be easily addressed by
> > storing metadata endianness as a byte flag in its header. Then, during
> > initialization, the metadata validation code can detect the possible
> > discrepancy in endianness and automatically convert the metadata to the
> > endianness of the host. This part is out of scope of this series, but I would
> > be able to contribute such a solution as an enhancement in the future.
> >
> 
> It is not out of scope. I don't see why we should merge something that
> is arguably buggy.

Again, wow! Now you turned around and arbitrarily elevated this issue from
moderate ("Does it matter?", "cutting corners") to severe ("buggy"). Likely
because v5 of the WDC patchset has been posted. This, again, just shows your
lack of integrity as a maintainer.

This "issue" is a really trivial one to fix, as I described above, and you are
blowing it up way out of proportion, making it look like a fundamental
problem that cannot be resolved. It's not.

> 
> Bottomline is that I just don't see why we should accept an
> implementation that
> 
>   a) excludes some platforms (Windows) from using persistence; and
>   b) contains endianness conversion issues
> 
> when there is a portable implementation posted that at least tries to
> convert endianness as needed.

Doesn't that implementation discriminate against big endian architectures? :)
OK, it's a joke - with some folks I need to clarify this.
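
Joking aside, for anyone following along, the mmap pattern under discussion
amounts to roughly the following (a minimal sketch with made-up names, not
the actual WDC code):

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct zone_meta {
        uint64_t wp;      /* write pointer */
        uint8_t  state;   /* zone state */
    };

    /* Map the metadata file shared, so stores persist across restarts. */
    static struct zone_meta *meta_map(const char *path, size_t nr_zones)
    {
        size_t len = nr_zones * sizeof(struct zone_meta);
        void *p;
        int fd = open(path, O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
            return NULL;
        }
        if (ftruncate(fd, len) < 0) {
            close(fd);
            return NULL;
        }
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);   /* the mapping keeps the file referenced */
        return p == MAP_FAILED ? NULL : p;
    }

    /* The i/o path just stores into the mapping and marks zones dirty;
     * a zone state change triggers an msync() of the (page-aligned)
     * region so the on-disk copy is made consistent. */
    static void meta_sync(struct zone_meta *base, size_t nr_zones)
    {
        msync(base, nr_zones * sizeof(struct zone_meta), MS_SYNC);
    }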


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 10:46           ` Klaus Jensen
  2020-09-29 11:13             ` Damien Le Moal
@ 2020-09-29 15:43             ` Dmitry Fomichev
  2020-09-29 16:36               ` Klaus Jensen
  2020-09-29 17:29             ` Keith Busch
  2 siblings, 1 reply; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-29 15:43 UTC (permalink / raw)
  To: Klaus Jensen, Damien Le Moal
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Niklas Cassel, Klaus Jensen,
	qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Qemu-block <qemu-block-
> bounces+dmitry.fomichev=wdc.com@nongnu.org> On Behalf Of Klaus
> Jensen
> Sent: Tuesday, September 29, 2020 6:47 AM
> To: Damien Le Moal <Damien.LeMoal@wdc.com>
> Cc: Fam Zheng <fam@euphon.net>; Kevin Wolf <kwolf@redhat.com>; qemu-
> block@nongnu.org; Niklas Cassel <Niklas.Cassel@wdc.com>; Klaus Jensen
> <k.jensen@samsung.com>; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Keith Busch <kbusch@kernel.org>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Matias Bjorling
> <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 28 22:54, Damien Le Moal wrote:
> > On 2020/09/29 6:25, Keith Busch wrote:
> > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> > >> On Sep 28 02:33, Dmitry Fomichev wrote:
> > >>> You are making it sound like the entire WDC series relies on this
> > >>> approach.
> > >>> Actually, the persistency is introduced in the second to last patch in the
> > >>> series and it only adds a couple of lines of code in the i/o path to mark
> > >>> zones dirty. This is possible because of using mmap() and I find the way
> > >>> it is done to be quite elegant, not ugly :)
> > >>>
> > >>
> > >> No, I understand that your implementation works fine without
> > >> persistence, but persistence is key. That is why my series adds it in
> > >> the first patch. Without persistence it is just a toy. And the QEMU
> > >> device is not just an "NVMe-version" of null_blk.
> > >
> > > I really think we should be a bit more cautious about committing to an
> > > on-disk format for the persistent state. Both this and Klaus' persistent
> > > state feel a bit ad-hoc, and with all the other knobs provided, it
> > > looks too easy to have out-of-sync states, or just not being able to
> > > boot at all if qemu versions have different on-disk formats.
> > >
> > > Is anyone really considering zone emulation for production level stuff
> > > anyway? I can't imagine a real scenario where you'd want to put yourself
> > > through that: you are just giving yourself all the downsides of a zoned
> > > block device and none of the benefits. AFAIK, this is provided as a
> > > development vehicle, closer to a "toy".
> > >
> > > I think we should consider trimming this down to a more minimal set that
> > > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > > bells & whistles and flush out the meta data's data marshalling scheme
> > > for persistence later.
> >
> > +1 on this. Removing the persistence also removes the debate on
> > endianness. With that out of the way, it should be straightforward to get
> > agreement on a series that can be merged quickly to get developers started
> > with testing ZNS software with QEMU. That is the most important goal here.
> > 5.9 is around the corner, we need something for people to get started with
> > ZNS quickly.
> >
> 
> Wait. What. No. Stop!
> 
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.
> 
> Please, work with me on this instead of just removing such an essential
> feature. Since persistence seems to be the only thing we are really
> discussing, we should have plenty of time until the soft-freeze to come
> up with a proper solution on that.
> 
> I agree that my version had a format that was pretty ad-hoc and that
> won't fly - it needs magic and version capabilities like in Dmitry's
> series, which incidentally looks a lot like what we did in the
> OpenChannel implementation, so I agree with the strategy.

Are you insinuating that I somehow took stuff from OCSSD code and am trying
to claim priority this way? I am not at all that familiar with that code.
And I've already sent you the link to tcmu-runner code that served me
as an inspiration for implementing persistence in WDC patchset.
That code has been around for years, uses mmap, works great and has
nothing to do with you.

> 
> ZNS-wise, the only thing my implementation stores is the zone
> descriptors (in spec-native little-endian format) and the zone
> descriptor extensions. So there are no endian issues with those. The
> allocation tracking bitmap is always stored in little endian, but
> converted to big-endian if running on a big-endian host.
> 
> Let me just conjure something up.
> 
>     #define NVME_PSTATE_MAGIC ...
>     #define NVME_PSTATE_V1    1
> 
>     typedef struct NvmePstateHeader {
>         uint32_t magic;
>         uint32_t version;
> 
>         uint64_t blk_len;
> 
>         uint8_t  lbads;
>         uint8_t  iocs;
> 
>         uint8_t  rsvd18[3054];
> 
>         struct {
>             uint64_t zsze;
>             uint8_t  zdes;
>         } QEMU_PACKED zns;
> 
>         uint8_t  rsvd3089[1007];
>     } QEMU_PACKED NvmePstateHeader;
> 

Why conjure something that already exists in the WDC patchset? And that part
has been published in the very first version of our patches, weeks before
your entire ZNS series was posted. Add an rsvd[] here and there and that code
can be as suitable to achieve the stated goals as what you have above.

> With such a header we have all we need. We can bail out if any
> parameters do not match and similar to nvme data structures it contains
> reserved areas for future use. I'll be posting a v2 with this. If this
> still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
> feature.
> 
> I can agree that we drop other optional features like zone excursions
> and the reset/finish recommended limit simulation, but PLEASE DO NOT
> remove persistence and upstream a half-baked version when we are so
> close and have time to get it right.

One important thing IMO is to reduce the future need for metadata versioning.
This requires a really good effort to design and review a proper metadata
format that would stay stable for some time. Think about portability.
Putting out something "conjured up" now and then having to move to V2
metadata in the next release would be even worse than no persistence at all.
So maybe it makes sense to go with Keith's suggestion.
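
Whatever layout wins, the load-time check itself is simple enough. A sketch
with hypothetical names, mirroring the fields quoted above (the magic value
is illustrative):

    #include <stdint.h>

    #define PSTATE_MAGIC 0x4e564d65u   /* illustrative value */
    #define PSTATE_V1    1

    struct pstate_header {
        uint32_t magic;
        uint32_t version;
        uint64_t blk_len;
        uint8_t  lbads;
        uint8_t  iocs;
    };

    /* Refuse to load state whose magic, version or geometry does not
     * match the device's current configuration. */
    static int pstate_validate(const struct pstate_header *h,
                               uint64_t blk_len, uint8_t lbads,
                               uint8_t iocs)
    {
        if (h->magic != PSTATE_MAGIC || h->version != PSTATE_V1) {
            return -1;   /* unknown file or format revision */
        }
        if (h->blk_len != blk_len || h->lbads != lbads || h->iocs != iocs) {
            return -1;   /* parameters changed since state was written */
        }
        return 0;
    }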

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 15:43             ` Dmitry Fomichev
@ 2020-09-29 16:36               ` Klaus Jensen
  0 siblings, 0 replies; 42+ messages in thread
From: Klaus Jensen @ 2020-09-29 16:36 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 7947 bytes --]

On Sep 29 15:43, Dmitry Fomichev wrote:
> > -----Original Message-----
> > From: Qemu-block <qemu-block-
> > bounces+dmitry.fomichev=wdc.com@nongnu.org> On Behalf Of Klaus
> > Jensen
> > Sent: Tuesday, September 29, 2020 6:47 AM
> > To: Damien Le Moal <Damien.LeMoal@wdc.com>
> > Cc: Fam Zheng <fam@euphon.net>; Kevin Wolf <kwolf@redhat.com>; qemu-
> > block@nongnu.org; Niklas Cassel <Niklas.Cassel@wdc.com>; Klaus Jensen
> > <k.jensen@samsung.com>; qemu-devel@nongnu.org; Alistair Francis
> > <Alistair.Francis@wdc.com>; Keith Busch <kbusch@kernel.org>; Philippe
> > Mathieu-Daudé <philmd@redhat.com>; Matias Bjorling
> > <Matias.Bjorling@wdc.com>
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > and Zoned Namespace Command Set
> > 
> > On Sep 28 22:54, Damien Le Moal wrote:
> > > On 2020/09/29 6:25, Keith Busch wrote:
> > > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> > > >> On Sep 28 02:33, Dmitry Fomichev wrote:
> > > >>> You are making it sound like the entire WDC series relies on this
> > > >>> approach.
> > > >>> Actually, the persistency is introduced in the second to last patch in the
> > > >>> series and it only adds a couple of lines of code in the i/o path to mark
> > > >>> zones dirty. This is possible because of using mmap() and I find the way
> > > >>> it is done to be quite elegant, not ugly :)
> > > >>>
> > > >>
> > > >> No, I understand that your implementation works fine without
> > > >> persistence, but persistence is key. That is why my series adds it in
> > > >> the first patch. Without persistence it is just a toy. And the QEMU
> > > >> device is not just an "NVMe-version" of null_blk.
> > > >
> > > > I really think we should be a bit more cautious about committing to an
> > > > on-disk format for the persistent state. Both this and Klaus' persistent
> > > > state feel a bit ad-hoc, and with all the other knobs provided, it
> > > > looks too easy to have out-of-sync states, or just not being able to
> > > > boot at all if qemu versions have different on-disk formats.
> > > >
> > > > Is anyone really considering zone emulation for production level stuff
> > > > anyway? I can't imagine a real scenario where you'd want to put yourself
> > > > through that: you are just giving yourself all the downsides of a zoned
> > > > block device and none of the benefits. AFAIK, this is provided as a
> > > > development vehicle, closer to a "toy".
> > > >
> > > > I think we should consider trimming this down to a more minimal set that
> > > > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > > > bells & whistles and flush out the meta data's data marshalling scheme
> > > > for persistence later.
> > >
> > > +1 on this. Removing the persistence also removes the debate on
> > > endianness. With that out of the way, it should be straightforward to get
> > > agreement on a series that can be merged quickly to get developers started
> > > with testing ZNS software with QEMU. That is the most important goal here.
> > > 5.9 is around the corner, we need something for people to get started with
> > > ZNS quickly.
> > >
> > 
> > Wait. What. No. Stop!
> > 
> > It is unmistakably clear that you are invalidating my arguments about
> > portability and endianness issues by suggesting that we just remove
> > persistent state and deal with it later, but persistence is the killer
> > feature that sets the QEMU emulated device apart from other emulation
> > options. It is not about using emulation in production (because yeah,
> > why would you?), but persistence is what makes it possible to develop
> > and test "zoned FTLs" or something that requires recovery at power up.
> > This is what allows testing of how your host software deals with opened
> > zones being transitioned to FULL on power up and the persistent tracking
> > of LBA allocation (in my series) can be used to properly test error
> > recovery if you lost state in the app.
> > 
> > Please, work with me on this instead of just removing such an essential
> > feature. Since persistence seems to be the only thing we are really
> > discussing, we should have plenty of time until the soft-freeze to come
> > up with a proper solution on that.
> > 
> > I agree that my version had a format that was pretty ad-hoc and that
> > won't fly - it needs magic and version capabilities like in Dmitry's
> > series, which incidentally looks a lot like what we did in the
> > OpenChannel implementation, so I agree with the strategy.
> 
> Are you insinuating that I somehow took stuff from OCSSD code and am trying
> to claim priority this way? I am not at all that familiar with that code.
> And I've already sent you the link to tcmu-runner code that served me
> as an inspiration for implementing persistence in WDC patchset.
> That code has been around for years, uses mmap, works great and has
> nothing to do with you.
> 

No. I am not insinuating anything. The OpenChannel device also used a
blockdev, but, yes, incidentally (and sorry, I should not have used
that word), it looked like what we did there, and I noted that I agreed
with the strategy.

> > 
> > ZNS-wise, the only thing my implementation stores is the zone
> > descriptors (in spec-native little-endian format) and the zone
> > descriptor extensions. So there are no endian issues with those. The
> > allocation tracking bitmap is always stored in little endian, but
> > converted to big-endian if running on a big-endian host.
> > 
> > Let me just conjure something up.
> > 
> >     #define NVME_PSTATE_MAGIC ...
> >     #define NVME_PSTATE_V1    1
> > 
> >     typedef struct NvmePstateHeader {
> >         uint32_t magic;
> >         uint32_t version;
> > 
> >         uint64_t blk_len;
> > 
> >         uint8_t  lbads;
> >         uint8_t  iocs;
> > 
> >         uint8_t  rsvd18[3054];
> > 
> >         struct {
> >             uint64_t zsze;
> >             uint8_t  zdes;
> >         } QEMU_PACKED zns;
> > 
> >         uint8_t  rsvd3089[1007];
> >     } QEMU_PACKED NvmePstateHeader;
> > 
> 
> Why conjure something that already exists in the WDC patchset? And that part
> has been published in the very first version of our patches, weeks before
> your entire ZNS series was posted. Add an rsvd[] here and there and that code
> can be as suitable to achieve the stated goals as what you have above.
> 

Yes, I read your code. I know you have a header and I also noted above
that "it needs magic and version capabilities like in Dmitry's series".

> > With such a header we have all we need. We can bail out if any
> > parameters do not match and similar to nvme data structures it contains
> > reserved areas for future use. I'll be posting a v2 with this. If this
> > still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
> > feature.
> > 
> > I can agree that we drop other optional features like zone excursions
> > and the reset/finish recommended limit simulation, but PLEASE DO NOT
> > remove persistence and upstream a half-baked version when we are so
> > close and have time to get it right.
> 
> One important thing IMO is to reduce the future need for metadata versioning.
> This requires a really good effort to design and review a proper metadata
> format that would stay stable for some time. Think about portability.
> Putting out something "conjured up" now and then having to move to V2
> metadata in the next release would be even worse than no persistence at all.
> So maybe it makes sense to go with Keith's suggestion.

As I've said, we have time until the soft-freeze to get this right. I
"conjured" something up to make a point. The reason we review and
iterate is to NOT upstream something that is conjured up.

But we gotta start somewhere, no? So what is so bad about me posting a
v2?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 10:46           ` Klaus Jensen
  2020-09-29 11:13             ` Damien Le Moal
  2020-09-29 15:43             ` Dmitry Fomichev
@ 2020-09-29 17:29             ` Keith Busch
  2020-09-29 18:00               ` Klaus Jensen
  2 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2020-09-29 17:29 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.

Hold up -- why does an OPEN zone transition to FULL on power up? The
spec suggests it should be CLOSED. The spec does appear to support going
to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
at this part of the spec, these implicit transitions seem a bit less
clear than I expected. I'm not sure it's clear enough to evaluate qemu's
compliance right now.

But I don't see what testing these transitions has to do with having a
persistent state. You can reboot your VM without tearing down the
running QEMU instance. You can also unbind the driver or shutdown the
controller within the running operating system. That should make those
implicit state transitions reachable in order to exercise your FTL's
recovery.

I agree the persistent state provides conveniences for developers. I
just don't want to gate ZNS enabling on it either since the core design
doesn't depend on it.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 11:13             ` Damien Le Moal
@ 2020-09-29 17:44               ` Keith Busch
  0 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2020-09-29 17:44 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Niklas Cassel, Klaus Jensen,
	qemu-devel, Alistair Francis, Klaus Jensen,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Tue, Sep 29, 2020 at 11:13:51AM +0000, Damien Le Moal wrote:
> OK. Then let's make the persistence implementation the last patch in the
> series. This way, if it is still controversial, it will not block the rest.
> 
> Here is what I propose:
> Dmitry: remove persistence stuff from your patches, address comments and resend.
> Klaus: Rebase your persistence patch(es) with the reworked format on top of
> Dmitry's series and send.
> 
> That creates a pipeline for reviews and persistence is not a blocker. And I
> agree that other ZNS feature can come after we get all of that done first.
> 
> Thoughts ? Keith ? Would that work for you ?

That works for me. I will have comments for Dmitry's v5, though, so
please wait one more day before considering a respin.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 17:29             ` Keith Busch
@ 2020-09-29 18:00               ` Klaus Jensen
  2020-09-29 18:15                 ` Keith Busch
  2020-09-29 18:17                 ` Matias Bjorling
  0 siblings, 2 replies; 42+ messages in thread
From: Klaus Jensen @ 2020-09-29 18:00 UTC (permalink / raw)
  To: Keith Busch
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 3081 bytes --]

On Sep 29 10:29, Keith Busch wrote:
> On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > It is unmistakably clear that you are invalidating my arguments about
> > portability and endianness issues by suggesting that we just remove
> > persistent state and deal with it later, but persistence is the killer
> > feature that sets the QEMU emulated device apart from other emulation
> > options. It is not about using emulation in production (because yeah,
> > why would you?), but persistence is what makes it possible to develop
> > and test "zoned FTLs" or something that requires recovery at power up.
> > This is what allows testing of how your host software deals with opened
> > zones being transitioned to FULL on power up and the persistent tracking
> > of LBA allocation (in my series) can be used to properly test error
> > recovery if you lost state in the app.
> 
> Hold up -- why does an OPEN zone transition to FULL on power up? The
> spec suggests it should be CLOSED. The spec does appear to support going
> to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> at this part of the spec, these implicit transitions seem a bit less
> clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> compliance right now.
> 
> But I don't see what testing these transitions has to do with having a
> persistent state. You can reboot your VM without tearing down the
> running QEMU instance. You can also unbind the driver or shutdown the
> controller within the running operating system. That should make those
> implicit state transitions reachable in order to exercise your FTL's
> recovery.
> 

Oh dear - don't "spec" with me ;)

NVMe v1.4 Section 7.3.1:

    An NVM Subsystem Reset is initiated when:
      * Main power is applied to the NVM subsystem;
      * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
        field;
      * Requested using a method defined in the NVMe Management
        Interface specification; or
      * A vendor specific event occurs.

In the context of QEMU, "Main power" is tearing down QEMU and starting
it from scratch. Just like on a "real" host, unbinding the driver,
rebooting or shutting down the controller does not cause a subsystem
reset (and does not cause the zones to change state). And since the
device does not indicate support for the optional NSSR.NSSRC register,
that way to initiate a subsystem reset cannot be used.

The reason for moving to FULL is that write pointer updates are not
persisted on each advancement, only when the zone state changes. So
zones that were opened might have valid data, but invalid write pointer.
So the device transitions them to FULL as it is allowed to.

                                                        QED.
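
Concretely, the rule above comes down to something like this (a sketch with
hypothetical names; the series itself is the authoritative source):

    #include <stdint.h>

    enum zone_state {
        ZONE_EMPTY, ZONE_IMPLICITLY_OPEN, ZONE_EXPLICITLY_OPEN,
        ZONE_CLOSED, ZONE_FULL,
    };

    struct zone {
        enum zone_state state;
        uint64_t wp;      /* write pointer */
        uint64_t zslba;   /* zone start LBA */
        uint64_t zcap;    /* zone capacity */
    };

    /* On power-up: the write pointer was only persisted on zone state
     * changes, so a zone that was open has an untrustworthy wp and is
     * finalized to FULL, which the spec permits after an NVM Subsystem
     * Reset. */
    static void zone_power_on(struct zone *z)
    {
        switch (z->state) {
        case ZONE_IMPLICITLY_OPEN:
        case ZONE_EXPLICITLY_OPEN:
            z->state = ZONE_FULL;
            z->wp = z->zslba + z->zcap;
            break;
        default:
            break;   /* EMPTY, CLOSED and FULL come up as-is */
        }
    }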

> I agree the persistent state provides conveniences for developers. I
> just don't want to gate ZNS enabling on it either since the core design
> doesn't depend on it.

I just don't see why we can't have the icing on the cake when it is
already there :)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 18:00               ` Klaus Jensen
@ 2020-09-29 18:15                 ` Keith Busch
  2020-09-29 18:18                   ` Klaus Jensen
  2020-09-29 18:17                 ` Matias Bjorling
  1 sibling, 1 reply; 42+ messages in thread
From: Keith Busch @ 2020-09-29 18:15 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Tue, Sep 29, 2020 at 08:00:04PM +0200, Klaus Jensen wrote:
> On Sep 29 10:29, Keith Busch wrote:
> > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > It is unmistakably clear that you are invalidating my arguments about
> > > portability and endianness issues by suggesting that we just remove
> > > persistent state and deal with it later, but persistence is the killer
> > > feature that sets the QEMU emulated device apart from other emulation
> > > options. It is not about using emulation in production (because yeah,
> > > why would you?), but persistence is what makes it possible to develop
> > > and test "zoned FTLs" or something that requires recovery at power up.
> > > This is what allows testing of how your host software deals with opened
> > > zones being transitioned to FULL on power up and the persistent tracking
> > > of LBA allocation (in my series) can be used to properly test error
> > > recovery if you lost state in the app.
> > 
> > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > spec suggests it should be CLOSED. The spec does appear to support going
> > to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> > at this part of the spec, these implicit transitions seem a bit less
> > clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> > compliance right now.
> > 
> > But I don't see what testing these transitions has to do with having a
> > persistent state. You can reboot your VM without tearing down the
> > running QEMU instance. You can also unbind the driver or shutdown the
> > controller within the running operating system. That should make those
> > implicit state transitions reachable in order to exercise your FTL's
> > recovery.
> > 
> 
> Oh dear - don't "spec" with me ;)
> 
> NVMe v1.4 Section 7.3.1:
> 
>     An NVM Subsystem Reset is initiated when:
>       * Main power is applied to the NVM subsystem;
>       * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
>         field;
>       * Requested using a method defined in the NVMe Management
>         Interface specification; or
>       * A vendor specific event occurs.
 
Okay. I wish the nvme twg would strip the changelog from the published
TPs. We have unhelpful statements like this in the ZNS spec:

  "Default active zones to transition to Closed state on power/controller reset."

> In the context of QEMU, "Main power" is tearing down QEMU and starting
> it from scratch. Just like on a "real" host, unbinding the driver,
> rebooting or shutting down the controller does not cause a subsystem
> reset (and does not cause the zones to change state). 

That can't be right. The ZNS spec says:

  The initial state of a zone state machine is set as a result of:
    a) an NVM Subsystem Reset; or
    b) all controllers in the NVM subsystem reporting Shutdown
>        processing complete (i.e., 10b in the Shutdown Status (SHST)
>        register)

So a CC.SHN had better cause an implicit transition of open zones to
their "initial" state since 'open' is not a valid initial state.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 18:00               ` Klaus Jensen
  2020-09-29 18:15                 ` Keith Busch
@ 2020-09-29 18:17                 ` Matias Bjorling
  2020-09-29 18:36                   ` Klaus Jensen
  1 sibling, 1 reply; 42+ messages in thread
From: Matias Bjorling @ 2020-09-29 18:17 UTC (permalink / raw)
  To: Klaus Jensen, Keith Busch
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis,
	Philippe Mathieu-Daudé



> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Tuesday, 29 September 2020 20.00
> To: Keith Busch <kbusch@kernel.org>
> Cc: Damien Le Moal <Damien.LeMoal@wdc.com>; Fam Zheng
> <fam@euphon.net>; Kevin Wolf <kwolf@redhat.com>; qemu-
> block@nongnu.org; Niklas Cassel <Niklas.Cassel@wdc.com>; Klaus Jensen
> <k.jensen@samsung.com>; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Philippe Mathieu-Daudé <philmd@redhat.com>;
> Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> Zoned Namespace Command Set
> 
> On Sep 29 10:29, Keith Busch wrote:
> > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > It is unmistakably clear that you are invalidating my arguments
> > > about portability and endianness issues by suggesting that we just
> > > remove persistent state and deal with it later, but persistence is
> > > the killer feature that sets the QEMU emulated device apart from
> > > other emulation options. It is not about using emulation in
> > > production (because yeah, why would you?), but persistence is what
> > > makes it possible to develop and test "zoned FTLs" or something that
> > > requires recovery at power up.
> > > This is what allows testing of how your host software deals with
> > > opened zones being transitioned to FULL on power up and the
> > > persistent tracking of LBA allocation (in my series) can be used to
> > > properly test error recovery if you lost state in the app.
> >
> > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > spec suggests it should be CLOSED. The spec does appear to support
> > going to FULL on a NVM Subsystem Reset, though. Actually, now that I'm
> > looking at this part of the spec, these implicit transitions seem a
> > bit less clear than I expected. I'm not sure it's clear enough to
> > evaluate qemu's compliance right now.
> >
> > But I don't see what testing these transitions has to do with having a
> > persistent state. You can reboot your VM without tearing down the
> > running QEMU instance. You can also unbind the driver or shutdown the
> > controller within the running operating system. That should make those
> > implicit state transitions reachable in order to exercise your FTL's
> > recovery.
> >
> 
> Oh dear - don't "spec" with me ;)
> 
> NVMe v1.4 Section 7.3.1:
> 
>     An NVM Subsystem Reset is initiated when:
>       * Main power is applied to the NVM subsystem;
>       * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
>         field;
>       * Requested using a method defined in the NVMe Management
>         Interface specification; or
>       * A vendor specific event occurs.
> 
> In the context of QEMU, "Main power" is tearing down QEMU and starting it
> from scratch. Just like on a "real" host, unbinding the driver, rebooting or
> shutting down the controller does not cause a subsystem reset (and does not
> cause the zones to change state). And since the device does not indicate
> support for the optional NSSR.NSSRC register, that way to initiate a subsystem
> cannot be used.
> 
> The reason for moving to FULL is that write pointer updates are not persisted
> on each advancement, only when the zone state changes. So zones that were
> opened might have valid data, but invalid write pointer.
> So the device transitions them to FULL as it is allowed to.
> 

What about when one must also recover from intermediate states (i.e., open/closed upon power loss)? For example, I hope a real SSD implementation doesn't transition zones to full when it has thousands of zones open simultaneously. That could be a disaster for the PE cycles, and a lot of media would go to waste. One would want applications to support that kind of failure mode as well. 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 18:15                 ` Keith Busch
@ 2020-09-29 18:18                   ` Klaus Jensen
  0 siblings, 0 replies; 42+ messages in thread
From: Klaus Jensen @ 2020-09-29 18:18 UTC (permalink / raw)
  To: Keith Busch
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 3353 bytes --]

On Sep 29 11:15, Keith Busch wrote:
> On Tue, Sep 29, 2020 at 08:00:04PM +0200, Klaus Jensen wrote:
> > On Sep 29 10:29, Keith Busch wrote:
> > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > It is unmistakably clear that you are invalidating my arguments about
> > > > portability and endianness issues by suggesting that we just remove
> > > > persistent state and deal with it later, but persistence is the killer
> > > > feature that sets the QEMU emulated device apart from other emulation
> > > > options. It is not about using emulation in production (because yeah,
> > > > why would you?), but persistence is what makes it possible to develop
> > > > and test "zoned FTLs" or something that requires recovery at power up.
> > > > This is what allows testing of how your host software deals with opened
> > > > zones being transitioned to FULL on power up and the persistent tracking
> > > > of LBA allocation (in my series) can be used to properly test error
> > > > recovery if you lost state in the app.
> > > 
> > > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > > spec suggests it should be CLOSED. The spec does appear to support going
> > > to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> > > at this part of the spec, these implicit transitions seem a bit less
> > > clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> > > compliance right now.
> > > 
> > > But I don't see what testing these transitions has to do with having a
> > > persistent state. You can reboot your VM without tearing down the
> > > running QEMU instance. You can also unbind the driver or shutdown the
> > > controller within the running operating system. That should make those
> > > implicit state transitions reachable in order to exercise your FTL's
> > > recovery.
> > > 
> > 
> > Oh dear - don't "spec" with me ;)
> > 
> > NVMe v1.4 Section 7.3.1:
> > 
> >     An NVM Subsystem Reset is initiated when:
> >       * Main power is applied to the NVM subsystem;
> >       * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> >         field;
> >       * Requested using a method defined in the NVMe Management
> >         Interface specification; or
> >       * A vendor specific event occurs.
>  
> Okay. I wish the nvme twg would strip the changelog from the published
> TPs. We have unhelpful statements like this in the ZNS spec:
> 
>   "Default active zones to transition to Closed state on power/controller reset."
> 
> > In the context of QEMU, "Main power" is tearing down QEMU and starting
> > it from scratch. Just like on a "real" host, unbinding the driver,
> > rebooting or shutting down the controller does not cause a subsystem
> > reset (and does not cause the zones to change state). 
> 
> That can't be right. The ZNS spec says:
> 
>   The initial state of a zone state machine is set as a result of:
>     a) an NVM Subsystem Reset; or
>     b) all controllers in the NVM subsystem reporting Shutdown
> >        processing complete (i.e., 10b in the Shutdown Status (SHST)
> >        register)
> 
> So a CC.SHN had better cause an implicit transition of open zones to
> their "initial" state since 'open' is not a valid initial state.

Oh snap; true, you got me there.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 18:17                 ` Matias Bjorling
@ 2020-09-29 18:36                   ` Klaus Jensen
  2020-09-29 19:42                     ` Matias Bjorling
  0 siblings, 1 reply; 42+ messages in thread
From: Klaus Jensen @ 2020-09-29 18:36 UTC (permalink / raw)
  To: Matias Bjorling
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 4677 bytes --]

On Sep 29 18:17, Matias Bjorling wrote:
> 
> 
> > -----Original Message-----
> > From: Klaus Jensen <its@irrelevant.dk>
> > Sent: Tuesday, 29 September 2020 20.00
> > To: Keith Busch <kbusch@kernel.org>
> > Cc: Damien Le Moal <Damien.LeMoal@wdc.com>; Fam Zheng
> > <fam@euphon.net>; Kevin Wolf <kwolf@redhat.com>; qemu-
> > block@nongnu.org; Niklas Cassel <Niklas.Cassel@wdc.com>; Klaus Jensen
> > <k.jensen@samsung.com>; qemu-devel@nongnu.org; Alistair Francis
> > <Alistair.Francis@wdc.com>; Philippe Mathieu-Daudé <philmd@redhat.com>;
> > Matias Bjorling <Matias.Bjorling@wdc.com>
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> > Zoned Namespace Command Set
> > 
> > On Sep 29 10:29, Keith Busch wrote:
> > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > It is unmistakably clear that you are invalidating my arguments
> > > > about portability and endianness issues by suggesting that we just
> > > > remove persistent state and deal with it later, but persistence is
> > > > the killer feature that sets the QEMU emulated device apart from
> > > > other emulation options. It is not about using emulation in
> > > > production (because yeah, why would you?), but persistence is what
> > > > makes it possible to develop and test "zoned FTLs" or something that
> > > > requires recovery at power up.
> > > > This is what allows testing of how your host software deals with
> > > > opened zones being transitioned to FULL on power up and the
> > > > persistent tracking of LBA allocation (in my series) can be used to
> > > > properly test error recovery if you lost state in the app.
> > >
> > > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > > spec suggests it should be CLOSED. The spec does appear to support
> > > going to FULL on a NVM Subsystem Reset, though. Actually, now that I'm
> > > looking at this part of the spec, these implicit transitions seem a
> > > bit less clear than I expected. I'm not sure it's clear enough to
> > > evaluate qemu's compliance right now.
> > >
> > > But I don't see what testing these transitions has to do with having a
> > > persistent state. You can reboot your VM without tearing down the
> > > running QEMU instance. You can also unbind the driver or shutdown the
> > > controller within the running operating system. That should make those
> > > implicit state transitions reachable in order to exercise your FTL's
> > > recovery.
> > >
> > 
> > Oh dear - don't "spec" with me ;)
> > 
> > NVMe v1.4 Section 7.3.1:
> > 
> >     An NVM Subsystem Reset is initiated when:
> >       * Main power is applied to the NVM subsystem;
> >       * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> >         field;
> >       * Requested using a method defined in the NVMe Management
> >         Interface specification; or
> >       * A vendor specific event occurs.
> > 
> > In the context of QEMU, "Main power" is tearing down QEMU and starting it
> > from scratch. Just like on a "real" host, unbinding the driver, rebooting or
> > shutting down the controller does not cause a subsystem reset (and does not
> > cause the zones to change state). And since the device does not indicate
> > support for the optional NSSR.NSSRC register, that way to initiate a subsystem
> > cannot be used.
> > 
> > The reason for moving to FULL is that write pointer updates are not persisted
> > on each advancement, only when the zone state changes. So zones that were
> > opened might have valid data, but invalid write pointer.
> > So the device transitions them to FULL as it is allowed to.
> > 
> 
> What about when one must also recover from intermediate states (i.e.,
> open/closed upon power loss)? For example, I hope a real SSD
> implementation doesn't transition zones to full when it has thousands of
> zones open simultaneously. That could be a disaster for the PE cycles,
> and a lot of media would go to waste. One would want applications to
> support that kind of failure mode as well. 

Christ. The WDC Strike Force is really jumping out of lightspeed here.
I'm afraid I don't have an opposing force to engage with. So I'll be
your only boxing bag for the evening.

As Keith just said, "Opened" is not a valid initial state. Didn't you
write the spec? ;) As for Closed zones, they will be brought up as-is.

With that in mind, I'm not sure what you are specifically referring to. I'll
gently remind you that the QEMU nvme device is not a real SSD and does
not deal with NAND, so it does not really do any "recovering" of
intermediate states at power on, if that is what you mean.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 15:42       ` Dmitry Fomichev
@ 2020-09-29 18:39         ` Klaus Jensen
  2020-09-29 19:22           ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Klaus Jensen @ 2020-09-29 18:39 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 7940 bytes --]

On Sep 29 15:42, Dmitry Fomichev wrote:
> > -----Original Message-----
> > From: Klaus Jensen <its@irrelevant.dk>
> > Sent: Monday, September 28, 2020 2:37 AM
> > To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> > Cc: Keith Busch <kbusch@kernel.org>; Damien Le Moal
> > <Damien.LeMoal@wdc.com>; Klaus Jensen <k.jensen@samsung.com>; Kevin
> > Wolf <kwolf@redhat.com>; Philippe Mathieu-Daudé <philmd@redhat.com>;
> > Maxim Levitsky <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>;
> > Niklas Cassel <Niklas.Cassel@wdc.com>; qemu-block@nongnu.org; qemu-
> > devel@nongnu.org; Alistair Francis <Alistair.Francis@wdc.com>; Matias
> > Bjorling <Matias.Bjorling@wdc.com>
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > and Zoned Namespace Command Set
> > 
> > On Sep 28 02:33, Dmitry Fomichev wrote:
> > > > -----Original Message-----
> > > > From: Klaus Jensen <its@irrelevant.dk>
> > > >
> > > > If it really needs to be memory mapped, then I think a hostmem-based
> > > > approach similar to what Andrzej did for PMR is needed (I think that
> > > > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > > > slightly tricky to get it to work on all platforms AFAIK).
> > >
> > > Ok, it looks that using the HostMemoryBackendFile backend will be
> > > more appropriate. This will remove the need for conditional compile.
> > >
> > > The mmap() portability is pretty decent across software platforms.
> > > Any poor Windows user who is forced to emulate ZNS on mingw will be
> > > able to do so, just without having zone state persistency. Considering
> > > how specialized this stuff is in first place, I estimate the number of users
> > > affected by this "limitation" to be exactly zero.
> > >
> > 
> > QEMU is a cross platform project - we should strive for portability.
> > 
> > Alienating developers that use a Windows platform and calling them out
> > as "poor" is not exactly good for the zoned ecosystem.
> > 
> 
> Wow. By bringing up political correctness here you are basically admitting
> that you have no real technical argument.

I prefer that we support all platforms if and when we can. That's a
technical argument, not a personal one like those you are starting to use
now.

> The whole Windows issue is a red herring that you are using to attack
> the code that is absolutely legit, but comes from a competitor.

I can't even...

> Your initial complaint was that it doesn't compile in mingw and that
> it uses "wrong" API. You have even suggested the API to use. Now, the
> code uses that API and builds fine, but now it's still not good simply
> because you "do not like it". It's a disgrace.
> 

I answered this in a previous reply.

> > > > But really,
> > > > since we do not require memory semantics for this, then I think the
> > > > abstraction is fundamentally wrong.
> > > >
> > >
> > > Seriously, what is wrong with using mmap :) ? It is used successfully for
> > > similar applications, for example -
> > > https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> > >
> > 
> > There is nothing fundamentally wrong with mmap. I just think it is the
> > wrong abstraction here (and it limits portability for no good reason).
> > For PMR there is a good reason - it requires memory semantics.
> > 
> 
> We are trying to emulate NVMe controller NVRAM.  The best abstraction
> for emulating NVRAM would be... NVRAM!
> 

You never brought that up before, and sure, it could be a fair argument,
except it is not true.

PMR is emulating NVRAM (and requires memory semantics). Persistent state
is not emulating anything. It is an implementation detail.

> > > > I am, of course, blowing my own horn, since my implementation uses a
> > > > portable blockdev for this.
> > > >
> > >
> > > You are making it sound like the entire WDC series relies on this approach.
> > > Actually, the persistency is introduced in the second to last patch in the
> > > series and it only adds a couple of lines of code in the i/o path to mark
> > > zones dirty. This is possible because of using mmap() and I find the way
> > > it is done to be quite elegant, not ugly :)
> > >
> > 
> > No, I understand that your implementation works fine without
> > persistence, but persistence is key. That is why my series adds it in
> > the first patch. Without persistence it is just a toy. And the QEMU
> > device is not just an "NVMe-version" of null_blk.
> > 
> > And I don't think I ever called the use of mmap ugly. I called out the
> > physical memory API shenanigans as a hack.
> > 
> > > > Another issue is the complete lack of endian conversions. Does it
> > > > matter? It depends. Will anyone ever use this on a big endian host and
> > > > move the meta data backing file to a little endian host? Probably not.
> > > > So does it really matter? Probably not, but it is cutting corners.
> > > >
> > 
> > After I had replied, I considered a follow-up, because there are
> > probably QEMU developers that would call me out on this.
> > 
> > This definitely DOES matter to QEMU.
> > 
> > >
> > > Great point on endianness! Naturally, all file backed values are stored in
> > > their native endianness. This way, there is no extra overhead on big endian
> > > hardware architectures. Portability concerns can be easily addressed by
> > > storing metadata endianness as a byte flag in its header. Then, during
> > > initialization, the metadata validation code can detect the possible
> > > discrepancy in endianness and automatically convert the metadata to the
> > > endianness of the host. This part is out of scope of this series, but I would
> > > be able to contribute such a solution as an enhancement in the future.
> > >
> > 
> > It is not out of scope. I don't see why we should merge something that
> > is arguably buggy.
> 
> Again, wow! Now you turned around and arbitrarily elevated this issue from
> moderate ("Does it matter?", "cutting corners") to severe ("buggy"), likely
> because v5 of the WDC patchset has been posted.

No, exactly as I wrote above, after I hit reply I considered a
follow-up. I guess I should have.

> This, again, just shows your lack of integrity as a maintainer.
> 

I can't believe I just read that.

I will not put up with this. It is completely uncalled for. I stand up
for my opinions and I will fight to make sure the best possible code
goes upstream. Yes, I am paid by Samsung. But I can compartmentalize. I
was working on QEMU before Samsung and I know how to separate
corporate interest and open source. I have a proven record on this list
to show that. I really cannot believe that you brought it down to that
level. I have been putting forth technical arguments throughout this
entire review process and now you are getting personal.

Not. Cool. Please keep things professional from now on.

> This "issue" is a real trivial one to fix as I described above and you are
> blowing it up way out of proportion, making it look like it is a fundamental
> problem that can not be resolved. It's not.
> 

If it is so trivial to fix, please fix it. I think I made it clear that I
won't be happy until it is portable.

And please note that I have *not* complained about other parts of your
series. I have complained A LOT about the persistence implementation -
and I continue to stand behind those complaints.

I'm getting super tired of this one-sided process. I have continuously
reviewed and commented on your series, I have found multiple bugs, and I
have suggested improvements. Maybe if just one or two of those 9 people who
signed off on your zoned implementation could look past their own noses
and look at my series - you might just realize that it's decent, portable
and offers some niceties that yours does not have (at the cost of the same
amount of code, mind you).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 18:39         ` Klaus Jensen
@ 2020-09-29 19:22           ` Keith Busch
  2020-09-29 19:53             ` Dmitry Fomichev
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2020-09-29 19:22 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

All,

Let's de-escalate this, please. There's no reason to doubt Klaus wants
to see this work well, just as everyone else does. We unfortunately
have conflicting proposals posted, and everyone is passionate enough
about their work, but please simmer down.

As I mentioned earlier, I'd like to refocus on the basic implementation
and save the persistent state discussion for when the core is solid. After
going through it all, I feel there's enough to discuss there to keep us
busy for little while longer. Additional comments on the code will be
coming from me later today.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 18:36                   ` Klaus Jensen
@ 2020-09-29 19:42                     ` Matias Bjorling
  0 siblings, 0 replies; 42+ messages in thread
From: Matias Bjorling @ 2020-09-29 19:42 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé



> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Tuesday, 29 September 2020 20.36
> To: Matias Bjorling <Matias.Bjorling@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Damien Le Moal
> <Damien.LeMoal@wdc.com>; Fam Zheng <fam@euphon.net>; Kevin Wolf
> <kwolf@redhat.com>; qemu-block@nongnu.org; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Klaus Jensen <k.jensen@samsung.com>; qemu-
> devel@nongnu.org; Alistair Francis <Alistair.Francis@wdc.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> Zoned Namespace Command Set
> 
> On Sep 29 18:17, Matias Bjorling wrote:
> >
> >
> > > -----Original Message-----
> > > From: Klaus Jensen <its@irrelevant.dk>
> > > Sent: Tuesday, 29 September 2020 20.00
> > > To: Keith Busch <kbusch@kernel.org>
> > > Cc: Damien Le Moal <Damien.LeMoal@wdc.com>; Fam Zheng
> > > <fam@euphon.net>; Kevin Wolf <kwolf@redhat.com>; qemu-
> > > block@nongnu.org; Niklas Cassel <Niklas.Cassel@wdc.com>; Klaus
> > > Jensen <k.jensen@samsung.com>; qemu-devel@nongnu.org; Alistair
> > > Francis <Alistair.Francis@wdc.com>; Philippe Mathieu-Daudé
> > > <philmd@redhat.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> > > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > > and Zoned Namespace Command Set
> > >
> > > On Sep 29 10:29, Keith Busch wrote:
> > > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > > It is unmistakably clear that you are invalidating my arguments
> > > > > about portability and endianness issues by suggesting that we
> > > > > just remove persistent state and deal with it later, but
> > > > > persistence is the killer feature that sets the QEMU emulated
> > > > > device apart from other emulation options. It is not about using
> > > > > emulation in production (because yeah, why would you?), but
> > > > > persistence is what makes it possible to develop and test "zoned
> > > > > FTLs" or something that
> > > requires recovery at power up.
> > > > > This is what allows testing of how your host software deals with
> > > > > opened zones being transitioned to FULL on power up and the
> > > > > persistent tracking of LBA allocation (in my series) can be used
> > > > > to properly test error recovery if you lost state in the app.
> > > >
> > > > Hold up -- why does an OPEN zone transition to FULL on power up?
> > > > The spec suggests it should be CLOSED. The spec does appear to
> > > > support going to FULL on a NVM Subsystem Reset, though. Actually,
> > > > now that I'm looking at this part of the spec, these implicit
> > > > transitions seem a bit less clear than I expected. I'm not sure
> > > > it's clear enough to evaluate qemu's compliance right now.
> > > >
> > > > But I don't see what testing these transitions has to do with
> > > > having a persistent state. You can reboot your VM without tearing
> > > > down the running QEMU instance. You can also unbind the driver or
> > > > shutdown the controller within the running operating system. That
> > > > should make those implicit state transitions reachable in order to
> > > > exercise your FTL's recovery.
> > > >
> > >
> > > Oh dear - don't "spec" with me ;)
> > >
> > > NVMe v1.4 Section 7.3.1:
> > >
> > >     An NVM Subsystem Reset is initiated when:
> > >       * Main power is applied to the NVM subsystem;
> > >       * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> > >         field;
> > >       * Requested using a method defined in the NVMe Management
> > >         Interface specification; or
> > >       * A vendor specific event occurs.
> > >
> > > In the context of QEMU, "Main power" is tearing down QEMU and
> > > starting it from scratch. Just like on a "real" host, unbinding the
> > > driver, rebooting or shutting down the controller does not cause a
> > > subsystem reset (and does not cause the zones to change state). And
> > > since the device does not indicate support for the optional
> > > NSSR.NSSRC register, that way to initiate a subsystem reset cannot be used.
> > >
> > > The reason for moving to FULL is that write pointer updates are not
> > > persisted on each advancement, only when the zone state changes. So
> > > zones that were opened might have valid data, but invalid write pointer.
> > > So the device transitions them to FULL as it is allowed to.
> > >
> >
> > What about when one must also recover from intermediate states (i.e.,
> > open/closed upon power loss)? For example, I hope a real SSD
> > implementation doesn't transition zones to full when it has thousands
> > of zones open simultaneously. That could be a disaster for the PE
> > cycles, and a lot of media would go to waste. One would want
> > applications to support that kind of failure mode as well.
> 
> Christ. The WDC Strike Force is really jumping out of lightspeed here.
> I'm afraid I don't have an opposing force to engage with. So I'll be your only
> boxing bag for the evening.
> 
> As Keith just said, "Opened" is not a valid initial state. Didn't you write the
> spec? ;) As for Closed zones, they will be brought up as-is.

Upon power failure, a zone that is in the Explicitly Opened or the Implicitly Opened state and has LBAs written can be transitioned to either the Full or the Closed state by the controller.

In the previous mail, I wanted to point out that if the intention of qemu was to test applications upon power failures, it could be beneficial to have an option that allowed transitioning open zones to closed upon power failure.

Then applications can be tested with that in mind as well, without having access to an SSD that provides that kind of implementation.
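
A purely hypothetical device knob for this could be as small as the
following sketch (made-up names; neither series exposes such an option
today, as far as I can tell):

    #include <stdbool.h>

    enum zone_state {
        ZONE_IMPLICITLY_OPEN, ZONE_EXPLICITLY_OPEN,
        ZONE_CLOSED, ZONE_FULL, /* ... */
    };

    struct zone {
        enum zone_state state;
    };

    /* Would map to a device property, e.g. "close-zones-on-power-loss",
     * so host software can be tested against both allowed behaviors. */
    static bool close_on_power_loss;

    static void zone_power_failure(struct zone *z)
    {
        if (z->state == ZONE_IMPLICITLY_OPEN ||
            z->state == ZONE_EXPLICITLY_OPEN) {
            z->state = close_on_power_loss ? ZONE_CLOSED : ZONE_FULL;
        }
    }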

> 
> With that in mind, I'm not sure what you are specifically referring to. I'll
> gently remind you that the QEMU nvme device is not a real SSD and does not
> deal with NAND, so it does not really do any "recovering" of intermediate
> states on power-on, if that is what you mean.


* RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-09-29 19:22           ` Keith Busch
@ 2020-09-29 19:53             ` Dmitry Fomichev
  0 siblings, 0 replies; 42+ messages in thread
From: Dmitry Fomichev @ 2020-09-29 19:53 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling



> -----Original Message-----
> From: Keith Busch <kbusch@kernel.org>
> Sent: Tuesday, September 29, 2020 3:22 PM
> To: Klaus Jensen <its@irrelevant.dk>
> Cc: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>; Kevin Wolf
> <kwolf@redhat.com>; Fam Zheng <fam@euphon.net>; Damien Le Moal
> <Damien.LeMoal@wdc.com>; qemu-block@nongnu.org; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Klaus Jensen <k.jensen@samsung.com>; qemu-
> devel@nongnu.org; Alistair Francis <Alistair.Francis@wdc.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Matias Bjorling
> <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> All,
> 
> Let's de-escalate this, please. There's no reason to doubt that Klaus wants
> to see this work well, just as everyone else does. We unfortunately
> have conflicting proposals posted, and everyone is passionate enough
> about their work, but please simmer down.
> 
> As I mentioned earlier, I'd like to refocus on the basic implementation
> and save the persistent state discussion until the core is solid. After
> going through it all, I feel there's enough to discuss there to keep us
> busy for a little while longer. Additional comments on the code will be
> coming from me later today.

OK, I agree with this and I will not be replying to the email prior to this
one in the thread. Let's calm down so that we will be able to have a beer at
a conference one day :)

The one thing that I would like to cover is the lack of response to Klaus'
ZNS patchset. Klaus, you are right to complain about it. After discovering
the large backlog of NVMe patches that you had pending (something that we
were not aware of at the time of publishing our patches), we made the
decision to rebase our series on top of the patches that you had posted
before the publication of the WDC ZNS patchset. Since then, I got caught in
a constant cycle of rebasing our patches on top of your series, and that
prevented me from doing much in terms of reviewing your commits. Now that
we seem to have caught up with the current head of development, I should be
able to do more of this. There is absolutely no ill will involved :)

Dmitry



end of thread, other threads:[~2020-09-29 20:49 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-23 18:20 [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
2020-09-24 12:12   ` Klaus Jensen
2020-09-23 18:20 ` [PATCH v4 02/14] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 03/14] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 04/14] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces Dmitry Fomichev
2020-09-24 12:12   ` Klaus Jensen
2020-09-24 18:17     ` Niklas Cassel
2020-09-24 18:55       ` Klaus Jensen
2020-09-24 19:40         ` Niklas Cassel
2020-09-23 18:20 ` [PATCH v4 07/14] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 08/14] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
2020-09-25 18:24   ` Klaus Jensen
2020-09-23 18:20 ` [PATCH v4 10/14] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 11/14] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 12/14] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 13/14] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
2020-09-23 18:20 ` [PATCH v4 14/14] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
2020-09-24 21:07 ` [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Klaus Jensen
2020-09-28  2:33   ` Dmitry Fomichev
2020-09-28  6:36     ` Klaus Jensen
2020-09-28 21:25       ` Keith Busch
2020-09-28 22:54         ` Damien Le Moal
2020-09-29 10:46           ` Klaus Jensen
2020-09-29 11:13             ` Damien Le Moal
2020-09-29 17:44               ` Keith Busch
2020-09-29 15:43             ` Dmitry Fomichev
2020-09-29 16:36               ` Klaus Jensen
2020-09-29 17:29             ` Keith Busch
2020-09-29 18:00               ` Klaus Jensen
2020-09-29 18:15                 ` Keith Busch
2020-09-29 18:18                   ` Klaus Jensen
2020-09-29 18:17                 ` Matias Bjorling
2020-09-29 18:36                   ` Klaus Jensen
2020-09-29 19:42                     ` Matias Bjorling
2020-09-29 15:42       ` Dmitry Fomichev
2020-09-29 18:39         ` Klaus Jensen
2020-09-29 19:22           ` Keith Busch
2020-09-29 19:53             ` Dmitry Fomichev
