* [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
@ 2020-09-28  2:35 Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
                   ` (13 more replies)
  0 siblings, 14 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

v4 -> v5:

 - Rebase to the current qemu-nvme.

 - Use HostMemoryBackendFile as the backing storage for persistent
   zone metadata.

 - Fix the issue with filling the valid data in the next zone if RAZB
   (Read Across Zone Boundaries) is enabled.

v3 -> v4:

 - Fix bugs introduced in v2/v3 for QD > 1 operation. All writes to
   a zone now happen at the new write pointer variable, zone->w_ptr,
   which is advanced right after the backend i/o is submitted. The
   existing zone->d.wp variable is updated upon successful write
   completion and is used for zone reporting. Some code has been split
   from the nvme_finalize_zoned_write() function into a new function,
   nvme_advance_zone_wp() (see the first sketch after this list).

 - Make the code compile under mingw. Switch to using the QEMU API for
   mmap/msync, i.e. memory_region...(). Since mmap is not available in
   mingw (even though the mman-win32 library is available on Github),
   conditional compilation is added around these calls to avoid
   undefined symbols under mingw (see the second sketch after this
   list). A better fix would be to add stub functions to
   softmmu/memory.c for the case when CONFIG_POSIX is not defined, but
   such a change is beyond the scope of this patchset and can be made
   in a separate patch.

 - Correct permission mask used to open zone metadata file.

 - Fold "Define 64 bit cqe.result" patch into ZNS commit.

 - Use clz64/clz32 instead of defining nvme_ilog2() function.

 - Simplify rpt_empty_id_struct() code and move nvme_fill_data() back
   to the ZNS patch.

 - Fix a power-on processing bug.

 - Rename NVME_CMD_ZONE_APND to NVME_CMD_ZONE_APPEND.

 - Compile the list of review comments addressed in v2 of the series
   (see below).
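
A minimal sketch of the split write pointer scheme described in the
first item above (the helper names here are illustrative, not the
exact ones used in the patches):

    /*
     * zone->w_ptr is advanced as soon as the backend i/o is submitted,
     * so at QD > 1 each queued write gets its own LBA range. The
     * host-visible zone->d.wp only advances on successful completion.
     */
    static uint64_t nvme_zone_submit_slba(NvmeZone *zone, uint32_t nlb)
    {
        uint64_t slba = zone->w_ptr;

        zone->w_ptr += nlb;        /* submission-time write pointer */
        return slba;
    }

    static void nvme_zone_write_completed(NvmeZone *zone, uint32_t nlb)
    {
        zone->d.wp += nlb;         /* pointer reported to the host */
    }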
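
And a hypothetical illustration of the conditional compilation
mentioned in the mingw item (nvme_sync_zone_meta() and the zone_meta_mr
field are made-up names for this sketch):

    #ifdef CONFIG_POSIX
    static void nvme_sync_zone_meta(NvmeNamespace *ns, uint64_t off,
                                    uint64_t len)
    {
        /* flush dirty zone metadata back to the mapped file */
        memory_region_msync(&ns->zone_meta_mr, off, len);
    }
    #else
    static void nvme_sync_zone_meta(NvmeNamespace *ns, uint64_t off,
                                    uint64_t len)
    {
        /* mingw: no mmap/msync, zone metadata persistence is a no-op */
    }
    #endif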

v2 -> v3:

 - Moved nvme_fill_data() function to the NSTypes patch as it is
   now used there to output empty namespace identify structs.
 - Fixed typo in Maxim's email address.

v1 -> v2:

 - Rebased on top of qemu-nvme/next branch.
 - Incorporated feedback from Klaus and Alistair.
    * Allow a subset of CSE log to be read, not the entire log
    * Assign admin command entries in CSE log to ACS fields
    * Set LPA bit 1 to indicate support of CSE log page
    * Rename CC.CSS value CSS_ALL_NSTYPES (110b) to CSS_CSI
    * Move the code to assign lbaf.ds to a separate patch
    * Remove the change in firmware revision
    * Change "driver" to "device" in comments and annotations
    * Rename ZAMDS to ZASL
    * Correct a few format expressions and some wording in
      trace event definitions
    * Remove validation code to return NVME_CAP_EXCEEDED error
    * Make ZASL equal to MDTS if the "zone_append_size_limit"
      module parameter is not set
    * Clean up nvme_zoned_init_ctrl() to make size calculations
      less confusing
    * Avoid changing module parameters, use separate n/s variables
      if additional calculations are necessary to convert parameters
      to running values
    * Use NVME_DEFAULT_ZONE_SIZE to assign the default zone size value
    * Use default 0 for zone capacity meaning that zone capacity will
      be equal to zone size by default
    * Issue warnings if user MAR/MOR values are too large and have
      to be adjusted
    * Use unsigned values for MAR/MOR
 - Dropped "Simulate Zone Active excursions" patch.
   Excursion behavior may depend on the internal controller
   architecture and therefore be vendor-specific.
 - Dropped support for Zone Attributes and zoned AENs for now.
   These features can be added in a future series.
 - NS Types support is extended to handle active/inactive namespaces.
 - Update the write pointer after backing storage I/O completion, not
   before. This makes the emulation run correctly in the case of
   backing device failures.
 - Avoid division in the I/O path if the device zone size is
   a power of two (the most common case). The zone index can then be
   calculated with a bit shift (see the sketch after this list).
 - A few reported bugs have been fixed.
 - Indentation in function definitions has been changed to make it
   the same as the rest of the code.
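
A sketch of the power-of-two fast path mentioned above, using the
fields introduced later in the series (zone_size_log2 is assumed to be
0 when the zone size is not a power of two):

    static uint32_t nvme_zone_idx(NvmeNamespace *ns, uint64_t slba)
    {
        return ns->zone_size_log2 > 0 ? slba >> ns->zone_size_log2 :
                                        slba / ns->zone_size;
    }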


The Zoned Namespace (ZNS) Command Set is a newly introduced command set
published by the NVM Express, Inc. organization as TP 4053. The main
design goals of ZNS are to provide hardware designers with the means to
reduce NVMe controller complexity and to achieve better I/O latency and
throughput. SSDs that implement this interface are commonly known as
ZNS SSDs.

This command set implements a zoned storage model, similar to ZAC/ZBC.
As such, Linux already has support for it, allowing one to perform the
majority of tasks needed for managing ZNS SSDs.

The Zoned Namespace Command Set relies on another TP, known as
Namespace Types (NVMe TP 4056), which introduces support for having
multiple command sets per namespace.

Both the ZNS and Namespace Types specifications can be downloaded by
visiting the following link -

https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip

This patch series adds Namespace Types support and zoned namespace
emulation capability to the existing NVMe PCI device.

The patchset is organized as follows -

The first several patches are preparatory and are added to allow for
an easier review of the subsequent commits. The group of patches that
follows adds NS Types support with only the NVM Command Set being
available. Finally, the last group of commits adds the definitions and
new code needed to support the Zoned Namespace Command Set.

Based-on: <20200922084533.1273962-1-its@irrelevant.dk>

Dmitry Fomichev (11):
  hw/block/nvme: Report actual LBA data shift in LBAF
  hw/block/nvme: Add Commands Supported and Effects log
  hw/block/nvme: Define trace events related to NS Types
  hw/block/nvme: Make Zoned NS Command Set definitions
  hw/block/nvme: Define Zoned NS Command Set trace events
  hw/block/nvme: Support Zoned Namespace Command Set
  hw/block/nvme: Introduce max active and open zone limits
  hw/block/nvme: Support Zone Descriptor Extensions
  hw/block/nvme: Add injection of Offline/Read-Only zones
  hw/block/nvme: Use zone metadata file for persistence
  hw/block/nvme: Document zoned parameters in usage text

Niklas Cassel (3):
  hw/block/nvme: Introduce the Namespace Types definitions
  hw/block/nvme: Add support for Namespace Types
  hw/block/nvme: Add support for active/inactive namespaces

 block/nvme.c          |    2 +-
 hw/block/nvme-ns.c    |  610 ++++++++++++++++++-
 hw/block/nvme-ns.h    |  206 +++++++
 hw/block/nvme.c       | 1332 +++++++++++++++++++++++++++++++++++++++--
 hw/block/nvme.h       |   10 +
 hw/block/trace-events |   39 ++
 include/block/nvme.h  |  210 ++++++-
 7 files changed, 2332 insertions(+), 77 deletions(-)

-- 
2.21.0




* [PATCH v5 01/14] hw/block/nvme: Report actual LBA data shift in LBAF
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  8:51   ` Klaus Jensen
  2020-09-28  2:35 ` [PATCH v5 02/14] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Calculate the data shift value to report based on the configured value
of the logical_block_size device property.

In the process, use a local variable to calculate the LBA format
index instead of the hardcoded value 0. This makes the code more
readable and will make it easier to add support for multiple LBA
formats in the future.
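
For a power-of-two block size, lbaf.ds is simply its base-2 logarithm,
which is what the clz32() expression in this patch computes:

    /* e.g. logical_block_size = 4096 yields ds = 31 - clz32(4096) = 12 */
    ns->id_ns.lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);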

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme-ns.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 2ba0263dda..bbd7879492 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -47,6 +47,8 @@ static void nvme_ns_init(NvmeNamespace *ns)
 
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 {
+    int lba_index;
+
     if (!blkconf_blocksizes(&ns->blkconf, errp)) {
         return -1;
     }
@@ -67,6 +69,9 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
         n->features.vwc = 0x1;
     }
 
+    lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+    ns->id_ns.lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);
+
     return 0;
 }
 
-- 
2.21.0




* [PATCH v5 02/14] hw/block/nvme: Add Commands Supported and Effects log
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 03/14] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Implementing this log page is necessary to allow the host to check for
Zone Append command support in the Zoned Namespace Command Set.

This commit adds the code to report this log page for the NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.
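
As a hypothetical illustration, once the zoned commands are added later
in the series, a host-side check built on the structures from this
patch could look like this (nvme_effects_log_supports_zone_append() is
a made-up name and NVME_CMD_ZONE_APPEND is defined in a later patch):

    static bool nvme_effects_log_supports_zone_append(NvmeEffectsLog *log)
    {
        /* CSUPP (bit 0) of an iocs entry marks the opcode as supported */
        return log->iocs[NVME_CMD_ZONE_APPEND] & NVME_CMD_EFFECTS_CSUPP;
    }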

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c       | 41 ++++++++++++++++++++++++++++++++++++++++-
 hw/block/trace-events |  2 ++
 include/block/nvme.h  | 19 +++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index da8344f196..1ddc7e52cc 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1301,6 +1301,43 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
+                                 uint64_t off, NvmeRequest *req)
+{
+    NvmeEffectsLog cmd_eff_log = {};
+    uint32_t *iocs = cmd_eff_log.iocs;
+    uint32_t *acs = cmd_eff_log.acs;
+    uint32_t trans_len;
+
+    trace_pci_nvme_cmd_supp_and_effects_log_read();
+
+    if (off >= sizeof(cmd_eff_log)) {
+        trace_pci_nvme_err_invalid_effects_log_offset(off);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    acs[NVME_ADM_CMD_DELETE_SQ] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_CREATE_SQ] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_DELETE_CQ] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_CREATE_CQ] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_IDENTIFY] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
+    acs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
+
+    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
+                                  NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+
+    trans_len = MIN(sizeof(cmd_eff_log) - off, buf_len);
+
+    return nvme_dma(n, ((uint8_t *)&cmd_eff_log) + off, trans_len,
+                    DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCmd *cmd = &req->cmd;
@@ -1344,6 +1381,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
         return nvme_smart_info(n, rae, len, off, req);
     case NVME_LOG_FW_SLOT_INFO:
         return nvme_fw_log_info(n, len, off, req);
+    case NVME_LOG_CMD_EFFECTS:
+        return nvme_cmd_effects(n, len, off, req);
     default:
         trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -2743,7 +2782,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->acl = 3;
     id->aerl = n->params.aerl;
     id->frmw = (NVME_NUM_FW_SLOTS << 1) | NVME_FRMW_SLOT1_RO;
-    id->lpa = NVME_LPA_EXTENDED;
+    id->lpa = NVME_LPA_CSE | NVME_LPA_EXTENDED;
 
     /* recommended default value (~70 C) */
     id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index bbe6f27367..2929a8df11 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -86,6 +86,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
 pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
+pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
@@ -104,6 +105,7 @@ pci_nvme_err_invalid_prp(void) "invalid PRP"
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset exceeds log size, got %"PRIu64""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 58647bcdad..a738c8f9ba 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -734,10 +734,27 @@ enum NvmeSmartWarn {
     NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
 };
 
+typedef struct NvmeEffectsLog {
+    uint32_t    acs[256];
+    uint32_t    iocs[256];
+    uint8_t     resv[2048];
+} NvmeEffectsLog;
+
+enum {
+    NVME_CMD_EFFECTS_CSUPP             = 1 << 0,
+    NVME_CMD_EFFECTS_LBCC              = 1 << 1,
+    NVME_CMD_EFFECTS_NCC               = 1 << 2,
+    NVME_CMD_EFFECTS_NIC               = 1 << 3,
+    NVME_CMD_EFFECTS_CCC               = 1 << 4,
+    NVME_CMD_EFFECTS_CSE_MASK          = 3 << 16,
+    NVME_CMD_EFFECTS_UUID_SEL          = 1 << 19,
+};
+
 enum NvmeLogIdentifier {
     NVME_LOG_ERROR_INFO     = 0x01,
     NVME_LOG_SMART_INFO     = 0x02,
     NVME_LOG_FW_SLOT_INFO   = 0x03,
+    NVME_LOG_CMD_EFFECTS    = 0x05,
 };
 
 typedef struct QEMU_PACKED NvmePSD {
@@ -849,6 +866,7 @@ enum NvmeIdCtrlFrmw {
 };
 
 enum NvmeIdCtrlLpa {
+    NVME_LPA_CSE      = 1 << 1,
     NVME_LPA_EXTENDED = 1 << 2,
 };
 
@@ -1048,6 +1066,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
-- 
2.21.0




* [PATCH v5 03/14] hw/block/nvme: Introduce the Namespace Types definitions
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 02/14] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-30  8:08   ` Klaus Jensen
  2020-09-30 15:21   ` Keith Busch
  2020-09-28  2:35 ` [PATCH v5 04/14] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

Define the structures and constants required to implement
Namespace Types support.
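
For illustration, the new CAP.CSS bits are meant to be OR-ed together
when the controller advertises support for more than one command set;
a later patch in this series does exactly that:

    NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));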

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.h   |  2 ++
 hw/block/nvme.c      |  2 +-
 include/block/nvme.h | 74 +++++++++++++++++++++++++++++++++++---------
 3 files changed, 63 insertions(+), 15 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606..cca23bc0b3 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -21,6 +21,8 @@
 
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
+    uint8_t  csi;
+    QemuUUID uuid;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1ddc7e52cc..29fa005fa2 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1598,7 +1598,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
      * here.
      */
     ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
-    ns_descrs->uuid.hdr.nidl = NVME_NIDT_UUID_LEN;
+    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
     stl_be_p(&ns_descrs->uuid.v, nsid);
 
     return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
diff --git a/include/block/nvme.h b/include/block/nvme.h
index a738c8f9ba..4587311783 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -51,6 +51,11 @@ enum NvmeCapMask {
     CAP_PMR_MASK       = 0x1,
 };
 
+enum NvmeCapCssBits {
+    CAP_CSS_NVM        = 0x01,
+    CAP_CSS_CSI_SUPP   = 0x40,
+};
+
 #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
 #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)    & CAP_CQR_MASK)
 #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)    & CAP_AMS_MASK)
@@ -102,6 +107,12 @@ enum NvmeCcMask {
     CC_IOCQES_MASK  = 0xf,
 };
 
+enum NvmeCcCss {
+    CSS_NVM_ONLY        = 0,
+    CSS_CSI             = 6,
+    CSS_ADMIN_ONLY      = 7,
+};
+
 #define NVME_CC_EN(cc)     ((cc >> CC_EN_SHIFT)     & CC_EN_MASK)
 #define NVME_CC_CSS(cc)    ((cc >> CC_CSS_SHIFT)    & CC_CSS_MASK)
 #define NVME_CC_MPS(cc)    ((cc >> CC_MPS_SHIFT)    & CC_MPS_MASK)
@@ -110,6 +121,21 @@ enum NvmeCcMask {
 #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
 #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
 
+#define NVME_SET_CC_EN(cc, val)     \
+    (cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
+#define NVME_SET_CC_CSS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
+#define NVME_SET_CC_MPS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
+#define NVME_SET_CC_AMS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
+#define NVME_SET_CC_SHN(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
+#define NVME_SET_CC_IOSQES(cc, val) \
+    (cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
+#define NVME_SET_CC_IOCQES(cc, val) \
+    (cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
+
 enum NvmeCstsShift {
     CSTS_RDY_SHIFT      = 0,
     CSTS_CFS_SHIFT      = 1,
@@ -524,8 +550,13 @@ typedef struct QEMU_PACKED NvmeIdentify {
     uint64_t    rsvd2[2];
     uint64_t    prp1;
     uint64_t    prp2;
-    uint32_t    cns;
-    uint32_t    rsvd11[5];
+    uint8_t     cns;
+    uint8_t     rsvd10;
+    uint16_t    ctrlid;
+    uint16_t    nvmsetid;
+    uint8_t     rsvd11;
+    uint8_t     csi;
+    uint32_t    rsvd12[4];
 } NvmeIdentify;
 
 typedef struct QEMU_PACKED NvmeRwCmd {
@@ -645,6 +676,7 @@ enum NvmeStatusCodes {
     NVME_MD_SGL_LEN_INVALID     = 0x0010,
     NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
+    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
@@ -771,11 +803,15 @@ typedef struct QEMU_PACKED NvmePSD {
 
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
-enum {
-    NVME_ID_CNS_NS             = 0x0,
-    NVME_ID_CNS_CTRL           = 0x1,
-    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
-    NVME_ID_CNS_NS_DESCR_LIST  = 0x3,
+enum NvmeIdCns {
+    NVME_ID_CNS_NS                = 0x00,
+    NVME_ID_CNS_CTRL              = 0x01,
+    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x02,
+    NVME_ID_CNS_NS_DESCR_LIST     = 0x03,
+    NVME_ID_CNS_CS_NS             = 0x05,
+    NVME_ID_CNS_CS_CTRL           = 0x06,
+    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
+    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
@@ -922,6 +958,7 @@ enum NvmeFeatureIds {
     NVME_WRITE_ATOMICITY            = 0xa,
     NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
     NVME_TIMESTAMP                  = 0xe,
+    NVME_COMMAND_SET_PROFILE        = 0x19,
     NVME_SOFTWARE_PROGRESS_MARKER   = 0x80,
     NVME_FID_MAX                    = 0x100,
 };
@@ -1006,18 +1043,26 @@ typedef struct QEMU_PACKED NvmeIdNsDescr {
     uint8_t rsvd2[2];
 } NvmeIdNsDescr;
 
-enum {
-    NVME_NIDT_EUI64_LEN =  8,
-    NVME_NIDT_NGUID_LEN = 16,
-    NVME_NIDT_UUID_LEN  = 16,
+enum NvmeNsIdentifierLength {
+    NVME_NIDL_EUI64             = 8,
+    NVME_NIDL_NGUID             = 16,
+    NVME_NIDL_UUID              = 16,
+    NVME_NIDL_CSI               = 1,
 };
 
 enum NvmeNsIdentifierType {
-    NVME_NIDT_EUI64 = 0x1,
-    NVME_NIDT_NGUID = 0x2,
-    NVME_NIDT_UUID  = 0x3,
+    NVME_NIDT_EUI64             = 0x01,
+    NVME_NIDT_NGUID             = 0x02,
+    NVME_NIDT_UUID              = 0x03,
+    NVME_NIDT_CSI               = 0x04,
 };
 
+enum NvmeCsi {
+    NVME_CSI_NVM                = 0x00,
+};
+
+#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
+
 /*Deallocate Logical Block Features*/
 #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
 #define NVME_ID_NS_DLFEAT_WRITE_ZEROES(dlfeat)    ((dlfeat) & 0x08)
@@ -1068,6 +1113,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
-- 
2.21.0




* [PATCH v5 04/14] hw/block/nvme: Define trace events related to NS Types
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (2 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 03/14] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

A few trace events are defined that are relevant to implementing
Namespace Types (NVMe TP 4056).

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/trace-events | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 2929a8df11..b93429b04c 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -49,8 +49,12 @@ pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size,
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
 pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, cqid=%"PRIu16""
 pci_nvme_identify_ctrl(void) "identify controller"
+pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_nslist_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8""
+pci_nvme_identify_cmd_set(void) "identify i/o command set"
 pci_nvme_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
 pci_nvme_getfeat(uint16_t cid, uint8_t fid, uint8_t sel, uint32_t cdw11) "cid %"PRIu16" fid 0x%"PRIx8" sel 0x%"PRIx8" cdw11 0x%"PRIx32""
@@ -87,6 +91,8 @@ pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
+pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
+pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
@@ -106,6 +112,9 @@ pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset exceeds log size, got %"PRIu64""
+pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
+pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
+pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
@@ -161,6 +170,7 @@ pci_nvme_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for
 pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
+pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
 
 # xen-block.c
 xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
-- 
2.21.0




* [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (3 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 04/14] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-30  8:15   ` Klaus Jensen
                     ` (3 more replies)
  2020-09-28  2:35 ` [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces Dmitry Fomichev
                   ` (8 subsequent siblings)
  13 siblings, 4 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for the command set and enable
detection for the NVM Command Set.

The new workflows for identify commands rely heavily on zero-filled
identify structs. E.g., certain CNS commands are defined to return
a zero-filled identify struct when an inactive namespace NSID
is supplied.

Add a helper function in order to avoid code duplication when
reporting zero-filled identify structures.
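
For example, a host selects the new CNS values by filling in the cns
and csi fields of the Identify command structure defined earlier in
the series (the values below are illustrative):

    NvmeIdentify c = {
        .opcode = NVME_ADM_CMD_IDENTIFY,
        .nsid   = cpu_to_le32(1),
        .cns    = NVME_ID_CNS_CS_NS,   /* CS-specific Identify Namespace */
        .csi    = NVME_CSI_NVM,
    };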

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.c |   3 +
 hw/block/nvme.c    | 210 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 175 insertions(+), 38 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index bbd7879492..31b7f986c3 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -40,6 +40,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
 
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
 
+    ns->params.csi = NVME_CSI_NVM;
+    qemu_uuid_generate(&ns->params.uuid); /* TODO make UUIDs persistent */
+
     /* no thin provisioning */
     id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 29fa005fa2..4ec1ddc90a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1495,6 +1495,13 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
+{
+    uint8_t id[NVME_IDENTIFY_DATA_SIZE] = {};
+
+    return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_pci_nvme_identify_ctrl();
@@ -1503,11 +1510,23 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+
+    trace_pci_nvme_identify_ctrl_csi(c->csi);
+
+    if (c->csi == NVME_CSI_NVM) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
+    return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
-    NvmeIdNs *id_ns, inactive = { 0 };
     uint32_t nsid = le32_to_cpu(c->nsid);
 
     trace_pci_nvme_identify_ns(nsid);
@@ -1518,23 +1537,46 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
 
     ns = nvme_ns(n, nsid);
     if (unlikely(!ns)) {
-        id_ns = &inactive;
-    } else {
-        id_ns = &ns->id_ns;
+        return nvme_rpt_empty_id_struct(n, req);
     }
 
-    return nvme_dma(n, (uint8_t *)id_ns, sizeof(NvmeIdNs),
+    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeNamespace *ns;
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    uint32_t nsid = le32_to_cpu(c->nsid);
+
+    trace_pci_nvme_identify_ns_csi(nsid, c->csi);
+
+    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
+        return NVME_INVALID_NSID | NVME_DNR;
+    }
+
+    ns = nvme_ns(n, nsid);
+    if (unlikely(!ns)) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
+    if (c->csi == NVME_CSI_NVM) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
+    return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
 {
+    NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
-    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
     uint32_t min_nsid = le32_to_cpu(c->nsid);
-    uint32_t *list;
-    uint16_t ret;
-    int j = 0;
+    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
+    static const int data_len = sizeof(list);
+    uint32_t *list_ptr = (uint32_t *)list;
+    int i, j = 0;
 
     trace_pci_nvme_identify_nslist(min_nsid);
 
@@ -1548,48 +1590,76 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
-    list = g_malloc0(data_len);
-    for (int i = 1; i <= n->num_namespaces; i++) {
-        if (i <= min_nsid || !nvme_ns(n, i)) {
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
             continue;
         }
-        list[j++] = cpu_to_le32(i);
+        if (ns->params.nsid < min_nsid) {
+            continue;
+        }
+        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
         if (j == data_len / sizeof(uint32_t)) {
             break;
         }
     }
-    ret = nvme_dma(n, (uint8_t *)list, data_len, DMA_DIRECTION_FROM_DEVICE,
-                   req);
-    g_free(list);
-    return ret;
+
+    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
+}
+
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeNamespace *ns;
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    uint32_t min_nsid = le32_to_cpu(c->nsid);
+    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
+    static const int data_len = sizeof(list);
+    uint32_t *list_ptr = (uint32_t *)list;
+    int i, j = 0;
+
+    trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
+
+    if (c->csi != NVME_CSI_NVM) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+        if (ns->params.nsid < min_nsid) {
+            continue;
+        }
+        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
+        if (j == data_len / sizeof(uint32_t)) {
+            break;
+        }
+    }
+
+    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    NvmeNamespace *ns;
     uint32_t nsid = le32_to_cpu(c->nsid);
-    uint8_t list[NVME_IDENTIFY_DATA_SIZE];
-
-    struct data {
-        struct {
-            NvmeIdNsDescr hdr;
-            uint8_t v[16];
-        } uuid;
-    };
-
-    struct data *ns_descrs = (struct data *)list;
+    NvmeIdNsDescr *desc;
+    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
+    static const int data_len = sizeof(list);
+    void *list_ptr = list;
 
     trace_pci_nvme_identify_ns_descr_list(nsid);
 
-    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
-        return NVME_INVALID_NSID | NVME_DNR;
-    }
-
     if (unlikely(!nvme_ns(n, nsid))) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    memset(list, 0x0, sizeof(list));
+    ns = nvme_ns(n, nsid);
+    if (unlikely(!ns)) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
 
     /*
      * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
@@ -1597,12 +1667,31 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
      * Namespace Identification Descriptor. Add a very basic Namespace UUID
      * here.
      */
-    ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
-    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
-    stl_be_p(&ns_descrs->uuid.v, nsid);
+    desc = list_ptr;
+    desc->nidt = NVME_NIDT_UUID;
+    desc->nidl = NVME_NIDL_UUID;
+    list_ptr += sizeof(*desc);
+    memcpy(list_ptr, ns->params.uuid.data, NVME_NIDL_UUID);
+    list_ptr += NVME_NIDL_UUID;
 
-    return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
-                    DMA_DIRECTION_FROM_DEVICE, req);
+    desc = list_ptr;
+    desc->nidt = NVME_NIDT_CSI;
+    desc->nidl = NVME_NIDL_CSI;
+    list_ptr += sizeof(*desc);
+    *(uint8_t *)list_ptr = NVME_CSI_NVM;
+
+    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
+}
+
+static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
+{
+    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
+    static const int data_len = sizeof(list);
+
+    trace_pci_nvme_identify_cmd_set();
+
+    NVME_SET_CSI(*list, NVME_CSI_NVM);
+    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
@@ -1612,12 +1701,20 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
     switch (le32_to_cpu(c->cns)) {
     case NVME_ID_CNS_NS:
         return nvme_identify_ns(n, req);
+    case NVME_ID_CNS_CS_NS:
+        return nvme_identify_ns_csi(n, req);
     case NVME_ID_CNS_CTRL:
         return nvme_identify_ctrl(n, req);
+    case NVME_ID_CNS_CS_CTRL:
+        return nvme_identify_ctrl_csi(n, req);
     case NVME_ID_CNS_NS_ACTIVE_LIST:
         return nvme_identify_nslist(n, req);
+    case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
+        return nvme_identify_nslist_csi(n, req);
     case NVME_ID_CNS_NS_DESCR_LIST:
         return nvme_identify_ns_descr_list(n, req);
+    case NVME_ID_CNS_IO_COMMAND_SET:
+        return nvme_identify_cmd_set(n, req);
     default:
         trace_pci_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1799,6 +1896,9 @@ defaults:
             result |= NVME_INTVC_NOCOALESCING;
         }
 
+        break;
+    case NVME_COMMAND_SET_PROFILE:
+        result = 0;
         break;
     default:
         result = nvme_feature_default[fid];
@@ -1939,6 +2039,12 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, req);
+    case NVME_COMMAND_SET_PROFILE:
+        if (dw11 & 0x1ff) {
+            trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
+            return NVME_CMD_SET_CMB_REJECTED | NVME_DNR;
+        }
+        break;
     default:
         return NVME_FEAT_NOT_CHANGEABLE | NVME_DNR;
     }
@@ -2222,6 +2328,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
         break;
     case 0x14:  /* CC */
         trace_pci_nvme_mmio_cfg(data & 0xffffffff);
+
+        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
+            if (NVME_CC_EN(n->bar.cc)) {
+                NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
+                               "changing selected command set when enabled");
+            } else {
+                switch (NVME_CC_CSS(data)) {
+                case CSS_NVM_ONLY:
+                    trace_pci_nvme_css_nvm_cset_selected_by_host(data &
+                                                                 0xffffffff);
+                    break;
+                case CSS_CSI:
+                    NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
+                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
+                    break;
+                case CSS_ADMIN_ONLY:
+                    break;
+                default:
+                    NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
+                                   "unknown value in CC.CSS field");
+                }
+            }
+        }
+
         /* Windows first sends data, then sends enable bit */
         if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
             !NVME_CC_SHN(data) && !NVME_CC_SHN(n->bar.cc))
@@ -2810,7 +2940,11 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
     NVME_CAP_SET_CQR(n->bar.cap, 1);
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
-    NVME_CAP_SET_CSS(n->bar.cap, 1);
+    /*
+     * The device now always supports NS Types, but all commands
+     * that support the CSI field will only handle the NVM Command Set.
+     */
+    NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
     n->bar.vs = NVME_SPEC_VER;
-- 
2.21.0




* [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (4 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-30 13:50   ` Niklas Cassel
  2020-09-28  2:35 ` [PATCH v5 07/14] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

In NVMe, a namespace is active if it exists and is attached to the
controller.

CAP.CSS (together with the I/O Command Set data structure) defines what
command sets are supported by the controller.

CC.CSS (together with Set Profile) can be set to enable a subset of the
available command sets. The namespaces belonging to a disabled command set
will not be able to attach to the controller, and will thus be inactive.

E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
marked as inactive.

The Identify Namespace, the CSI-specific Identify Namespace, and the
namespace list commands each have two versions: one that only shows
active namespaces, and another that shows all existing namespaces,
regardless of whether a namespace is attached or not.

Add an attached member to struct NvmeNamespaceParams, and implement the
missing CNS commands.

The added functionality will also simplify the implementation of namespace
management in the future, since namespace management can also attach and
detach namespaces.
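
A condensed sketch of the attachment rule that this patch implements
in nvme_start_ctrl() (the helper name is made up for illustration):

    static bool nvme_ns_csi_attachable(NvmeCtrl *n, NvmeNamespace *ns)
    {
        uint8_t css = NVME_CC_CSS(n->bar.cc);

        /* NVM namespaces attach unless the host selected Admin Only */
        return ns->params.csi == NVME_CSI_NVM &&
               (css == CSS_NVM_ONLY || css == CSS_CSI);
    }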

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.h   |  1 +
 hw/block/nvme.c      | 60 ++++++++++++++++++++++++++++++++++++++------
 include/block/nvme.h | 20 +++++++++------
 3 files changed, 65 insertions(+), 16 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index cca23bc0b3..acdb76f058 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -22,6 +22,7 @@
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
     uint8_t  csi;
+    bool     attached;
     QemuUUID uuid;
 } NvmeNamespaceParams;
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4ec1ddc90a..63ad03d6d6 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1523,7 +1523,8 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
     return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
+                                 bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1540,11 +1541,16 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
         return nvme_rpt_empty_id_struct(n, req);
     }
 
+    if (only_active && !ns->params.attached) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
     return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
+                                     bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1561,6 +1567,10 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
         return nvme_rpt_empty_id_struct(n, req);
     }
 
+    if (only_active && !ns->params.attached) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
     if (c->csi == NVME_CSI_NVM) {
         return nvme_rpt_empty_id_struct(n, req);
     }
@@ -1568,7 +1578,8 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
     return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req,
+                                     bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1598,6 +1609,9 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
         if (ns->params.nsid < min_nsid) {
             continue;
         }
+        if (only_active && !ns->params.attached) {
+            continue;
+        }
         list_ptr[j++] = cpu_to_le32(ns->params.nsid);
         if (j == data_len / sizeof(uint32_t)) {
             break;
@@ -1607,7 +1621,8 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
     return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
+                                         bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1631,6 +1646,9 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
         if (ns->params.nsid < min_nsid) {
             continue;
         }
+        if (only_active && !ns->params.attached) {
+            continue;
+        }
         list_ptr[j++] = cpu_to_le32(ns->params.nsid);
         if (j == data_len / sizeof(uint32_t)) {
             break;
@@ -1700,17 +1718,25 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
 
     switch (le32_to_cpu(c->cns)) {
     case NVME_ID_CNS_NS:
-        return nvme_identify_ns(n, req);
+        return nvme_identify_ns(n, req, true);
     case NVME_ID_CNS_CS_NS:
-        return nvme_identify_ns_csi(n, req);
+        return nvme_identify_ns_csi(n, req, true);
+    case NVME_ID_CNS_NS_PRESENT:
+        return nvme_identify_ns(n, req, false);
+    case NVME_ID_CNS_CS_NS_PRESENT:
+        return nvme_identify_ns_csi(n, req, false);
     case NVME_ID_CNS_CTRL:
         return nvme_identify_ctrl(n, req);
     case NVME_ID_CNS_CS_CTRL:
         return nvme_identify_ctrl_csi(n, req);
     case NVME_ID_CNS_NS_ACTIVE_LIST:
-        return nvme_identify_nslist(n, req);
+        return nvme_identify_nslist(n, req, true);
     case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
-        return nvme_identify_nslist_csi(n, req);
+        return nvme_identify_nslist_csi(n, req, true);
+    case NVME_ID_CNS_NS_PRESENT_LIST:
+        return nvme_identify_nslist(n, req, false);
+    case NVME_ID_CNS_CS_NS_PRESENT_LIST:
+        return nvme_identify_nslist_csi(n, req, false);
     case NVME_ID_CNS_NS_DESCR_LIST:
         return nvme_identify_ns_descr_list(n, req);
     case NVME_ID_CNS_IO_COMMAND_SET:
@@ -2188,8 +2214,10 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
 static int nvme_start_ctrl(NvmeCtrl *n)
 {
+    NvmeNamespace *ns;
     uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
     uint32_t page_size = 1 << page_bits;
+    int i;
 
     if (unlikely(n->cq[0])) {
         trace_pci_nvme_err_startfail_cq();
@@ -2276,6 +2304,22 @@ static int nvme_start_ctrl(NvmeCtrl *n)
     nvme_init_sq(&n->admin_sq, n, n->bar.asq, 0, 0,
                  NVME_AQA_ASQS(n->bar.aqa) + 1);
 
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+        ns->params.attached = false;
+        switch (ns->params.csi) {
+        case NVME_CSI_NVM:
+            if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
+                NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
+                ns->params.attached = true;
+            }
+            break;
+        }
+    }
+
     nvme_set_timestamp(n, 0ULL);
 
     QTAILQ_INIT(&n->aer_queue);
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 4587311783..b182fe40b2 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -804,14 +804,18 @@ typedef struct QEMU_PACKED NvmePSD {
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
 enum NvmeIdCns {
-    NVME_ID_CNS_NS                = 0x00,
-    NVME_ID_CNS_CTRL              = 0x01,
-    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x02,
-    NVME_ID_CNS_NS_DESCR_LIST     = 0x03,
-    NVME_ID_CNS_CS_NS             = 0x05,
-    NVME_ID_CNS_CS_CTRL           = 0x06,
-    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
-    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
+    NVME_ID_CNS_NS                    = 0x00,
+    NVME_ID_CNS_CTRL                  = 0x01,
+    NVME_ID_CNS_NS_ACTIVE_LIST        = 0x02,
+    NVME_ID_CNS_NS_DESCR_LIST         = 0x03,
+    NVME_ID_CNS_CS_NS                 = 0x05,
+    NVME_ID_CNS_CS_CTRL               = 0x06,
+    NVME_ID_CNS_CS_NS_ACTIVE_LIST     = 0x07,
+    NVME_ID_CNS_NS_PRESENT_LIST       = 0x10,
+    NVME_ID_CNS_NS_PRESENT            = 0x11,
+    NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
+    NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
+    NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
-- 
2.21.0




* [PATCH v5 07/14] hw/block/nvme: Make Zoned NS Command Set definitions
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (5 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 08/14] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Define the values and structures that are needed to support the Zoned
Namespace Command Set (NVMe TP 4053) in the PCI NVMe controller
emulator.

All new protocol definitions are located in include/block/nvme.h
and everything added that is specific to this implementation is kept
in hw/block/nvme.h.

In order to improve scalability, all open, closed and full zones
are organized into separate linked lists. Consequently, almost all
zone operations do not require scanning the entire zone array
(which can potentially be quite large) - it is only necessary to
enumerate one or more zone lists. Zone lists are designed to be
position-independent so that they can be persisted to the backing file
as a part of the zone metadata. The NvmeZoneList struct defined in this
patch serves as the head of every zone list.

The NvmeZone structure encapsulates the zone descriptor (NvmeZoneDescr)
defined in the Zoned Command Set specification and adds a few more
fields that are internal to this implementation.
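
Since the links are array indices rather than pointers, a list can be
mapped back from the metadata file at any virtual address and walked
with the helpers added by this patch; a hypothetical traversal:

    static uint32_t nvme_count_zones_in_list(NvmeNamespace *ns,
                                             NvmeZoneList *zl)
    {
        NvmeZone *zone;
        uint32_t count = 0;

        for (zone = nvme_peek_zone_head(ns, zl); zone;
             zone = nvme_next_zone_in_list(ns, zone, zl)) {
            count++;
        }
        return count;
    }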

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.h   | 114 +++++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme.h      |  10 ++++
 include/block/nvme.h | 107 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index acdb76f058..04172f083e 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -19,11 +19,33 @@
 #define NVME_NS(obj) \
     OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
 
+typedef struct NvmeZone {
+    NvmeZoneDescr   d;
+    uint64_t        w_ptr;
+    uint32_t        next;
+    uint32_t        prev;
+    uint8_t         rsvd80[8];
+} NvmeZone;
+
+#define NVME_ZONE_LIST_NIL    UINT_MAX
+
+typedef struct NvmeZoneList {
+    uint32_t        head;
+    uint32_t        tail;
+    uint32_t        size;
+    uint8_t         rsvd12[4];
+} NvmeZoneList;
+
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
     uint8_t  csi;
     bool     attached;
     QemuUUID uuid;
+
+    bool     zoned;
+    bool     cross_zone_read;
+    uint64_t zone_size_mb;
+    uint64_t zone_capacity_mb;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -33,6 +55,18 @@ typedef struct NvmeNamespace {
     int64_t      size;
     NvmeIdNs     id_ns;
 
+    NvmeIdNsZoned   *id_ns_zoned;
+    NvmeZone        *zone_array;
+    NvmeZoneList    *exp_open_zones;
+    NvmeZoneList    *imp_open_zones;
+    NvmeZoneList    *closed_zones;
+    NvmeZoneList    *full_zones;
+    uint32_t        num_zones;
+    uint64_t        zone_size;
+    uint64_t        zone_capacity;
+    uint64_t        zone_array_size;
+    uint32_t        zone_size_log2;
+
     NvmeNamespaceParams params;
 } NvmeNamespace;
 
@@ -74,4 +108,84 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_flush(NvmeNamespace *ns);
 
+static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
+{
+    return zone->d.zs >> 4;
+}
+
+static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState state)
+{
+    zone->d.zs = state << 4;
+}
+
+static inline uint64_t nvme_zone_rd_boundary(NvmeNamespace *ns, NvmeZone *zone)
+{
+    return zone->d.zslba + ns->zone_size;
+}
+
+static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
+{
+    return zone->d.zslba + zone->d.zcap;
+}
+
+static inline bool nvme_wp_is_valid(NvmeZone *zone)
+{
+    uint8_t st = nvme_get_zone_state(zone);
+
+    return st != NVME_ZONE_STATE_FULL &&
+           st != NVME_ZONE_STATE_READ_ONLY &&
+           st != NVME_ZONE_STATE_OFFLINE;
+}
+
+/*
+ * Initialize a zone list head.
+ */
+static inline void nvme_init_zone_list(NvmeZoneList *zl)
+{
+    zl->head = NVME_ZONE_LIST_NIL;
+    zl->tail = NVME_ZONE_LIST_NIL;
+    zl->size = 0;
+}
+
+/*
+ * Return the number of entries contained in a zone list.
+ */
+static inline uint32_t nvme_zone_list_size(NvmeZoneList *zl)
+{
+    return zl->size;
+}
+
+/*
+ * Check if the zone is not currently included in any zone list.
+ */
+static inline bool nvme_zone_not_in_list(NvmeZone *zone)
+{
+    return (bool)(zone->prev == 0 && zone->next == 0);
+}
+
+/*
+ * Return the zone at the head of the zone list or NULL if the list is empty.
+ */
+static inline NvmeZone *nvme_peek_zone_head(NvmeNamespace *ns, NvmeZoneList *zl)
+{
+    if (zl->head == NVME_ZONE_LIST_NIL) {
+        return NULL;
+    }
+    return &ns->zone_array[zl->head];
+}
+
+/*
+ * Return the next zone in the list.
+ */
+static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
+                                               NvmeZoneList *zl)
+{
+    assert(!nvme_zone_not_in_list(z));
+
+    if (z->next == NVME_ZONE_LIST_NIL) {
+        return NULL;
+    }
+    return &ns->zone_array[z->next];
+}
+
 #endif /* NVME_NS_H */
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a..f09e741d9a 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -6,6 +6,9 @@
 
 #define NVME_MAX_NAMESPACES 256
 
+#define NVME_DEFAULT_ZONE_SIZE   128 /* MiB */
+#define NVME_DEFAULT_MAX_ZA_SIZE 128 /* KiB */
+
 typedef struct NvmeParams {
     char     *serial;
     uint32_t num_queues; /* deprecated since 5.1 */
@@ -16,6 +19,8 @@ typedef struct NvmeParams {
     uint32_t aer_max_queued;
     uint8_t  mdts;
     bool     use_intel_id;
+    uint8_t  fill_pattern;
+    uint32_t zasl_kb;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -28,6 +33,8 @@ typedef struct NvmeRequest {
     struct NvmeNamespace    *ns;
     BlockAIOCB              *aiocb;
     uint16_t                status;
+    int64_t                 fill_ofs;
+    uint32_t                fill_len;
     NvmeCqe                 cqe;
     NvmeCmd                 cmd;
     BlockAcctCookie         acct;
@@ -147,6 +154,9 @@ typedef struct NvmeCtrl {
     QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
     int         aer_queued;
 
+    uint32_t    zasl_bs;
+    uint8_t     zasl;
+
     NvmeNamespace   namespace;
     NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES];
     NvmeSQueue      **sq;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index b182fe40b2..a7126e123f 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -488,6 +488,9 @@ enum NvmeIoCommands {
     NVME_CMD_COMPARE            = 0x05,
     NVME_CMD_WRITE_ZEROES       = 0x08,
     NVME_CMD_DSM                = 0x09,
+    NVME_CMD_ZONE_MGMT_SEND     = 0x79,
+    NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
+    NVME_CMD_ZONE_APPEND        = 0x7d,
 };
 
 typedef struct QEMU_PACKED NvmeDeleteQ {
@@ -677,6 +680,7 @@ enum NvmeStatusCodes {
     NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
     NVME_CMD_SET_CMB_REJECTED   = 0x002b,
+    NVME_INVALID_CMD_SET        = 0x002c,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
@@ -701,6 +705,14 @@ enum NvmeStatusCodes {
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
+    NVME_ZONE_BOUNDARY_ERROR    = 0x01b8,
+    NVME_ZONE_FULL              = 0x01b9,
+    NVME_ZONE_READ_ONLY         = 0x01ba,
+    NVME_ZONE_OFFLINE           = 0x01bb,
+    NVME_ZONE_INVALID_WRITE     = 0x01bc,
+    NVME_ZONE_TOO_MANY_ACTIVE   = 0x01bd,
+    NVME_ZONE_TOO_MANY_OPEN     = 0x01be,
+    NVME_ZONE_INVAL_TRANSITION  = 0x01bf,
     NVME_WRITE_FAULT            = 0x0280,
     NVME_UNRECOVERED_READ       = 0x0281,
     NVME_E2E_GUARD_ERROR        = 0x0282,
@@ -885,6 +897,11 @@ typedef struct QEMU_PACKED NvmeIdCtrl {
     uint8_t     vs[1024];
 } NvmeIdCtrl;
 
+typedef struct NvmeIdCtrlZoned {
+    uint8_t     zasl;
+    uint8_t     rsvd1[4095];
+} NvmeIdCtrlZoned;
+
 enum NvmeIdCtrlOacs {
     NVME_OACS_SECURITY  = 1 << 0,
     NVME_OACS_FORMAT    = 1 << 1,
@@ -1009,6 +1026,12 @@ typedef struct QEMU_PACKED NvmeLBAF {
     uint8_t     rp;
 } NvmeLBAF;
 
+typedef struct QEMU_PACKED NvmeLBAFE {
+    uint64_t    zsze;
+    uint8_t     zdes;
+    uint8_t     rsvd9[7];
+} NvmeLBAFE;
+
 #define NVME_NSID_BROADCAST 0xffffffff
 
 typedef struct QEMU_PACKED NvmeIdNs {
@@ -1063,10 +1086,24 @@ enum NvmeNsIdentifierType {
 
 enum NvmeCsi {
     NVME_CSI_NVM                = 0x00,
+    NVME_CSI_ZONED              = 0x02,
 };
 
 #define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
 
+typedef struct QEMU_PACKED NvmeIdNsZoned {
+    uint16_t    zoc;
+    uint16_t    ozcs;
+    uint32_t    mar;
+    uint32_t    mor;
+    uint32_t    rrl;
+    uint32_t    frl;
+    uint8_t     rsvd20[2796];
+    NvmeLBAFE   lbafe[16];
+    uint8_t     rsvd3072[768];
+    uint8_t     vs[256];
+} NvmeIdNsZoned;
+
 /*Deallocate Logical Block Features*/
 #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
 #define NVME_ID_NS_DLFEAT_WRITE_ZEROES(dlfeat)    ((dlfeat) & 0x08)
@@ -1098,6 +1135,71 @@ enum NvmeIdNsDps {
     DPS_FIRST_EIGHT = 8,
 };
 
+enum NvmeZoneAttr {
+    NVME_ZA_FINISHED_BY_CTLR         = 1 << 0,
+    NVME_ZA_FINISH_RECOMMENDED       = 1 << 1,
+    NVME_ZA_RESET_RECOMMENDED        = 1 << 2,
+    NVME_ZA_ZD_EXT_VALID             = 1 << 7,
+};
+
+typedef struct QEMU_PACKED NvmeZoneReportHeader {
+    uint64_t    nr_zones;
+    uint8_t     rsvd[56];
+} NvmeZoneReportHeader;
+
+enum NvmeZoneReceiveAction {
+    NVME_ZONE_REPORT                 = 0,
+    NVME_ZONE_REPORT_EXTENDED        = 1,
+};
+
+enum NvmeZoneReportType {
+    NVME_ZONE_REPORT_ALL             = 0,
+    NVME_ZONE_REPORT_EMPTY           = 1,
+    NVME_ZONE_REPORT_IMPLICITLY_OPEN = 2,
+    NVME_ZONE_REPORT_EXPLICITLY_OPEN = 3,
+    NVME_ZONE_REPORT_CLOSED          = 4,
+    NVME_ZONE_REPORT_FULL            = 5,
+    NVME_ZONE_REPORT_READ_ONLY       = 6,
+    NVME_ZONE_REPORT_OFFLINE         = 7,
+};
+
+enum NvmeZoneType {
+    NVME_ZONE_TYPE_RESERVED          = 0x00,
+    NVME_ZONE_TYPE_SEQ_WRITE         = 0x02,
+};
+
+enum NvmeZoneSendAction {
+    NVME_ZONE_ACTION_RSD             = 0x00,
+    NVME_ZONE_ACTION_CLOSE           = 0x01,
+    NVME_ZONE_ACTION_FINISH          = 0x02,
+    NVME_ZONE_ACTION_OPEN            = 0x03,
+    NVME_ZONE_ACTION_RESET           = 0x04,
+    NVME_ZONE_ACTION_OFFLINE         = 0x05,
+    NVME_ZONE_ACTION_SET_ZD_EXT      = 0x10,
+};
+
+typedef struct QEMU_PACKED NvmeZoneDescr {
+    uint8_t     zt;
+    uint8_t     zs;
+    uint8_t     za;
+    uint8_t     rsvd3[5];
+    uint64_t    zcap;
+    uint64_t    zslba;
+    uint64_t    wp;
+    uint8_t     rsvd32[32];
+} NvmeZoneDescr;
+
+enum NvmeZoneState {
+    NVME_ZONE_STATE_RESERVED         = 0x00,
+    NVME_ZONE_STATE_EMPTY            = 0x01,
+    NVME_ZONE_STATE_IMPLICITLY_OPEN  = 0x02,
+    NVME_ZONE_STATE_EXPLICITLY_OPEN  = 0x03,
+    NVME_ZONE_STATE_CLOSED           = 0x04,
+    NVME_ZONE_STATE_READ_ONLY        = 0x0D,
+    NVME_ZONE_STATE_FULL             = 0x0E,
+    NVME_ZONE_STATE_OFFLINE          = 0x0F,
+};
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
@@ -1117,9 +1219,14 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrlZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAF) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAFE) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
 }
 #endif
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 46+ messages in thread
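
For review convenience, the index-linked zone lists declared in
nvme-ns.h above can be illustrated with a small standalone sketch.
This is illustrative C only, not code from the patch: the struct and
function names are invented, and the unlinked-zone convention is
simplified to use the NIL sentinel throughout (the patch itself marks
unlinked zones with prev == next == 0). The point of index links
rather than pointers is that they stay valid when the zone array is
persisted to, and later mapped back from, a zone metadata file.

#include <stdint.h>

#define ZNIL UINT32_MAX            /* "no zone", like NVME_ZONE_LIST_NIL */

struct zone {
    uint32_t next, prev;           /* indices into the flat zone array */
};

struct zone_list {
    uint32_t head, tail;           /* indices into the flat zone array */
    uint32_t size;
};

/* Append zone number idx to the tail of the list. */
static void zone_list_add_tail(struct zone *zones, struct zone_list *zl,
                               uint32_t idx)
{
    if (zl->size == 0) {
        zl->head = zl->tail = idx;
        zones[idx].prev = zones[idx].next = ZNIL;
    } else {
        zones[zl->tail].next = idx;    /* old tail links forward */
        zones[idx].prev = zl->tail;
        zones[idx].next = ZNIL;
        zl->tail = idx;
    }
    zl->size++;
}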

* [PATCH v5 08/14] hw/block/nvme: Define Zoned NS Command Set trace events
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (6 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 07/14] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

The Zoned Namespace Command Set / Namespace Types implementation
introduced in this series adds a good number of trace events. All
tracepoint definitions are combined into this separate patch to make
reviewing more convenient.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/trace-events | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index b93429b04c..386f28e457 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -93,6 +93,17 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
 pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
 pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
+pci_nvme_open_zone(uint64_t slba, uint32_t zone_idx, int all) "open zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_close_zone(uint64_t slba, uint32_t zone_idx, int all) "close zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_finish_zone(uint64_t slba, uint32_t zone_idx, int all) "finish zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_reset_zone(uint64_t slba, uint32_t zone_idx, int all) "reset zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_offline_zone(uint64_t slba, uint32_t zone_idx, int all) "offline zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_set_descriptor_extension(uint64_t slba, uint32_t zone_idx) "set zone descriptor extension, slba=%"PRIu64", idx=%"PRIu32""
+pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_idx=%"PRIu32""
+pci_nvme_power_on_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
+pci_nvme_power_on_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
+pci_nvme_power_on_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Full state"
+pci_nvme_mapped_zone_file(char *zfile_name, int ret) "mapped zone file %s, error %d"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
@@ -111,9 +122,23 @@ pci_nvme_err_invalid_prp(void) "invalid PRP"
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_unaligned_zone_cmd(uint8_t action, uint64_t slba, uint64_t zslba) "unaligned zone op 0x%"PRIx8", got slba=%"PRIu64", zslba=%"PRIu64""
+pci_nvme_err_invalid_zone_state_transition(uint8_t state, uint8_t action, uint64_t slba, uint8_t attrs) "0x%"PRIx8"->0x%"PRIx8", slba=%"PRIu64", attrs=0x%"PRIx8""
+pci_nvme_err_write_not_at_wp(uint64_t slba, uint64_t zone, uint64_t wp) "writing at slba=%"PRIu64", zone=%"PRIu64", but wp=%"PRIu64""
+pci_nvme_err_append_not_at_start(uint64_t slba, uint64_t zone) "appending at slba=%"PRIu64", but zone starts at %"PRIu64""
+pci_nvme_err_zone_write_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx32""
+pci_nvme_err_zone_read_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx32""
+pci_nvme_err_append_too_large(uint64_t slba, uint32_t nlb, uint8_t zasl) "slba=%"PRIu64", nlb=%"PRIu32", zasl=%"PRIu8""
+pci_nvme_err_insuff_active_res(uint32_t max_active) "max_active=%"PRIu32" zone limit exceeded"
+pci_nvme_err_insuff_open_res(uint32_t max_open) "max_open=%"PRIu32" zone limit exceeded"
+pci_nvme_err_zone_file_invalid(int error) "validation error=%"PRIi32""
+pci_nvme_err_zd_extension_map_error(uint32_t zone_idx) "can't map descriptor extension for zone_idx=%"PRIu32""
+pci_nvme_err_invalid_changed_zone_list_offset(uint64_t ofs) "changed zone list log offset must be 0, got %"PRIu64""
+pci_nvme_err_invalid_changed_zone_list_len(uint32_t len) "changed zone list log size is 4096, got %"PRIu32""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
 pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
+pci_nvme_err_only_zoned_cmd_set_avail(void) "setting 001b CC.CSS, but only ZONED+NVM command set is enabled"
 pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
@@ -147,6 +172,7 @@ pci_nvme_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_
 pci_nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
 pci_nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
 pci_nvme_err_startfail(void) "setting controller enable bit failed"
+pci_nvme_err_invalid_mgmt_action(uint8_t action) "action=0x%"PRIx8""
 
 # Traces for undefined behavior
 pci_nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (7 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 08/14] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  6:44   ` Klaus Jensen
                     ` (4 more replies)
  2020-09-28  2:35 ` [PATCH v5 10/14] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
                   ` (4 subsequent siblings)
  13 siblings, 5 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

The emulation code has been changed to advertise the NVM Command Set
when the "zoned" device property is not set (the default) and the
Zoned Namespace Command Set otherwise. The namespace attachment rule
that results is sketched below.
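
Condensed into a single predicate, the attachment rule restates the
nvme_start_ctrl() changes in this patch (the helper name is invented
for illustration; the constants are spelled out so the sketch stands
alone, with values per the NVMe spec's CC.CSS field and the CSI enum
added by the series):

#include <stdbool.h>
#include <stdint.h>

#define CSS_NVM_ONLY    0      /* 000b: NVM command set only */
#define CSS_CSI         6      /* 110b: all supported I/O command sets */
#define NVME_CSI_NVM    0x00
#define NVME_CSI_ZONED  0x02

static bool nvme_ns_attachable(uint8_t cc_css, uint8_t csi)
{
    switch (cc_css) {
    case CSS_NVM_ONLY:  /* only NVM namespaces attach */
        return csi == NVME_CSI_NVM;
    case CSS_CSI:       /* both NVM and zoned namespaces attach */
        return csi == NVME_CSI_NVM || csi == NVME_CSI_ZONED;
    default:            /* admin-only: no I/O command set */
        return false;
    }
}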

Handlers are added for the three new NVMe commands introduced in the
Zoned Namespace Command Set specification: Zone Management Receive,
Zone Management Send and Zone Append.

The device initialization code has been extended to build a proper
zoned configuration from the new device properties; the resulting
zone geometry calculation is sketched below.
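
The essence of that calculation, condensed from
nvme_calc_zone_geometry() in this patch (the helper name is invented;
MiB and DIV_ROUND_UP are spelled out here so the sketch stands alone,
matching the QEMU macros of the same names):

#include <stdint.h>

#define MiB (1ULL << 20)
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* The zone_size and zone_capacity properties are given in MiB;
 * zone_capacity defaults to the zone size when left at 0. */
static void calc_zone_geometry(uint64_t ns_bytes, uint32_t lba_bytes,
                               uint64_t zone_size_mb, uint64_t zone_cap_mb,
                               uint64_t *zsze, uint64_t *zcap, uint32_t *nz)
{
    *zsze = zone_size_mb * MiB / lba_bytes;     /* zone size in LBAs */
    *zcap = zone_cap_mb ? zone_cap_mb * MiB / lba_bytes : *zsze;
    /* enough zones to cover the namespace; the last may be smaller */
    *nz = DIV_ROUND_UP(ns_bytes / lba_bytes, *zsze);
}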

The Read/Write command handler is modified to only allow writes at
the write pointer if the namespace is zoned. For the Zone Append
command, writes implicitly happen at the write pointer and the
starting write pointer value is returned as the result of the
command; this contract is sketched below. The Write Zeroes handler is
modified to add zoned checks identical to those done as part of the
Write flow.
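
A minimal sketch of that contract (illustrative only; the helper and
struct names are invented, and the real logic is split across
nvme_check_zone_write() and nvme_advance_zone_wp() in this patch):

#include <stdbool.h>
#include <stdint.h>

struct zone { uint64_t w_ptr; };   /* stand-in for NvmeZone */

/* A regular Write must land exactly at the write pointer, while Zone
 * Append lets the device pick the location and report it back (the
 * patch additionally requires Append to address the zone start LBA,
 * omitted here for brevity). */
static int submit_zoned_write(struct zone *zone, uint64_t slba,
                              uint32_t nlb, bool append,
                              uint64_t *result_slba)
{
    if (append) {
        slba = zone->w_ptr;        /* device chooses the location */
    } else if (slba != zone->w_ptr) {
        return -1;                 /* -> Zone Invalid Write status */
    }
    *result_slba = slba;           /* reported back in cqe.result64 */
    zone->w_ptr += nlb;            /* advanced at submission time */
    return 0;
}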

Support for Zone Descriptor Extensions is not included in this commit,
and ZDES 0 is always reported. A later commit in this series adds ZDE
support.

This commit doesn't yet include checks for active and open zone
limits. It is assumed that there are no limits on either active or
open zones.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 block/nvme.c         |   2 +-
 hw/block/nvme-ns.c   | 185 ++++++++-
 hw/block/nvme-ns.h   |   6 +-
 hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
 include/block/nvme.h |   6 +-
 5 files changed, 1033 insertions(+), 38 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 05485fdd11..7a513c9a17 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -333,7 +333,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
 {
     uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
     if (status) {
-        trace_nvme_error(le32_to_cpu(c->result),
+        trace_nvme_error(le32_to_cpu(c->result32),
                          le16_to_cpu(c->sq_head),
                          le16_to_cpu(c->sq_id),
                          le16_to_cpu(c->cid),
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 31b7f986c3..6d9dc9205b 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -33,14 +33,14 @@ static void nvme_ns_init(NvmeNamespace *ns)
     NvmeIdNs *id_ns = &ns->id_ns;
 
     if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
-        ns->id_ns.dlfeat = 0x9;
+        ns->id_ns.dlfeat = 0x8;
     }
 
     id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
 
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
 
-    ns->params.csi = NVME_CSI_NVM;
+    ns->csi = NVME_CSI_NVM;
     qemu_uuid_generate(&ns->params.uuid); /* TODO make UUIDs persistent */
 
     /* no thin provisioning */
@@ -73,7 +73,162 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     }
 
     lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
-    ns->id_ns.lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);
+    ns->id_ns.lbaf[lba_index].ds = 31 - clz32(ns->blkconf.logical_block_size);
+
+    return 0;
+}
+
+/*
+ * Add a zone to the tail of a zone list.
+ */
+void nvme_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
+{
+    uint32_t idx = (uint32_t)(zone - ns->zone_array);
+
+    assert(nvme_zone_not_in_list(zone));
+
+    if (!zl->size) {
+        zl->head = zl->tail = idx;
+        zone->next = zone->prev = NVME_ZONE_LIST_NIL;
+    } else {
+        ns->zone_array[zl->tail].next = idx;
+        zone->prev = zl->tail;
+        zone->next = NVME_ZONE_LIST_NIL;
+        zl->tail = idx;
+    }
+    zl->size++;
+}
+
+/*
+ * Remove a zone from a zone list. The zone must be linked in the list.
+ */
+void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
+{
+    uint32_t idx = (uint32_t)(zone - ns->zone_array);
+
+    assert(!nvme_zone_not_in_list(zone));
+
+    --zl->size;
+    if (zl->size == 0) {
+        zl->head = NVME_ZONE_LIST_NIL;
+        zl->tail = NVME_ZONE_LIST_NIL;
+    } else if (idx == zl->head) {
+        zl->head = zone->next;
+        ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+    } else if (idx == zl->tail) {
+        zl->tail = zone->prev;
+        ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
+    } else {
+        ns->zone_array[zone->next].prev = zone->prev;
+        ns->zone_array[zone->prev].next = zone->next;
+    }
+
+    zone->prev = zone->next = 0;
+}
+
+static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
+{
+    uint64_t zone_size, zone_cap;
+    uint32_t nz, lbasz = ns->blkconf.logical_block_size;
+
+    if (ns->params.zone_size_mb) {
+        zone_size = ns->params.zone_size_mb;
+    } else {
+        zone_size = NVME_DEFAULT_ZONE_SIZE;
+    }
+    if (ns->params.zone_capacity_mb) {
+        zone_cap = ns->params.zone_capacity_mb;
+    } else {
+        zone_cap = zone_size;
+    }
+    ns->zone_size = zone_size * MiB / lbasz;
+    ns->zone_capacity = zone_cap * MiB / lbasz;
+    if (ns->zone_capacity > ns->zone_size) {
+        error_setg(errp, "zone capacity exceeds zone size");
+        return -1;
+    }
+
+    nz = DIV_ROUND_UP(ns->size / lbasz, ns->zone_size);
+    ns->num_zones = nz;
+    ns->zone_array_size = sizeof(NvmeZone) * nz;
+    ns->zone_size_log2 = 0;
+    if (is_power_of_2(ns->zone_size)) {
+        ns->zone_size_log2 = 63 - clz64(ns->zone_size);
+    }
+
+    return 0;
+}
+
+static void nvme_init_zone_meta(NvmeNamespace *ns)
+{
+    uint64_t start = 0, zone_size = ns->zone_size;
+    uint64_t capacity = ns->num_zones * zone_size;
+    NvmeZone *zone;
+    int i;
+
+    ns->zone_array = g_malloc0(ns->zone_array_size);
+    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+
+    nvme_init_zone_list(ns->exp_open_zones);
+    nvme_init_zone_list(ns->imp_open_zones);
+    nvme_init_zone_list(ns->closed_zones);
+    nvme_init_zone_list(ns->full_zones);
+
+    zone = ns->zone_array;
+    for (i = 0; i < ns->num_zones; i++, zone++) {
+        if (start + zone_size > capacity) {
+            zone_size = capacity - start;
+        }
+        zone->d.zt = NVME_ZONE_TYPE_SEQ_WRITE;
+        nvme_set_zone_state(zone, NVME_ZONE_STATE_EMPTY);
+        zone->d.za = 0;
+        zone->d.zcap = ns->zone_capacity;
+        zone->d.zslba = start;
+        zone->d.wp = start;
+        zone->w_ptr = start;
+        zone->prev = 0;
+        zone->next = 0;
+        start += zone_size;
+    }
+}
+
+static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
+                              Error **errp)
+{
+    NvmeIdNsZoned *id_ns_z;
+
+    if (n->params.fill_pattern == 0) {
+        ns->id_ns.dlfeat |= 0x01;
+    } else if (n->params.fill_pattern == 0xff) {
+        ns->id_ns.dlfeat |= 0x02;
+    }
+
+    if (nvme_calc_zone_geometry(ns, errp) != 0) {
+        return -1;
+    }
+
+    nvme_init_zone_meta(ns);
+
+    id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
+
+    /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
+    id_ns_z->mar = 0xffffffff;
+    id_ns_z->mor = 0xffffffff;
+    id_ns_z->zoc = 0;
+    id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
+
+    id_ns_z->lbafe[lba_index].zsze = cpu_to_le64(ns->zone_size);
+    id_ns_z->lbafe[lba_index].zdes = 0; /* FIXME make helper */
+
+    ns->csi = NVME_CSI_ZONED;
+    ns->id_ns.ncap = cpu_to_le64(ns->zone_capacity * ns->num_zones);
+    ns->id_ns.nuse = ns->id_ns.ncap;
+    ns->id_ns.nsze = ns->id_ns.ncap;
+
+    ns->id_ns_zoned = id_ns_z;
 
     return 0;
 }
@@ -103,6 +258,12 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
+    if (ns->params.zoned) {
+        if (nvme_zoned_init_ns(n, ns, 0, errp) != 0) {
+            return -1;
+        }
+    }
+
     return 0;
 }
 
@@ -116,6 +277,16 @@ void nvme_ns_flush(NvmeNamespace *ns)
     blk_flush(ns->blkconf.blk);
 }
 
+void nvme_ns_cleanup(NvmeNamespace *ns)
+{
+    g_free(ns->id_ns_zoned);
+    g_free(ns->zone_array);
+    g_free(ns->exp_open_zones);
+    g_free(ns->imp_open_zones);
+    g_free(ns->closed_zones);
+    g_free(ns->full_zones);
+}
+
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
     NvmeNamespace *ns = NVME_NS(dev);
@@ -133,6 +304,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 static Property nvme_ns_props[] = {
     DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
     DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
+
+    DEFINE_PROP_BOOL("zoned", NvmeNamespace, params.zoned, false),
+    DEFINE_PROP_UINT64("zone_size", NvmeNamespace, params.zone_size_mb,
+                       NVME_DEFAULT_ZONE_SIZE),
+    DEFINE_PROP_UINT64("zone_capacity", NvmeNamespace,
+                       params.zone_capacity_mb, 0),
+    DEFINE_PROP_BOOL("cross_zone_read", NvmeNamespace,
+                      params.cross_zone_read, false),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 04172f083e..daa13546c4 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -38,7 +38,6 @@ typedef struct NvmeZoneList {
 
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
-    uint8_t  csi;
     bool     attached;
     QemuUUID uuid;
 
@@ -52,6 +51,7 @@ typedef struct NvmeNamespace {
     DeviceState  parent_obj;
     BlockConf    blkconf;
     int32_t      bootindex;
+    uint8_t      csi;
     int64_t      size;
     NvmeIdNs     id_ns;
 
@@ -107,6 +107,7 @@ typedef struct NvmeCtrl NvmeCtrl;
 int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_flush(NvmeNamespace *ns);
+void nvme_ns_cleanup(NvmeNamespace *ns);
 
 static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
 {
@@ -188,4 +189,7 @@ static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
     return &ns->zone_array[z->next];
 }
 
+void nvme_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone);
+void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone);
+
 #endif /* NVME_NS_H */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 63ad03d6d6..38e25a4d1f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -54,6 +54,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qemu/error-report.h"
+#include "crypto/random.h"
 #include "hw/block/block.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
@@ -127,6 +128,46 @@ static uint16_t nvme_sqid(NvmeRequest *req)
     return le16_to_cpu(req->sq->sqid);
 }
 
+static void nvme_assign_zone_state(NvmeNamespace *ns, NvmeZone *zone,
+                                   uint8_t state)
+{
+    if (!nvme_zone_not_in_list(zone)) {
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_remove_zone(ns, ns->exp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            nvme_remove_zone(ns, ns->imp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_remove_zone(ns, ns->closed_zones, zone);
+            break;
+        case NVME_ZONE_STATE_FULL:
+            nvme_remove_zone(ns, ns->full_zones, zone);
+        }
+    }
+
+    nvme_set_zone_state(zone, state);
+
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        nvme_add_zone_tail(ns, ns->exp_open_zones, zone);
+        break;
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_add_zone_tail(ns, ns->imp_open_zones, zone);
+        break;
+    case NVME_ZONE_STATE_CLOSED:
+        nvme_add_zone_tail(ns, ns->closed_zones, zone);
+        break;
+    case NVME_ZONE_STATE_FULL:
+        nvme_add_zone_tail(ns, ns->full_zones, zone);
+        break;
+    case NVME_ZONE_STATE_READ_ONLY:
+        break;
+    default:
+        zone->d.za = 0;
+    }
+}
+
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
     hwaddr low = n->ctrl_mem.addr;
@@ -813,7 +854,7 @@ static void nvme_process_aers(void *opaque)
 
         req = n->aer_reqs[n->outstanding_aers];
 
-        result = (NvmeAerResult *) &req->cqe.result;
+        result = (NvmeAerResult *) &req->cqe.result32;
         result->event_type = event->result.event_type;
         result->event_info = event->result.event_info;
         result->log_page = event->result.log_page;
@@ -882,6 +923,200 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
     return NVME_SUCCESS;
 }
 
+static void nvme_fill_data(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t offset,
+                           uint32_t max_len, uint8_t pattern)
+{
+    ScatterGatherEntry *entry;
+    uint32_t len, ent_len;
+
+    if (qsg->nsg > 0) {
+        entry = qsg->sg;
+        len = qsg->size;
+        if (max_len) {
+            len = MIN(len, max_len);
+        }
+        for (; len > 0; len -= ent_len) {
+            ent_len = MIN(len, entry->len);
+            if (offset > ent_len) {
+                offset -= ent_len;
+            } else if (offset != 0) {
+                dma_memory_set(qsg->as, entry->base + offset,
+                               pattern, ent_len - offset);
+                offset = 0;
+            } else {
+                dma_memory_set(qsg->as, entry->base, pattern, ent_len);
+            }
+            entry++;
+        }
+    } else if (iov->iov) {
+        len = iov_size(iov->iov, iov->niov);
+        if (max_len) {
+            len = MIN(len, max_len);
+        }
+        qemu_iovec_memset(iov, offset, pattern, len - offset);
+    }
+}
+
+static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
+                                      uint32_t nlb)
+{
+    uint16_t status;
+
+    if (unlikely((slba + nlb) > nvme_zone_wr_boundary(zone))) {
+        return NVME_ZONE_BOUNDARY_ERROR;
+    }
+
+    switch (nvme_get_zone_state(zone)) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+        status = NVME_SUCCESS;
+        break;
+    case NVME_ZONE_STATE_FULL:
+        status = NVME_ZONE_FULL;
+        break;
+    case NVME_ZONE_STATE_OFFLINE:
+        status = NVME_ZONE_OFFLINE;
+        break;
+    case NVME_ZONE_STATE_READ_ONLY:
+        status = NVME_ZONE_READ_ONLY;
+        break;
+    default:
+        assert(false);
+    }
+    return status;
+}
+
+static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
+                                     uint64_t slba, uint32_t nlb)
+{
+    uint64_t lba = slba, count;
+    uint16_t status;
+    uint8_t zs;
+
+    do {
+        if (!ns->params.cross_zone_read &&
+            (lba + nlb > nvme_zone_rd_boundary(ns, zone))) {
+            return NVME_ZONE_BOUNDARY_ERROR | NVME_DNR;
+        }
+
+        zs = nvme_get_zone_state(zone);
+        switch (zs) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        case NVME_ZONE_STATE_FULL:
+        case NVME_ZONE_STATE_CLOSED:
+        case NVME_ZONE_STATE_READ_ONLY:
+            status = NVME_SUCCESS;
+            break;
+        case NVME_ZONE_STATE_OFFLINE:
+            status = NVME_ZONE_OFFLINE | NVME_DNR;
+            break;
+        default:
+            assert(false);
+        }
+        if (status != NVME_SUCCESS) {
+            break;
+        }
+
+        if (lba + nlb > nvme_zone_rd_boundary(ns, zone)) {
+            count = nvme_zone_rd_boundary(ns, zone) - lba;
+        } else {
+            count = nlb;
+        }
+
+        lba += count;
+        nlb -= count;
+        zone++;
+    } while (nlb);
+
+    return status;
+}
+
+static inline uint32_t nvme_zone_idx(NvmeNamespace *ns, uint64_t slba)
+{
+    return ns->zone_size_log2 > 0 ? slba >> ns->zone_size_log2 :
+                                    slba / ns->zone_size;
+}
+
+static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
+                                      bool failed)
+{
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    NvmeZone *zone;
+    uint64_t slba, start_wp = req->cqe.result64;
+    uint32_t nlb, zone_idx;
+    uint8_t zs;
+
+    if (rw->opcode != NVME_CMD_WRITE &&
+        rw->opcode != NVME_CMD_ZONE_APPEND &&
+        rw->opcode != NVME_CMD_WRITE_ZEROES) {
+        return false;
+    }
+
+    slba = le64_to_cpu(rw->slba);
+    nlb = le16_to_cpu(rw->nlb) + 1;
+    zone_idx = nvme_zone_idx(ns, slba);
+    assert(zone_idx < ns->num_zones);
+    zone = &ns->zone_array[zone_idx];
+
+    if (!failed && zone->w_ptr < start_wp + nlb) {
+        /*
+         * A preceding queued write to the zone has failed,
+         * now this write is not at the WP, fail it too.
+         */
+        failed = true;
+    }
+
+    if (failed) {
+        if (zone->w_ptr > start_wp) {
+            zone->w_ptr = start_wp;
+        }
+        req->cqe.result64 = 0;
+    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
+        zs = nvme_get_zone_state(zone);
+        switch (zs) {
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        case NVME_ZONE_STATE_CLOSED:
+        case NVME_ZONE_STATE_EMPTY:
+            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
+            /* fall through */
+        case NVME_ZONE_STATE_FULL:
+            break;
+        default:
+            assert(false);
+        }
+        zone->d.wp = zone->w_ptr;
+    } else {
+        zone->d.wp += nlb;
+    }
+
+    return failed;
+}
+
+static uint64_t nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
+                                     uint32_t nlb)
+{
+    uint64_t result = zone->w_ptr;
+    uint8_t zs;
+
+    zone->w_ptr += nlb;
+
+    if (zone->w_ptr < nvme_zone_wr_boundary(zone)) {
+        zs = nvme_get_zone_state(zone);
+        switch (zs) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_IMPLICITLY_OPEN);
+        }
+    }
+
+    return result;
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
@@ -896,10 +1131,27 @@ static void nvme_rw_cb(void *opaque, int ret)
     trace_pci_nvme_rw_cb(nvme_cid(req), blk_name(blk));
 
     if (!ret) {
-        block_acct_done(stats, acct);
+        if (ns->params.zoned) {
+            if (nvme_finalize_zoned_write(ns, req, false)) {
+                ret = EIO;
+                block_acct_failed(stats, acct);
+                req->status = NVME_ZONE_INVALID_WRITE;
+            } else if (req->fill_ofs >= 0) {
+                nvme_fill_data(&req->qsg, &req->iov, req->fill_ofs,
+                               req->fill_len,
+                               nvme_ctrl(req)->params.fill_pattern);
+            }
+        }
+        if (!ret) {
+            block_acct_done(stats, acct);
+        }
     } else {
         uint16_t status;
 
+        if (ns->params.zoned) {
+            nvme_finalize_zoned_write(ns, req, true);
+        }
+
         block_acct_failed(stats, acct);
 
         switch (req->cmd.opcode) {
@@ -953,6 +1205,7 @@ static uint16_t nvme_do_aio(BlockBackend *blk, int64_t offset, size_t len,
         break;
 
     case NVME_CMD_WRITE:
+    case NVME_CMD_ZONE_APPEND:
         is_write = true;
 
         /* fallthrough */
@@ -997,8 +1250,10 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
     NvmeNamespace *ns = req->ns;
     uint64_t slba = le64_to_cpu(rw->slba);
     uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
+    NvmeZone *zone = NULL;
     uint64_t offset = nvme_l2b(ns, slba);
     uint32_t count = nvme_l2b(ns, nlb);
+    uint32_t zone_idx;
     uint16_t status;
 
     trace_pci_nvme_write_zeroes(nvme_cid(req), nvme_nsid(ns), slba, nlb);
@@ -1009,20 +1264,43 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
         return status;
     }
 
+    if (ns->params.zoned) {
+        zone_idx = nvme_zone_idx(ns, slba);
+        assert(zone_idx < ns->num_zones);
+        zone = &ns->zone_array[zone_idx];
+
+        status = nvme_check_zone_write(zone, slba, nlb);
+        if (status != NVME_SUCCESS) {
+            trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
+            return status | NVME_DNR;
+        }
+
+        assert(nvme_wp_is_valid(zone));
+        if (unlikely(slba != zone->w_ptr)) {
+            trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
+                                               zone->w_ptr);
+            return NVME_ZONE_INVALID_WRITE | NVME_DNR;
+        }
+
+        req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
+    }
+
     return nvme_do_aio(ns->blkconf.blk, offset, count, req);
 }
 
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
-    uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
+    uint32_t nlb  = (uint32_t)le16_to_cpu(rw->nlb) + 1;
     uint64_t slba = le64_to_cpu(rw->slba);
-
     uint64_t data_size = nvme_l2b(ns, nlb);
-    uint64_t data_offset = nvme_l2b(ns, slba);
-    enum BlockAcctType acct = req->cmd.opcode == NVME_CMD_WRITE ?
-        BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
+    uint64_t data_offset;
+
+    NvmeZone *zone = NULL;
+    uint32_t zone_idx = 0;
+    bool is_write = rw->opcode == NVME_CMD_WRITE || append;
+    enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
     uint16_t status;
 
     trace_pci_nvme_rw(nvme_cid(req), nvme_io_opc_str(rw->opcode),
@@ -1040,18 +1318,468 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
         goto invalid;
     }
 
+    if (ns->params.zoned) {
+        zone_idx = nvme_zone_idx(ns, slba);
+        assert(zone_idx < ns->num_zones);
+        zone = &ns->zone_array[zone_idx];
+
+        if (is_write) {
+            status = nvme_check_zone_write(zone, slba, nlb);
+            if (status != NVME_SUCCESS) {
+                trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
+                goto invalid;
+            }
+
+            assert(nvme_wp_is_valid(zone));
+            if (append) {
+                if (unlikely(slba != zone->d.zslba)) {
+                    trace_pci_nvme_err_append_not_at_start(slba, zone->d.zslba);
+                    status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
+                    goto invalid;
+                }
+                if (data_size > (n->page_size << n->zasl)) {
+                    trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
+                    status = NVME_INVALID_FIELD | NVME_DNR;
+                    goto invalid;
+                }
+                slba = zone->w_ptr;
+            } else if (unlikely(slba != zone->w_ptr)) {
+                trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
+                                                   zone->w_ptr);
+                status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
+                goto invalid;
+            }
+            req->fill_ofs = -1LL;
+        } else {
+            status = nvme_check_zone_read(ns, zone, slba, nlb);
+            if (status != NVME_SUCCESS) {
+                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
+                goto invalid;
+            }
+
+            if (slba + nlb > zone->w_ptr) {
+                /*
+                 * All or some data is read above the WP. Need to
+                 * fill out the buffer area that has no backing data
+                 * with a predefined data pattern (zeros by default)
+                 */
+                if (slba >= zone->w_ptr) {
+                    req->fill_ofs = 0;
+                } else {
+                    req->fill_ofs = nvme_l2b(ns, zone->w_ptr - slba);
+                }
+                req->fill_len = nvme_l2b(ns,
+                    nvme_zone_rd_boundary(ns, zone) - slba);
+            } else {
+                req->fill_ofs = -1LL;
+            }
+        }
+    } else if (append) {
+        trace_pci_nvme_err_invalid_opc(rw->opcode);
+        status = NVME_INVALID_OPCODE | NVME_DNR;
+        goto invalid;
+    }
+
     status = nvme_map_dptr(n, data_size, req);
     if (status) {
         goto invalid;
     }
 
+    if (ns->params.zoned) {
+        if (unlikely(req->fill_ofs == 0 &&
+            slba + nlb <= nvme_zone_rd_boundary(ns, zone))) {
+            /* No backend I/O necessary, only need to fill the buffer */
+            nvme_fill_data(&req->qsg, &req->iov, 0, 0, n->params.fill_pattern);
+            req->status = NVME_SUCCESS;
+            return NVME_SUCCESS;
+        }
+        if (is_write) {
+            req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
+        }
+    }
+
+    data_offset = nvme_l2b(ns, slba);
+
     return nvme_do_aio(ns->blkconf.blk, data_offset, data_size, req);
 
 invalid:
     block_acct_invalid(blk_get_stats(ns->blkconf.blk), acct);
+    return status | NVME_DNR;
+}
+
+static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeNamespace *ns, NvmeCmd *c,
+                                            uint64_t *slba, uint32_t *zone_idx)
+{
+    uint32_t dw10 = le32_to_cpu(c->cdw10);
+    uint32_t dw11 = le32_to_cpu(c->cdw11);
+
+    if (!ns->params.zoned) {
+        trace_pci_nvme_err_invalid_opc(c->opcode);
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
+    *slba = ((uint64_t)dw11) << 32 | dw10;
+    if (unlikely(*slba >= ns->id_ns.nsze)) {
+        trace_pci_nvme_err_invalid_lba_range(*slba, 0, ns->id_ns.nsze);
+        *slba = 0;
+        return NVME_LBA_RANGE | NVME_DNR;
+    }
+
+    *zone_idx = nvme_zone_idx(ns, *slba);
+    assert(*zone_idx < ns->num_zones);
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_open_zone(NvmeNamespace *ns, NvmeZone *zone,
+                               uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
+        /* fall through */
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_open_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_CLOSED;
+}
+
+static uint16_t nvme_close_zone(NvmeNamespace *ns, NvmeZone *zone,
+                                uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
+        /* fall through */
+    case NVME_ZONE_STATE_CLOSED:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_close_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN;
+}
+
+static uint16_t nvme_finish_zone(NvmeNamespace *ns, NvmeZone *zone,
+                                 uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_EMPTY:
+        zone->w_ptr = nvme_zone_wr_boundary(zone);
+        zone->d.wp = zone->w_ptr;
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
+        /* fall through */
+    case NVME_ZONE_STATE_FULL:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_finish_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_CLOSED;
+}
+
+static uint16_t nvme_reset_zone(NvmeNamespace *ns, NvmeZone *zone,
+                                uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_FULL:
+        zone->w_ptr = zone->d.zslba;
+        zone->d.wp = zone->w_ptr;
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_EMPTY);
+        /* fall through */
+    case NVME_ZONE_STATE_EMPTY:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_reset_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_CLOSED ||
+           state == NVME_ZONE_STATE_FULL;
+}
+
+static uint16_t nvme_offline_zone(NvmeNamespace *ns, NvmeZone *zone,
+                                  uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_READ_ONLY:
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_OFFLINE);
+        /* fall through */
+    case NVME_ZONE_STATE_OFFLINE:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_offline_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_READ_ONLY;
+}
+
+typedef uint16_t (*op_handler_t)(NvmeNamespace *, NvmeZone *,
+                                 uint8_t);
+typedef bool (*need_to_proc_zone_t)(uint8_t);
+
+static uint16_t nvme_do_zone_op(NvmeNamespace *ns, NvmeZone *zone,
+                                uint8_t state, bool all,
+                                op_handler_t op_hndlr,
+                                need_to_proc_zone_t proc_zone)
+{
+    int i;
+    uint16_t status = 0;
+
+    if (!all) {
+        status = op_hndlr(ns, zone, state);
+    } else {
+        for (i = 0; i < ns->num_zones; i++, zone++) {
+            state = nvme_get_zone_state(zone);
+            if (proc_zone(state)) {
+                status = op_hndlr(ns, zone, state);
+                if (status != NVME_SUCCESS) {
+                    break;
+                }
+            }
+        }
+    }
+
     return status;
 }
 
+static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
+    NvmeNamespace *ns = req->ns;
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint64_t slba = 0;
+    uint32_t zone_idx = 0;
+    uint16_t status;
+    uint8_t action, state;
+    bool all;
+    NvmeZone *zone;
+
+    action = dw13 & 0xff;
+    all = dw13 & 0x100;
+
+    req->status = NVME_SUCCESS;
+
+    if (!all) {
+        status = nvme_get_mgmt_zone_slba_idx(ns, cmd, &slba, &zone_idx);
+        if (status) {
+            return status;
+        }
+    }
+
+    zone = &ns->zone_array[zone_idx];
+    if (slba != zone->d.zslba) {
+        trace_pci_nvme_err_unaligned_zone_cmd(action, slba, zone->d.zslba);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+    state = nvme_get_zone_state(zone);
+
+    switch (action) {
+
+    case NVME_ZONE_ACTION_OPEN:
+        trace_pci_nvme_open_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(ns, zone, state, all,
+                                 nvme_open_zone, nvme_cond_open_all);
+        break;
+
+    case NVME_ZONE_ACTION_CLOSE:
+        trace_pci_nvme_close_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(ns, zone, state, all,
+                                 nvme_close_zone, nvme_cond_close_all);
+        break;
+
+    case NVME_ZONE_ACTION_FINISH:
+        trace_pci_nvme_finish_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(ns, zone, state, all,
+                                 nvme_finish_zone, nvme_cond_finish_all);
+        break;
+
+    case NVME_ZONE_ACTION_RESET:
+        trace_pci_nvme_reset_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(ns, zone, state, all,
+                                 nvme_reset_zone, nvme_cond_reset_all);
+        break;
+
+    case NVME_ZONE_ACTION_OFFLINE:
+        trace_pci_nvme_offline_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(ns, zone, state, all,
+                                 nvme_offline_zone, nvme_cond_offline_all);
+        break;
+
+    case NVME_ZONE_ACTION_SET_ZD_EXT:
+        trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
+        return NVME_INVALID_FIELD | NVME_DNR;
+
+    default:
+        trace_pci_nvme_err_invalid_mgmt_action(action);
+        status = NVME_INVALID_FIELD;
+    }
+
+    if (status == NVME_ZONE_INVAL_TRANSITION) {
+        trace_pci_nvme_err_invalid_zone_state_transition(state, action, slba,
+                                                         zone->d.za);
+    }
+    if (status) {
+        status |= NVME_DNR;
+    }
+
+    return status;
+}
+
+static bool nvme_zone_matches_filter(uint32_t zafs, NvmeZone *zl)
+{
+    int zs = nvme_get_zone_state(zl);
+
+    switch (zafs) {
+    case NVME_ZONE_REPORT_ALL:
+        return true;
+    case NVME_ZONE_REPORT_EMPTY:
+        return zs == NVME_ZONE_STATE_EMPTY;
+    case NVME_ZONE_REPORT_IMPLICITLY_OPEN:
+        return zs == NVME_ZONE_STATE_IMPLICITLY_OPEN;
+    case NVME_ZONE_REPORT_EXPLICITLY_OPEN:
+        return zs == NVME_ZONE_STATE_EXPLICITLY_OPEN;
+    case NVME_ZONE_REPORT_CLOSED:
+        return zs == NVME_ZONE_STATE_CLOSED;
+    case NVME_ZONE_REPORT_FULL:
+        return zs == NVME_ZONE_STATE_FULL;
+    case NVME_ZONE_REPORT_READ_ONLY:
+        return zs == NVME_ZONE_STATE_READ_ONLY;
+    case NVME_ZONE_REPORT_OFFLINE:
+        return zs == NVME_ZONE_STATE_OFFLINE;
+    default:
+        return false;
+    }
+}
+
+static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
+    NvmeNamespace *ns = req->ns;
+    /* cdw12 is zero-based number of dwords to return. Convert to bytes */
+    uint32_t len = (le32_to_cpu(cmd->cdw12) + 1) << 2;
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint32_t zone_idx, zra, zrasf, partial;
+    uint64_t max_zones, nr_zones = 0;
+    uint16_t ret;
+    uint64_t slba;
+    NvmeZoneDescr *z;
+    NvmeZone *zs;
+    NvmeZoneReportHeader *header;
+    void *buf, *buf_p;
+    size_t zone_entry_sz;
+
+    req->status = NVME_SUCCESS;
+
+    ret = nvme_get_mgmt_zone_slba_idx(ns, cmd, &slba, &zone_idx);
+    if (ret) {
+        return ret;
+    }
+
+    if (len < sizeof(NvmeZoneReportHeader)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zra = dw13 & 0xff;
+    if (!(zra == NVME_ZONE_REPORT || zra == NVME_ZONE_REPORT_EXTENDED)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zrasf = (dw13 >> 8) & 0xff;
+    if (zrasf > NVME_ZONE_REPORT_OFFLINE) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    partial = (dw13 >> 16) & 0x01;
+
+    zone_entry_sz = sizeof(NvmeZoneDescr);
+
+    max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
+    buf = g_malloc0(len);
+
+    header = (NvmeZoneReportHeader *)buf;
+    buf_p = buf + sizeof(NvmeZoneReportHeader);
+
+    while (zone_idx < ns->num_zones && nr_zones < max_zones) {
+        zs = &ns->zone_array[zone_idx];
+
+        if (!nvme_zone_matches_filter(zrasf, zs)) {
+            zone_idx++;
+            continue;
+        }
+
+        z = (NvmeZoneDescr *)buf_p;
+        buf_p += sizeof(NvmeZoneDescr);
+        nr_zones++;
+
+        z->zt = zs->d.zt;
+        z->zs = zs->d.zs;
+        z->zcap = cpu_to_le64(zs->d.zcap);
+        z->zslba = cpu_to_le64(zs->d.zslba);
+        z->za = zs->d.za;
+
+        if (nvme_wp_is_valid(zs)) {
+            z->wp = cpu_to_le64(zs->d.wp);
+        } else {
+            z->wp = cpu_to_le64(~0ULL);
+        }
+
+        zone_idx++;
+    }
+
+    if (!partial) {
+        for (; zone_idx < ns->num_zones; zone_idx++) {
+            zs = &ns->zone_array[zone_idx];
+            if (nvme_zone_matches_filter(zrasf, zs)) {
+                nr_zones++;
+            }
+        }
+    }
+    header->nr_zones = cpu_to_le64(nr_zones);
+
+    ret = nvme_dma(n, (uint8_t *)buf, len, DMA_DIRECTION_FROM_DEVICE, req);
+
+    g_free(buf);
+
+    return ret;
+}
+
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
     uint32_t nsid = le32_to_cpu(req->cmd.nsid);
@@ -1073,9 +1801,15 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
         return nvme_flush(n, req);
     case NVME_CMD_WRITE_ZEROES:
         return nvme_write_zeroes(n, req);
+    case NVME_CMD_ZONE_APPEND:
+        return nvme_rw(n, req, true);
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
-        return nvme_rw(n, req);
+        return nvme_rw(n, req, false);
+    case NVME_CMD_ZONE_MGMT_SEND:
+        return nvme_zone_mgmt_send(n, req);
+    case NVME_CMD_ZONE_MGMT_RECV:
+        return nvme_zone_mgmt_recv(n, req);
     default:
         trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -1301,7 +2035,7 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
+static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t csi, uint32_t buf_len,
                                  uint64_t off, NvmeRequest *req)
 {
     NvmeEffectsLog cmd_eff_log = {};
@@ -1326,11 +2060,20 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
     acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
     acs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
 
-    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
-                                  NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+    if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
+        iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
+                                      NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+    }
+
+    if (csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
+        iocs[NVME_CMD_ZONE_APPEND] = NVME_CMD_EFFECTS_CSUPP |
+                                     NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFFECTS_CSUPP;
+        iocs[NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFFECTS_CSUPP;
+    }
 
     trans_len = MIN(sizeof(cmd_eff_log) - off, buf_len);
 
@@ -1349,6 +2092,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
     uint8_t  lid = dw10 & 0xff;
     uint8_t  lsp = (dw10 >> 8) & 0xf;
     uint8_t  rae = (dw10 >> 15) & 0x1;
+    uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
     uint32_t numdl, numdu;
     uint64_t off, lpol, lpou;
     size_t   len;
@@ -1382,7 +2126,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
     case NVME_LOG_FW_SLOT_INFO:
         return nvme_fw_log_info(n, len, off, req);
     case NVME_LOG_CMD_EFFECTS:
-        return nvme_cmd_effects(n, len, off, req);
+        return nvme_cmd_effects(n, csi, len, off, req);
     default:
         trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1502,6 +2246,16 @@ static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
     return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static inline bool nvme_csi_has_nvm_support(NvmeNamespace *ns)
+{
+    switch (ns->csi) {
+    case NVME_CSI_NVM:
+    case NVME_CSI_ZONED:
+        return true;
+    }
+    return false;
+}
+
 static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_pci_nvme_identify_ctrl();
@@ -1513,11 +2267,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    NvmeIdCtrlZoned id = {};
 
     trace_pci_nvme_identify_ctrl_csi(c->csi);
 
     if (c->csi == NVME_CSI_NVM) {
         return nvme_rpt_empty_id_struct(n, req);
+    } else if (c->csi == NVME_CSI_ZONED) {
+        id.zasl = n->zasl;
+        return nvme_dma(n, (uint8_t *)&id, sizeof(id),
+                        DMA_DIRECTION_FROM_DEVICE, req);
     }
 
     return NVME_INVALID_FIELD | NVME_DNR;
@@ -1545,8 +2304,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
         return nvme_rpt_empty_id_struct(n, req);
     }
 
-    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
-                    DMA_DIRECTION_FROM_DEVICE, req);
+    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
+        return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
+                        DMA_DIRECTION_FROM_DEVICE, req);
+    }
+
+    return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
@@ -1571,8 +2334,11 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
         return nvme_rpt_empty_id_struct(n, req);
     }
 
-    if (c->csi == NVME_CSI_NVM) {
+    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
         return nvme_rpt_empty_id_struct(n, req);
+    } else if (c->csi == NVME_CSI_ZONED && ns->csi == NVME_CSI_ZONED) {
+        return nvme_dma(n, (uint8_t *)ns->id_ns_zoned, sizeof(NvmeIdNsZoned),
+                        DMA_DIRECTION_FROM_DEVICE, req);
     }
 
     return NVME_INVALID_FIELD | NVME_DNR;
@@ -1634,7 +2400,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
 
     trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
 
-    if (c->csi != NVME_CSI_NVM) {
+    if (c->csi != NVME_CSI_NVM && c->csi != NVME_CSI_ZONED) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1643,7 +2409,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
         if (!ns) {
             continue;
         }
-        if (ns->params.nsid < min_nsid) {
+        if (ns->params.nsid < min_nsid || c->csi != ns->csi) {
             continue;
         }
         if (only_active && !ns->params.attached) {
@@ -1696,19 +2462,29 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
     desc->nidt = NVME_NIDT_CSI;
     desc->nidl = NVME_NIDL_CSI;
     list_ptr += sizeof(*desc);
-    *(uint8_t *)list_ptr = NVME_CSI_NVM;
+    *(uint8_t *)list_ptr = ns->csi;
 
     return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
 {
+    NvmeNamespace *ns;
     uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
     static const int data_len = sizeof(list);
+    int i;
 
     trace_pci_nvme_identify_cmd_set();
 
     NVME_SET_CSI(*list, NVME_CSI_NVM);
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (ns && ns->params.zoned) {
+            NVME_SET_CSI(*list, NVME_CSI_ZONED);
+            break;
+        }
+    }
+
     return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
@@ -1751,7 +2527,7 @@ static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
 {
     uint16_t sqid = le32_to_cpu(req->cmd.cdw10) & 0xffff;
 
-    req->cqe.result = 1;
+    req->cqe.result32 = 1;
     if (nvme_check_sqid(n, sqid)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
@@ -1932,7 +2708,7 @@ defaults:
     }
 
 out:
-    req->cqe.result = cpu_to_le32(result);
+    req->cqe.result32 = cpu_to_le32(result);
     return NVME_SUCCESS;
 }
 
@@ -2057,8 +2833,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
                                     ((dw11 >> 16) & 0xFFFF) + 1,
                                     n->params.max_ioqpairs,
                                     n->params.max_ioqpairs);
-        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-                                      ((n->params.max_ioqpairs - 1) << 16));
+        req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
+                                        ((n->params.max_ioqpairs - 1) << 16));
         break;
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
@@ -2310,16 +3086,28 @@ static int nvme_start_ctrl(NvmeCtrl *n)
             continue;
         }
         ns->params.attached = false;
-        switch (ns->params.csi) {
+        switch (ns->csi) {
         case NVME_CSI_NVM:
             if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
                 NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
                 ns->params.attached = true;
             }
             break;
+        case NVME_CSI_ZONED:
+            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
+                ns->params.attached = true;
+            }
+            break;
         }
     }
 
+    if (!n->zasl_bs) {
+        assert(n->params.mdts);
+        n->zasl = n->params.mdts;
+    } else {
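+        /*
+         * ZASL is expressed as log2(max append size / page size);
+         * compute that exponent here via clz32.
+         */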
+        n->zasl = 31 - clz32(n->zasl_bs / n->page_size);
+    }
+
     nvme_set_timestamp(n, 0ULL);
 
     QTAILQ_INIT(&n->aer_queue);
@@ -2382,10 +3170,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
                 case CSS_NVM_ONLY:
                     trace_pci_nvme_css_nvm_cset_selected_by_host(data &
                                                                  0xffffffff);
                     break;
                 case CSS_CSI:
                     NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
-                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
+                    trace_pci_nvme_css_all_csets_sel_by_host(data &
+                                                             0xffffffff);
                     break;
                 case CSS_ADMIN_ONLY:
                     break;
@@ -2780,6 +3569,12 @@ static void nvme_init_state(NvmeCtrl *n)
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
+
+    if (!n->params.zasl_kb) {
+        n->zasl_bs = n->params.mdts ? 0 : NVME_DEFAULT_MAX_ZA_SIZE * KiB;
+    } else {
+        n->zasl_bs = n->params.zasl_kb * KiB;
+    }
 }
 
 int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -2985,8 +3780,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_CQR(n->bar.cap, 1);
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
     /*
-     * The device now always supports NS Types, but all commands
-     * that support CSI field will only handle NVM Command Set.
+     * The device now always supports NS Types, even when the "zoned"
+     * property is not set. In that case, all commands that support the
+     * CSI field only handle the NVM Command Set.
      */
     NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
@@ -3033,9 +3829,21 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 static void nvme_exit(PCIDevice *pci_dev)
 {
     NvmeCtrl *n = NVME(pci_dev);
+    NvmeNamespace *ns;
+    int i;
 
     nvme_clear_ctrl(n);
+
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+
+        nvme_ns_cleanup(ns);
+    }
     g_free(n->namespaces);
+
     g_free(n->cq);
     g_free(n->sq);
     g_free(n->aer_reqs);
@@ -3063,6 +3871,8 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64),
     DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
     DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
+    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
+    DEFINE_PROP_UINT32("zone_append_size_limit", NvmeCtrl, params.zasl_kb, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/block/nvme.h b/include/block/nvme.h
index a7126e123f..628c665728 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -651,8 +651,10 @@ typedef struct QEMU_PACKED NvmeAerResult {
 } NvmeAerResult;
 
 typedef struct QEMU_PACKED NvmeCqe {
-    uint32_t    result;
-    uint32_t    rsvd;
+    union {
+        uint64_t     result64;
+        uint32_t     result32;
+    };
     uint16_t    sq_head;
     uint16_t    sq_id;
     uint16_t    cid;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 10/14] hw/block/nvme: Introduce max active and open zone limits
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (8 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 11/14] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Added two module properties, "max_active" and "max_open", to control
the maximum number of zones that can be active or open. Once these
properties are set to non-default values, the limits are checked
during I/O and a Too Many Active Zones or Too Many Open Zones status
is returned if either limit is exceeded.
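
For example (the drive and bus names here are illustrative), both
limits can be enabled with:

    -device nvme-ns,drive=zns0,bus=nvme0,zoned=true,max_open=16,max_active=32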

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.c | 42 +++++++++++++++++++-
 hw/block/nvme-ns.h | 42 ++++++++++++++++++++
 hw/block/nvme.c    | 99 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 181 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 6d9dc9205b..63a2e3f47d 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -126,6 +126,28 @@ void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
     zone->prev = zone->next = 0;
 }
 
+/*
+ * Take the first zone off a list; return NULL if the list is empty.
+ */
+NvmeZone *nvme_remove_zone_head(NvmeNamespace *ns, NvmeZoneList *zl)
+{
+    NvmeZone *zone = nvme_peek_zone_head(ns, zl);
+
+    if (zone) {
+        --zl->size;
+        if (zl->size == 0) {
+            zl->head = NVME_ZONE_LIST_NIL;
+            zl->tail = NVME_ZONE_LIST_NIL;
+        } else {
+            zl->head = zone->next;
+            ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+        }
+        zone->prev = zone->next = 0;
+    }
+
+    return zone;
+}
+
 static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
 {
     uint64_t zone_size, zone_cap;
@@ -156,6 +178,20 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
         ns->zone_size_log2 = 63 - clz64(ns->zone_size);
     }
 
+    /* Make sure that the values of all ZNS properties are sane */
+    if (ns->params.max_open_zones > nz) {
+        error_setg(errp,
+                   "max_open_zones value %u exceeds the number of zones %u",
+                   ns->params.max_open_zones, nz);
+        return -1;
+    }
+    if (ns->params.max_active_zones > nz) {
+        error_setg(errp,
+                   "max_active_zones value %u exceeds the number of zones %u",
+                   ns->params.max_active_zones, nz);
+        return -1;
+    }
+
     return 0;
 }
 
@@ -215,8 +251,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
 
     /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
-    id_ns_z->mar = 0xffffffff;
-    id_ns_z->mor = 0xffffffff;
+    id_ns_z->mar = cpu_to_le32(ns->params.max_active_zones - 1);
+    id_ns_z->mor = cpu_to_le32(ns->params.max_open_zones - 1);
     id_ns_z->zoc = 0;
     id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
 
@@ -312,6 +348,8 @@ static Property nvme_ns_props[] = {
                        params.zone_capacity_mb, 0),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeNamespace,
                       params.cross_zone_read, false),
+    DEFINE_PROP_UINT32("max_active", NvmeNamespace, params.max_active_zones, 0),
+    DEFINE_PROP_UINT32("max_open", NvmeNamespace, params.max_open_zones, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index daa13546c4..0664fe0892 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -45,6 +45,8 @@ typedef struct NvmeNamespaceParams {
     bool     cross_zone_read;
     uint64_t zone_size_mb;
     uint64_t zone_capacity_mb;
+    uint32_t max_active_zones;
+    uint32_t max_open_zones;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -66,6 +68,8 @@ typedef struct NvmeNamespace {
     uint64_t        zone_capacity;
     uint64_t        zone_array_size;
     uint32_t        zone_size_log2;
+    int32_t         nr_open_zones;
+    int32_t         nr_active_zones;
 
     NvmeNamespaceParams params;
 } NvmeNamespace;
@@ -189,7 +193,45 @@ static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
     return &ns->zone_array[z->next];
 }
 
+static inline void nvme_aor_inc_open(NvmeNamespace *ns)
+{
+    assert(ns->nr_open_zones >= 0);
+    if (ns->params.max_open_zones) {
+        ns->nr_open_zones++;
+        assert(ns->nr_open_zones <= ns->params.max_open_zones);
+    }
+}
+
+static inline void nvme_aor_dec_open(NvmeNamespace *ns)
+{
+    if (ns->params.max_open_zones) {
+        assert(ns->nr_open_zones > 0);
+        ns->nr_open_zones--;
+    }
+    assert(ns->nr_open_zones >= 0);
+}
+
+static inline void nvme_aor_inc_active(NvmeNamespace *ns)
+{
+    assert(ns->nr_active_zones >= 0);
+    if (ns->params.max_active_zones) {
+        ns->nr_active_zones++;
+        assert(ns->nr_active_zones <= ns->params.max_active_zones);
+    }
+}
+
+static inline void nvme_aor_dec_active(NvmeNamespace *ns)
+{
+    if (ns->params.max_active_zones) {
+        assert(ns->nr_active_zones > 0);
+        ns->nr_active_zones--;
+        assert(ns->nr_active_zones >= ns->nr_open_zones);
+    }
+    assert(ns->nr_active_zones >= 0);
+}
+
 void nvme_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone);
 void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone);
+NvmeZone *nvme_remove_zone_head(NvmeNamespace *ns, NvmeZoneList *zl);
 
 #endif /* NVME_NS_H */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 38e25a4d1f..40947aa659 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -168,6 +168,26 @@ static void nvme_assign_zone_state(NvmeNamespace *ns, NvmeZone *zone,
     }
 }
 
+/*
+ * Check if we can open a zone without exceeding open/active limits.
+ * AOR stands for "Active and Open Resources" (see TP 4053 section 2.5).
+ */
+static int nvme_aor_check(NvmeNamespace *ns, uint32_t act, uint32_t opn)
+{
+    if (ns->params.max_active_zones != 0 &&
+        ns->nr_active_zones + act > ns->params.max_active_zones) {
+        trace_pci_nvme_err_insuff_active_res(ns->params.max_active_zones);
+        return NVME_ZONE_TOO_MANY_ACTIVE | NVME_DNR;
+    }
+    if (ns->params.max_open_zones != 0 &&
+        ns->nr_open_zones + opn > ns->params.max_open_zones) {
+        trace_pci_nvme_err_insuff_open_res(ns->params.max_open_zones);
+        return NVME_ZONE_TOO_MANY_OPEN | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
     hwaddr low = n->ctrl_mem.addr;
@@ -1035,6 +1055,40 @@ static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
     return status;
 }
 
+static void nvme_auto_transition_zone(NvmeNamespace *ns, bool implicit,
+                                      bool adding_active)
+{
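+    /*
+     * Only an implicit open while at the open zone limit triggers an
+     * automatic close here; the adding_active argument is currently unused.
+     */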
+    NvmeZone *zone;
+
+    if (implicit && ns->params.max_open_zones &&
+        ns->nr_open_zones == ns->params.max_open_zones) {
+        zone = nvme_remove_zone_head(ns, ns->imp_open_zones);
+        if (zone) {
+            /*
+             * Automatically close this implicitly open zone.
+             */
+            nvme_aor_dec_open(ns);
+            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
+        }
+    }
+}
+
+static uint16_t nvme_auto_open_zone(NvmeNamespace *ns, NvmeZone *zone)
+{
+    uint16_t status = NVME_SUCCESS;
+    uint8_t zs = nvme_get_zone_state(zone);
+
+    if (zs == NVME_ZONE_STATE_EMPTY) {
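+        /* Opening an Empty zone consumes both an active and an open resource */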
+        nvme_auto_transition_zone(ns, true, true);
+        status = nvme_aor_check(ns, 1, 1);
+    } else if (zs == NVME_ZONE_STATE_CLOSED) {
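+        /* A Closed zone is already active, only an open resource is needed */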
+        nvme_auto_transition_zone(ns, true, false);
+        status = nvme_aor_check(ns, 0, 1);
+    }
+
+    return status;
+}
+
 static inline uint32_t nvme_zone_idx(NvmeNamespace *ns, uint64_t slba)
 {
     return ns->zone_size_log2 > 0 ? slba >> ns->zone_size_log2 :
@@ -1080,7 +1134,11 @@ static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
         switch (zs) {
         case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_aor_dec_open(ns);
+            /* fall through */
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_dec_active(ns);
+            /* fall through */
         case NVME_ZONE_STATE_EMPTY:
             nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
             /* fall through */
@@ -1109,7 +1167,10 @@ static uint64_t nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
         zs = nvme_get_zone_state(zone);
         switch (zs) {
         case NVME_ZONE_STATE_EMPTY:
+            nvme_aor_inc_active(ns);
+            /* fall through */
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_inc_open(ns);
             nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_IMPLICITLY_OPEN);
         }
     }
@@ -1282,6 +1343,11 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
             return NVME_ZONE_INVALID_WRITE | NVME_DNR;
         }
 
+        status = nvme_auto_open_zone(ns, zone);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+
         req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
     }
 
@@ -1349,6 +1415,12 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
                 status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
                 goto invalid;
             }
+
+            status = nvme_auto_open_zone(ns, zone);
+            if (status != NVME_SUCCESS) {
+                goto invalid;
+            }
+
             req->fill_ofs = -1LL;
         } else {
             status = nvme_check_zone_read(ns, zone, slba, nlb);
@@ -1434,9 +1506,27 @@ static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeNamespace *ns, NvmeCmd *c,
 static uint16_t nvme_open_zone(NvmeNamespace *ns, NvmeZone *zone,
                                uint8_t state)
 {
+    uint16_t status;
+
     switch (state) {
     case NVME_ZONE_STATE_EMPTY:
+        nvme_auto_transition_zone(ns, false, true);
+        status = nvme_aor_check(ns, 1, 0);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+        nvme_aor_inc_active(ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        status = nvme_aor_check(ns, 0, 1);
+        if (status != NVME_SUCCESS) {
+            if (state == NVME_ZONE_STATE_EMPTY) {
+                nvme_aor_dec_active(ns);
+            }
+            return status;
+        }
+        nvme_aor_inc_open(ns);
+        /* fall through */
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
         /* fall through */
@@ -1458,6 +1548,7 @@ static uint16_t nvme_close_zone(NvmeNamespace *ns, NvmeZone *zone,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(ns);
         nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
         /* fall through */
     case NVME_ZONE_STATE_CLOSED:
@@ -1479,7 +1570,11 @@ static uint16_t nvme_finish_zone(NvmeNamespace *ns, NvmeZone *zone,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        nvme_aor_dec_active(ns);
+        /* fall through */
     case NVME_ZONE_STATE_EMPTY:
         zone->w_ptr = nvme_zone_wr_boundary(zone);
         zone->d.wp = zone->w_ptr;
@@ -1505,7 +1600,11 @@ static uint16_t nvme_reset_zone(NvmeNamespace *ns, NvmeZone *zone,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        nvme_aor_dec_active(ns);
+        /* fall through */
     case NVME_ZONE_STATE_FULL:
         zone->w_ptr = zone->d.zslba;
         zone->d.wp = zone->w_ptr;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 11/14] hw/block/nvme: Support Zone Descriptor Extensions
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (9 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 10/14] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 12/14] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

A Zone Descriptor Extension is a label that can be assigned to a zone.
It can only be set on a zone in the Empty state and it stays assigned
until the zone is reset.

This commit adds a new optional module property, "zone_descr_ext_size".
Its value must be a multiple of 64 bytes. If this value is non-zero,
it becomes possible to assign extensions of that size to Empty zones.
The default value for this property is 0, so setting extensions is
disabled by default.
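
For instance (drive and bus names are illustrative), 128-byte
extensions can be enabled with:

    -device nvme-ns,drive=zns0,bus=nvme0,zoned=true,zone_descr_ext_size=128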

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme-ns.c | 10 ++++++++-
 hw/block/nvme-ns.h |  8 ++++++++
 hw/block/nvme.c    | 51 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 63a2e3f47d..60156dfeaf 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -207,6 +207,10 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
     ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
     ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
     ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+    if (ns->params.zd_extension_size) {
+        ns->zd_extensions = g_malloc0(ns->params.zd_extension_size *
+                                      ns->num_zones);
+    }
 
     nvme_init_zone_list(ns->exp_open_zones);
     nvme_init_zone_list(ns->imp_open_zones);
@@ -257,7 +261,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
 
     id_ns_z->lbafe[lba_index].zsze = cpu_to_le64(ns->zone_size);
-    id_ns_z->lbafe[lba_index].zdes = 0; /* FIXME make helper */
+    id_ns_z->lbafe[lba_index].zdes =
+        ns->params.zd_extension_size >> 6; /* Units of 64B */
 
     ns->csi = NVME_CSI_ZONED;
     ns->id_ns.ncap = cpu_to_le64(ns->zone_capacity * ns->num_zones);
@@ -321,6 +326,7 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
     g_free(ns->imp_open_zones);
     g_free(ns->closed_zones);
     g_free(ns->full_zones);
+    g_free(ns->zd_extensions);
 }
 
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
@@ -350,6 +356,8 @@ static Property nvme_ns_props[] = {
                       params.cross_zone_read, false),
     DEFINE_PROP_UINT32("max_active", NvmeNamespace, params.max_active_zones, 0),
     DEFINE_PROP_UINT32("max_open", NvmeNamespace, params.max_open_zones, 0),
+    DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeNamespace,
+                       params.zd_extension_size, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 0664fe0892..ed14644e09 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -47,6 +47,7 @@ typedef struct NvmeNamespaceParams {
     uint64_t zone_capacity_mb;
     uint32_t max_active_zones;
     uint32_t max_open_zones;
+    uint32_t zd_extension_size;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -68,6 +69,7 @@ typedef struct NvmeNamespace {
     uint64_t        zone_capacity;
     uint64_t        zone_array_size;
     uint32_t        zone_size_log2;
+    uint8_t         *zd_extensions;
     int32_t         nr_open_zones;
     int32_t         nr_active_zones;
 
@@ -142,6 +144,12 @@ static inline bool nvme_wp_is_valid(NvmeZone *zone)
            st != NVME_ZONE_STATE_OFFLINE;
 }
 
+static inline uint8_t *nvme_get_zd_extension(NvmeNamespace *ns,
+                                             uint32_t zone_idx)
+{
+    return &ns->zd_extensions[zone_idx * ns->params.zd_extension_size];
+}
+
 /*
  * Initialize a zone list head.
  */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 40947aa659..27d191c659 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1644,6 +1644,26 @@ static bool nvme_cond_offline_all(uint8_t state)
     return state == NVME_ZONE_STATE_READ_ONLY;
 }
 
+static uint16_t nvme_set_zd_ext(NvmeNamespace *ns, NvmeZone *zone,
+                                uint8_t state)
+{
+    uint16_t status;
+
+    if (state == NVME_ZONE_STATE_EMPTY) {
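+        /* Setting an extension transitions the zone from Empty to Closed */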
+        nvme_auto_transition_zone(ns, false, true);
+        status = nvme_aor_check(ns, 1, 0);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+        nvme_aor_inc_active(ns);
+        zone->d.za |= NVME_ZA_ZD_EXT_VALID;
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
 typedef uint16_t (*op_handler_t)(NvmeNamespace *, NvmeZone *,
                                  uint8_t);
 typedef bool (*need_to_proc_zone_t)(uint8_t);
@@ -1684,6 +1704,7 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
     uint8_t action, state;
     bool all;
     NvmeZone *zone;
+    uint8_t *zd_ext;
 
     action = dw13 & 0xff;
     all = dw13 & 0x100;
@@ -1738,7 +1759,22 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
 
     case NVME_ZONE_ACTION_SET_ZD_EXT:
         trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
-        return NVME_INVALID_FIELD | NVME_DNR;
+        if (all || !ns->params.zd_extension_size) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+        zd_ext = nvme_get_zd_extension(ns, zone_idx);
+        status = nvme_dma(n, zd_ext, ns->params.zd_extension_size,
+                          DMA_DIRECTION_TO_DEVICE, req);
+        if (status) {
+            trace_pci_nvme_err_zd_extension_map_error(zone_idx);
+            return status;
+        }
+
+        status = nvme_set_zd_ext(ns, zone, state);
+        if (status == NVME_SUCCESS) {
+            trace_pci_nvme_zd_extension_set(zone_idx);
+            return status;
+        }
         break;
 
     default:
@@ -1816,7 +1852,7 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+    if (zra == NVME_ZONE_REPORT_EXTENDED && !ns->params.zd_extension_size) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1828,6 +1864,9 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
     partial = (dw13 >> 16) & 0x01;
 
     zone_entry_sz = sizeof(NvmeZoneDescr);
+    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+        zone_entry_sz += ns->params.zd_extension_size;
+    }
 
     max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
     buf = g_malloc0(len);
@@ -1859,6 +1898,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
             z->wp = cpu_to_le64(~0ULL);
         }
 
+        if (zra == NVME_ZONE_REPORT_EXTENDED) {
+            if (zs->d.za & NVME_ZA_ZD_EXT_VALID) {
+                memcpy(buf_p, nvme_get_zd_extension(ns, zone_idx),
+                       ns->params.zd_extension_size);
+            }
+            buf_p += ns->params.zd_extension_size;
+        }
+
         zone_idx++;
     }
 
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 12/14] hw/block/nvme: Add injection of Offline/Read-Only zones
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (10 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 11/14] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
  2020-09-28  2:35 ` [PATCH v5 14/14] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
  13 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

The ZNS specification defines two zone conditions for zones that can
no longer function properly, possibly because of flash wear or other
internal faults. It is useful to be able to "inject" a small number of
such zones for testing purposes.

This commit defines two optional device properties, "offline_zones"
and "rdonly_zones". Users can assign non-zero values to these variables
to specify the number of zones to be initialized as Offline or
Read-Only. The actual number of injected zones may be smaller than the
requested amount - Read-Only and Offline counts are expected to be much
smaller than the total number of zones on a drive.
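
For instance (drive and bus names are illustrative), a few faulty
zones can be injected with:

    -device nvme-ns,drive=zns0,bus=nvme0,zoned=true,offline_zones=2,rdonly_zones=2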

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme-ns.h |  2 ++
 hw/block/nvme.c    |  1 -
 3 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 60156dfeaf..47751f2d54 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -21,6 +21,7 @@
 #include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
 #include "qapi/error.h"
+#include "crypto/random.h"
 
 #include "hw/qdev-properties.h"
 #include "hw/qdev-core.h"
@@ -192,6 +193,32 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
+    if (ns->params.zd_extension_size) {
+        if (ns->params.zd_extension_size & 0x3f) {
+            error_setg(errp,
+                "zone descriptor extension size must be a multiple of 64B");
+            return -1;
+        }
+        if ((ns->params.zd_extension_size >> 6) > 0xff) {
+            error_setg(errp, "zone descriptor extension size is too large");
+            return -1;
+        }
+    }
+
+    if (ns->params.max_open_zones < nz) {
+        if (ns->params.nr_offline_zones > nz - ns->params.max_open_zones) {
+            error_setg(errp, "offline_zones value %u is too large",
+                ns->params.nr_offline_zones);
+            return -1;
+        }
+        if (ns->params.nr_rdonly_zones >
+            nz - ns->params.max_open_zones - ns->params.nr_offline_zones) {
+            error_setg(errp, "rdonly_zones value %u is too large",
+                ns->params.nr_rdonly_zones);
+            return -1;
+        }
+    }
+
     return 0;
 }
 
@@ -200,7 +227,9 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
     uint64_t start = 0, zone_size = ns->zone_size;
     uint64_t capacity = ns->num_zones * zone_size;
     NvmeZone *zone;
+    uint32_t rnd;
     int i;
+    uint16_t zs;
 
     ns->zone_array = g_malloc0(ns->zone_array_size);
     ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
@@ -233,6 +262,37 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
         zone->next = 0;
         start += zone_size;
     }
+
+    /* If required, make some zones Offline or Read Only */
+
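+    /*
+     * Only zones with index >= max_open_zones are candidates, so at
+     * least max_open_zones zones remain usable after injection.
+     */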
+    for (i = 0; i < ns->params.nr_offline_zones; i++) {
+        do {
+            qcrypto_random_bytes(&rnd, sizeof(rnd), NULL);
+            rnd %= ns->num_zones;
+        } while (rnd < ns->params.max_open_zones);
+        zone = &ns->zone_array[rnd];
+        zs = nvme_get_zone_state(zone);
+        if (zs != NVME_ZONE_STATE_OFFLINE) {
+            nvme_set_zone_state(zone, NVME_ZONE_STATE_OFFLINE);
+        } else {
+            i--;
+        }
+    }
+
+    for (i = 0; i < ns->params.nr_rdonly_zones; i++) {
+        do {
+            qcrypto_random_bytes(&rnd, sizeof(rnd), NULL);
+            rnd %= ns->num_zones;
+        } while (rnd < ns->params.max_open_zones);
+        zone = &ns->zone_array[rnd];
+        zs = nvme_get_zone_state(zone);
+        if (zs != NVME_ZONE_STATE_OFFLINE &&
+            zs != NVME_ZONE_STATE_READ_ONLY) {
+            nvme_set_zone_state(zone, NVME_ZONE_STATE_READ_ONLY);
+        } else {
+            i--;
+        }
+    }
 }
 
 static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
@@ -358,6 +418,10 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT32("max_open", NvmeNamespace, params.max_open_zones, 0),
     DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeNamespace,
                        params.zd_extension_size, 0),
+    DEFINE_PROP_UINT32("offline_zones", NvmeNamespace,
+                       params.nr_offline_zones, 0),
+    DEFINE_PROP_UINT32("rdonly_zones", NvmeNamespace,
+                       params.nr_rdonly_zones, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index ed14644e09..e9b90f9677 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -48,6 +48,8 @@ typedef struct NvmeNamespaceParams {
     uint32_t max_active_zones;
     uint32_t max_open_zones;
     uint32_t zd_extension_size;
+    uint32_t nr_offline_zones;
+    uint32_t nr_rdonly_zones;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 27d191c659..80973f3ff6 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -54,7 +54,6 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qemu/error-report.h"
-#include "crypto/random.h"
 #include "hw/block/block.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (11 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 12/14] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  2020-09-28  7:51   ` Klaus Jensen
  2020-09-28  2:35 ` [PATCH v5 14/14] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
  13 siblings, 1 reply; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

A ZNS drive that is emulated by this module is currently initialized
with all zones Empty upon startup. However, actual ZNS SSDs save the
state and condition of all zones in their internal NVRAM in the event
of power loss. When such a drive is powered up again, it closes or
finishes all zones that were open at the moment of shutdown. Besides
that, the write pointer position and the state and condition of all
zones are preserved across power-downs.

This commit adds the capability to have persistent zone metadata
in the device. The new optional module property, "zone_file",
is introduced. If added to the command line, this property specifies
the name of the file that stores the zone metadata. If "zone_file" is
omitted, the device will be initialized with all zones empty, the same
as before.

If zone metadata is configured to be persistent, then zone descriptor
extensions also persist across controller shutdowns.
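
For example (drive, bus and file names here are illustrative):

    -device nvme-ns,drive=zns0,bus=nvme0,zoned=true,zone_file=zns0-meta.bin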

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.c    | 341 ++++++++++++++++++++++++++++++++++++++++--
 hw/block/nvme-ns.h    |  33 ++++
 hw/block/nvme.c       |   2 +
 hw/block/trace-events |   1 +
 4 files changed, 362 insertions(+), 15 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 47751f2d54..a94021da81 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -20,12 +20,15 @@
 #include "hw/pci/pci.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
+#include "sysemu/hostmem.h"
+#include "qom/object_interfaces.h"
 #include "qapi/error.h"
 #include "crypto/random.h"
 
 #include "hw/qdev-properties.h"
 #include "hw/qdev-core.h"
 
+#include "trace.h"
 #include "nvme.h"
 #include "nvme-ns.h"
 
@@ -98,6 +101,7 @@ void nvme_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
         zl->tail = idx;
     }
     zl->size++;
+    nvme_set_zone_meta_dirty(ns);
 }
 
 /*
@@ -113,12 +117,15 @@ void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
     if (zl->size == 0) {
         zl->head = NVME_ZONE_LIST_NIL;
         zl->tail = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(ns);
     } else if (idx == zl->head) {
         zl->head = zone->next;
         ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(ns);
     } else if (idx == zl->tail) {
         zl->tail = zone->prev;
         ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(ns);
     } else {
         ns->zone_array[zone->next].prev = zone->prev;
         ns->zone_array[zone->prev].next = zone->next;
@@ -144,6 +151,7 @@ NvmeZone *nvme_remove_zone_head(NvmeNamespace *ns, NvmeZoneList *zl)
             ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
         }
         zone->prev = zone->next = 0;
+        nvme_set_zone_meta_dirty(ns);
     }
 
     return zone;
@@ -219,11 +227,119 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
         }
     }
 
+    ns->meta_size = sizeof(NvmeZoneMeta) + ns->zone_array_size +
+                          nz * ns->params.zd_extension_size;
+    ns->meta_size = ROUND_UP(ns->meta_size, qemu_real_host_page_size);
+
+    return 0;
+}
+
+static int nvme_validate_zone_file(NvmeNamespace *ns, uint64_t capacity)
+{
+    NvmeZoneMeta *meta = ns->zone_meta;
+    NvmeZone *zone = ns->zone_array;
+    uint64_t start = 0, zone_size = ns->zone_size;
+    int i, n_imp_open = 0, n_exp_open = 0, n_closed = 0, n_full = 0;
+
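+    /* Each non-zero return value identifies the consistency check that failed */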
+    if (meta->magic != NVME_ZONE_META_MAGIC) {
+        return 1;
+    }
+    if (meta->version != NVME_ZONE_META_VER) {
+        return 2;
+    }
+    if (meta->zone_size != zone_size) {
+        return 3;
+    }
+    if (meta->zone_capacity != ns->zone_capacity) {
+        return 4;
+    }
+    if (meta->nr_offline_zones != ns->params.nr_offline_zones) {
+        return 5;
+    }
+    if (meta->nr_rdonly_zones != ns->params.nr_rdonly_zones) {
+        return 6;
+    }
+    if (meta->lba_size != ns->blkconf.logical_block_size) {
+        return 7;
+    }
+    if (meta->zd_extension_size != ns->params.zd_extension_size) {
+        return 8;
+    }
+
+    for (i = 0; i < ns->num_zones; i++, zone++) {
+        if (start + zone_size > capacity) {
+            zone_size = capacity - start;
+        }
+        if (zone->d.zt != NVME_ZONE_TYPE_SEQ_WRITE) {
+            return 9;
+        }
+        if (zone->d.zcap != ns->zone_capacity) {
+            return 10;
+        }
+        if (zone->d.zslba != start) {
+            return 11;
+        }
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_OFFLINE:
+        case NVME_ZONE_STATE_READ_ONLY:
+            if (zone->d.wp != start) {
+                return 12;
+            }
+            break;
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_imp_open++;
+            break;
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_exp_open++;
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_closed++;
+            break;
+        case NVME_ZONE_STATE_FULL:
+            if (zone->d.wp != zone->d.zslba + zone->d.zcap) {
+                return 14;
+            }
+            n_full++;
+            break;
+        default:
+            return 15;
+        }
+
+        start += zone_size;
+    }
+
+    if (n_exp_open != nvme_zone_list_size(ns->exp_open_zones)) {
+        return 16;
+    }
+    if (n_imp_open != nvme_zone_list_size(ns->imp_open_zones)) {
+        return 17;
+    }
+    if (n_closed != nvme_zone_list_size(ns->closed_zones)) {
+        return 18;
+    }
+    if (n_full != nvme_zone_list_size(ns->full_zones)) {
+        return 19;
+    }
+
     return 0;
 }
 
 static void nvme_init_zone_meta(NvmeNamespace *ns)
 {
+    NvmeZoneMeta *meta = ns->zone_meta;
     uint64_t start = 0, zone_size = ns->zone_size;
     uint64_t capacity = ns->num_zones * zone_size;
     NvmeZone *zone;
@@ -231,14 +347,26 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
     int i;
     uint16_t zs;
 
-    ns->zone_array = g_malloc0(ns->zone_array_size);
-    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
-    if (ns->params.zd_extension_size) {
-        ns->zd_extensions = g_malloc0(ns->params.zd_extension_size *
-                                      ns->num_zones);
+    if (ns->params.zone_file) {
+        meta->magic = NVME_ZONE_META_MAGIC;
+        meta->version = NVME_ZONE_META_VER;
+        meta->zone_size = zone_size;
+        meta->zone_capacity = ns->zone_capacity;
+        meta->lba_size = ns->blkconf.logical_block_size;
+        meta->nr_offline_zones = ns->params.nr_offline_zones;
+        meta->nr_rdonly_zones = ns->params.nr_rdonly_zones;
+        meta->zd_extension_size = ns->params.zd_extension_size;
+    } else {
+        assert(!ns->zone_meta);
+        ns->zone_array = g_malloc0(ns->zone_array_size);
+        ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+        if (ns->params.zd_extension_size) {
+            ns->zd_extensions = g_malloc0(ns->params.zd_extension_size *
+                                          ns->num_zones);
+        }
     }
 
     nvme_init_zone_list(ns->exp_open_zones);
@@ -293,12 +421,180 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
             i--;
         }
     }
+
+    if (ns->params.zone_file) {
+        nvme_set_zone_meta_dirty(ns);
+    }
+}
+
+static int nvme_open_zone_file(NvmeNamespace *ns, bool *init_meta,
+                               Error **errp)
+{
+    Object *file_be;
+    HostMemoryBackend *fb;
+    struct stat statbuf;
+    int ret;
+
+    ret = stat(ns->params.zone_file, &statbuf);
+    if (ret < 0) {
+        if (errno != ENOENT) {
+            error_setg_errno(errp, errno, "can't stat zone file \"%s\"",
+                             ns->params.zone_file);
+            return -1;
+        }
+        /* The file does not exist yet, initialize the metadata below */
+        *init_meta = true;
+    } else if (!S_ISREG(statbuf.st_mode)) {
+        error_setg(errp, "\"%s\" is not a regular file",
+                   ns->params.zone_file);
+        return -1;
+    }
+
+    file_be = object_new(TYPE_MEMORY_BACKEND_FILE);
+    object_property_set_str(file_be, "mem-path", ns->params.zone_file,
+                            &error_abort);
+    object_property_set_int(file_be, "size", ns->meta_size, &error_abort);
+    object_property_set_bool(file_be, "share", true, &error_abort);
+    object_property_set_bool(file_be, "discard-data", false, &error_abort);
+    if (!user_creatable_complete(USER_CREATABLE(file_be), errp)) {
+        object_unref(file_be);
+        return -1;
+    }
+    object_property_add_child(OBJECT(ns), "_fb", file_be);
+    object_unref(file_be);
+
+    fb = MEMORY_BACKEND(file_be);
+    ns->zone_mr = host_memory_backend_get_memory(fb);
+
+    return 0;
+}
+
+static int nvme_map_zone_file(NvmeNamespace *ns, bool *init_meta)
+{
+    ns->zone_meta = (void *)memory_region_get_ram_ptr(ns->zone_mr);
+    ns->zone_array = (NvmeZone *)(ns->zone_meta + 1);
+    ns->exp_open_zones = &ns->zone_meta->exp_open_zones;
+    ns->imp_open_zones = &ns->zone_meta->imp_open_zones;
+    ns->closed_zones = &ns->zone_meta->closed_zones;
+    ns->full_zones = &ns->zone_meta->full_zones;
+
+    if (ns->params.zd_extension_size) {
+        ns->zd_extensions = (uint8_t *)(ns->zone_meta + 1);
+        ns->zd_extensions += ns->zone_array_size;
+    }
+
+    return 0;
+}
+
+void nvme_sync_zone_file(NvmeNamespace *ns, NvmeZone *zone, int len)
+{
+    uintptr_t z = (uintptr_t)zone, off = z - (uintptr_t)ns->zone_meta;
+
+    if (ns->zone_meta) {
+        memory_region_msync(ns->zone_mr, off, len);
+
+        if (ns->zone_meta->dirty) {
+            ns->zone_meta->dirty = false;
+            memory_region_msync(ns->zone_mr, 0, sizeof(NvmeZoneMeta));
+        }
+    }
+}
+
+/*
+ * Close or finish all the zones that might still be open after power-down.
+ */
+static void nvme_prepare_zones(NvmeNamespace *ns)
+{
+    NvmeZone *zone;
+    uint32_t set_state;
+    int i;
+
+    assert(!ns->nr_active_zones);
+    assert(!ns->nr_open_zones);
+
+    zone = ns->zone_array;
+    for (i = 0; i < ns->num_zones; i++, zone++) {
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            nvme_remove_zone(ns, ns->imp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_remove_zone(ns, ns->exp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_inc_active(ns);
+            /* fall through */
+        default:
+            continue;
+        }
+
+        if (zone->d.za & NVME_ZA_ZD_EXT_VALID) {
+            set_state = NVME_ZONE_STATE_CLOSED;
+        } else if (zone->d.wp == zone->d.zslba) {
+            set_state = NVME_ZONE_STATE_EMPTY;
+        } else if (ns->params.max_active_zones == 0 ||
+                   ns->nr_active_zones < ns->params.max_active_zones) {
+            set_state = NVME_ZONE_STATE_CLOSED;
+        } else {
+            set_state = NVME_ZONE_STATE_FULL;
+        }
+
+        switch (set_state) {
+        case NVME_ZONE_STATE_CLOSED:
+            trace_pci_nvme_power_on_close(nvme_get_zone_state(zone),
+                                          zone->d.zslba);
+            nvme_aor_inc_active(ns);
+            nvme_add_zone_tail(ns, ns->closed_zones, zone);
+            break;
+        case NVME_ZONE_STATE_EMPTY:
+            trace_pci_nvme_power_on_reset(nvme_get_zone_state(zone),
+                                          zone->d.zslba);
+            break;
+        case NVME_ZONE_STATE_FULL:
+            trace_pci_nvme_power_on_full(nvme_get_zone_state(zone),
+                                         zone->d.zslba);
+            zone->d.wp = nvme_zone_wr_boundary(zone);
+        }
+
+        zone->w_ptr = zone->d.wp;
+        nvme_set_zone_state(zone, set_state);
+    }
+}
+
+static int nvme_load_zone_meta(NvmeNamespace *ns, bool *init_meta)
+{
+    uint64_t capacity = ns->num_zones * ns->zone_size;
+    int ret = 0;
+
+    if (ns->params.zone_file) {
+        ret = nvme_map_zone_file(ns, init_meta);
+        trace_pci_nvme_mapped_zone_file(ns->params.zone_file, ret);
+        if (ret < 0) {
+            return ret;
+        }
+
+        if (!*init_meta) {
+            ret = nvme_validate_zone_file(ns, capacity);
+            if (ret) {
+                trace_pci_nvme_err_zone_file_invalid(ret);
+                *init_meta = true;
+            }
+        }
+    } else {
+        *init_meta = true;
+    }
+
+    if (*init_meta) {
+        nvme_init_zone_meta(ns);
+        trace_pci_nvme_initialized_zone_file(ns->params.zone_file);
+    } else {
+        nvme_prepare_zones(ns);
+    }
+    nvme_sync_zone_file(ns, ns->zone_array, ns->zone_array_size);
+
+    return 0;
 }
 
 static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
                               Error **errp)
 {
     NvmeIdNsZoned *id_ns_z;
+    int ret;
+    bool init_meta = false;
 
     if (n->params.fill_pattern == 0) {
         ns->id_ns.dlfeat |= 0x01;
@@ -310,7 +606,17 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
         return -1;
     }
 
-    nvme_init_zone_meta(ns);
+    if (ns->params.zone_file) {
+        if (nvme_open_zone_file(ns, &init_meta, errp) != 0) {
+            return -1;
+        }
+    }
+
+    ret = nvme_load_zone_meta(ns, &init_meta);
+    if (ret) {
+        error_setg(errp, "could not load/init zone metadata, err=%d", ret);
+        return -1;
+    }
 
     id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
 
@@ -376,17 +682,21 @@ void nvme_ns_drain(NvmeNamespace *ns)
 void nvme_ns_flush(NvmeNamespace *ns)
 {
     blk_flush(ns->blkconf.blk);
+
+    nvme_sync_zone_file(ns, ns->zone_array, ns->zone_array_size);
 }
 
 void nvme_ns_cleanup(NvmeNamespace *ns)
 {
+    if (!ns->params.zone_file) {
+        g_free(ns->zone_array);
+        g_free(ns->exp_open_zones);
+        g_free(ns->imp_open_zones);
+        g_free(ns->closed_zones);
+        g_free(ns->full_zones);
+        g_free(ns->zd_extensions);
+    }
     g_free(ns->id_ns_zoned);
-    g_free(ns->zone_array);
-    g_free(ns->exp_open_zones);
-    g_free(ns->imp_open_zones);
-    g_free(ns->closed_zones);
-    g_free(ns->full_zones);
-    g_free(ns->zd_extensions);
 }
 
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
@@ -422,6 +732,7 @@ static Property nvme_ns_props[] = {
                        params.nr_offline_zones, 0),
     DEFINE_PROP_UINT32("rdonly_zones", NvmeNamespace,
                        params.nr_rdonly_zones, 0),
+    DEFINE_PROP_STRING("zone_file", NvmeNamespace, params.zone_file),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index e9b90f9677..4ff0955f91 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -36,6 +36,27 @@ typedef struct NvmeZoneList {
     uint8_t         rsvd12[4];
 } NvmeZoneList;
 
+#define NVME_ZONE_META_MAGIC 0x3aebaa70
+#define NVME_ZONE_META_VER  1
+
+typedef struct NvmeZoneMeta {
+    uint32_t        magic;
+    uint32_t        version;
+    uint64_t        zone_size;
+    uint64_t        zone_capacity;
+    uint32_t        nr_offline_zones;
+    uint32_t        nr_rdonly_zones;
+    uint32_t        lba_size;
+    uint32_t        rsvd40;
+    NvmeZoneList    exp_open_zones;
+    NvmeZoneList    imp_open_zones;
+    NvmeZoneList    closed_zones;
+    NvmeZoneList    full_zones;
+    uint32_t        zd_extension_size;
+    uint8_t         dirty;
+    uint8_t         rsvd[3987];
+} NvmeZoneMeta;
+
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
     bool     attached;
@@ -50,6 +71,7 @@ typedef struct NvmeNamespaceParams {
     uint32_t zd_extension_size;
     uint32_t nr_offline_zones;
     uint32_t nr_rdonly_zones;
+    char     *zone_file;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -62,6 +84,7 @@ typedef struct NvmeNamespace {
 
     NvmeIdNsZoned   *id_ns_zoned;
     NvmeZone        *zone_array;
+    NvmeZoneMeta    *zone_meta;
     NvmeZoneList    *exp_open_zones;
     NvmeZoneList    *imp_open_zones;
     NvmeZoneList    *closed_zones;
@@ -74,6 +97,8 @@ typedef struct NvmeNamespace {
     uint8_t         *zd_extensions;
     int32_t         nr_open_zones;
     int32_t         nr_active_zones;
+    MemoryRegion    *zone_mr;
+    size_t          meta_size;
 
     NvmeNamespaceParams params;
 } NvmeNamespace;
@@ -110,6 +135,13 @@ static inline size_t nvme_l2b(NvmeNamespace *ns, uint64_t lba)
     return lba << nvme_ns_lbads(ns);
 }
 
+static inline void nvme_set_zone_meta_dirty(NvmeNamespace *ns)
+{
+    if (ns->params.zone_file) {
+        ns->zone_meta->dirty = true;
+    }
+}
+
 typedef struct NvmeCtrl NvmeCtrl;
 
 int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
@@ -243,5 +275,6 @@ static inline void nvme_aor_dec_active(NvmeNamespace *ns)
 void nvme_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone);
 void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone);
 NvmeZone *nvme_remove_zone_head(NvmeNamespace *ns, NvmeZoneList *zl);
+void nvme_sync_zone_file(NvmeNamespace *ns, NvmeZone *zone, int len);
 
 #endif /* NVME_NS_H */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 80973f3ff6..ff7d43d38f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -165,6 +165,8 @@ static void nvme_assign_zone_state(NvmeNamespace *ns, NvmeZone *zone,
     default:
         zone->d.za = 0;
     }
+
+    nvme_sync_zone_file(ns, zone, sizeof(NvmeZone));
 }
 
 /*
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 386f28e457..1ea4846443 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -103,6 +103,7 @@ pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_
 pci_nvme_power_on_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
 pci_nvme_power_on_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
 pci_nvme_power_on_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Full state"
+pci_nvme_initialized_zone_file(char *zfile_name) "initialized zone file %s"
 pci_nvme_mapped_zone_file(char *zfile_name, int ret) "mapped zone file %s, error %d"
 
 # nvme traces for error conditions
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 14/14] hw/block/nvme: Document zoned parameters in usage text
  2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (12 preceding siblings ...)
  2020-09-28  2:35 ` [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
@ 2020-09-28  2:35 ` Dmitry Fomichev
  13 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-28  2:35 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Added brief descriptions of the new device properties that are
now available to users to configure the features of the Zoned
Namespace Command Set in the emulator.

This patch is for documentation only, no functionality change.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ff7d43d38f..34fc6daf9d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.4, 1.3, 1.2, 1.1, 1.0e
  *
  *  https://nvmexpress.org/developers/nvme-specification/
  */
@@ -23,7 +23,8 @@
  *              max_ioqpairs=<N[optional]>, \
  *              aerl=<N[optional]>, aer_max_queued=<N[optional]>, \
  *              mdts=<N[optional]>
- *      -device nvme-ns,drive=<drive_id>,bus=bus_name,nsid=<nsid>
+ *      -device nvme-ns,drive=<drive_id>,bus=bus_name,nsid=<nsid>, \
+ *              zoned=<true|false[optional]>
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -49,6 +50,45 @@
 *   completion when there are no outstanding AERs. When the maximum number of
  *   enqueued events are reached, subsequent events will be dropped.
  *
+ * Setting `zoned` to true selects the Zoned Namespace Command Set for the
+ * namespace. In this case, the following options are available to configure
+ * zoned operation:
+ *     zone_size=<zone size in MiB, default: 128MiB>
+ *
+ *     zone_capacity=<zone capacity in MiB, default: zone_size>
+ *         The value 0 (default) forces zone capacity to be the same as zone
+ *         size. The value of this property may not exceed zone size.
+ *
+ *     zone_file=<zone metadata file name, default: none>
+ *         Zone metadata file, if specified, allows zone information
+ *         to be persistent across shutdowns and restarts.
+ *
+ *     zone_descr_ext_size=<zone descriptor extension size, default 0>
+ *         This value must be a multiple of 64 bytes. If it is zero,
+ *         namespace(s) will not support zone descriptor extensions.
+ *
+ *     max_active=<Maximum Active Resources (zones), default: 0 - no limit>
+ *
+ *     max_open=<Maximum Open Resources (zones), default: 0 - no limit>
+ *
+ *     zone_append_size_limit=<zone append size limit, in KiB, default: MDTS>
+ *         The maximum I/O size that can be supported by the Zone Append
+ *         command. Since this value is maintained internally as
+ *         ZASL = log2(<maximum append size> / <page size>), some
+ *         values assigned to this property may be rounded down and
+ *         result in a lower maximum ZA data size being in effect.
+ *         If MDTS property is not assigned, the default value of 128KiB is
+ *         used as ZASL.
+ *
+ *     offline_zones=<the number of offline zones to inject, default: 0>
+ *
+ *     rdonly_zones=<the number of read-only zones to inject, default: 0>
+ *
+ *     cross_zone_read=<enables Read Across Zone Boundaries, default: true>
+ *
+ *     fill_pattern=<data fill pattern, default: 0x00>
+ *         The byte pattern to return for any portions of unwritten data
+ *         during read.
  */
 
 #include "qemu/osdep.h"
-- 
2.21.0
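
For reference, a full invocation combining the options documented above
might look like this (a sketch only -- the drive id, serial and sizes
are invented, and per the patches in this series some of the zoned
options attach to the controller rather than the namespace):

    -drive file=zns.raw,id=nvme-drv0,format=raw,if=none \
    -device nvme,serial=deadbeef,id=nvme0,mdts=7 \
    -device nvme-ns,drive=nvme-drv0,bus=nvme0,nsid=1,zoned=true, \
            zone_size=128,zone_capacity=96

As an example of the rounding described for zone_append_size_limit:
with 4KiB pages, a limit of 96KiB gives ZASL = floor(log2(96 / 4)) = 4,
so the effective maximum append size is 2^4 pages, i.e. 64KiB.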



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-28  2:35 ` [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
@ 2020-09-28  6:44   ` Klaus Jensen
  2020-09-28 10:42   ` Klaus Jensen
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: Klaus Jensen @ 2020-09-28  6:44 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 4859 bytes --]

On Sep 28 11:35, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c         |   2 +-
>  hw/block/nvme-ns.c   | 185 ++++++++-
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -1040,18 +1318,468 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
>          goto invalid;
>      }
>  
> +    if (ns->params.zoned) {
> +        zone_idx = nvme_zone_idx(ns, slba);
> +        assert(zone_idx < ns->num_zones);
> +        zone = &ns->zone_array[zone_idx];
> +
> +        if (is_write) {
> +            status = nvme_check_zone_write(zone, slba, nlb);
> +            if (status != NVME_SUCCESS) {
> +                trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
> +                goto invalid;
> +            }
> +
> +            assert(nvme_wp_is_valid(zone));
> +            if (append) {
> +                if (unlikely(slba != zone->d.zslba)) {
> +                    trace_pci_nvme_err_append_not_at_start(slba, zone->d.zslba);
> +                    status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +                    goto invalid;
> +                }
> +                if (data_size > (n->page_size << n->zasl)) {
> +                    trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
> +                    status = NVME_INVALID_FIELD | NVME_DNR;
> +                    goto invalid;
> +                }
> +                slba = zone->w_ptr;
> +            } else if (unlikely(slba != zone->w_ptr)) {
> +                trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
> +                                                   zone->w_ptr);
> +                status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +                goto invalid;
> +            }
> +            req->fill_ofs = -1LL;
> +        } else {
> +            status = nvme_check_zone_read(ns, zone, slba, nlb);
> +            if (status != NVME_SUCCESS) {
> +                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
> +                goto invalid;
> +            }
> +
> +            if (slba + nlb > zone->w_ptr) {
> +                /*
> +                 * All or some data is read above the WP. Need to
> +                 * fill out the buffer area that has no backing data
> +                 * with a predefined data pattern (zeros by default)
> +                 */
> +                if (slba >= zone->w_ptr) {
> +                    req->fill_ofs = 0;
> +                } else {
> +                    req->fill_ofs = nvme_l2b(ns, zone->w_ptr - slba);
> +                }
> +                req->fill_len = nvme_l2b(ns,
> +                    nvme_zone_rd_boundary(ns, zone) - slba);

OK then. Next edge case.

Now what happens if the read crosses into a partially written zone and
reads above the write pointer in that zone?

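To make that concrete (a hypothetical case, 512-byte LBAs): take a zone
with zslba = 0 and w_ptr = 10, and a read with slba = 8, nlb = 8. Then
slba + nlb = 16 > w_ptr, and since slba < w_ptr, fill_ofs =
nvme_l2b(ns, 2) = 1024, i.e. the last six LBAs of the buffer get the
fill pattern. But the single fill_ofs/fill_len pair is computed against
this zone's write pointer only, so if cross_zone_read lets the request
run into a partially written next zone, there seems to be no way to
describe the unwritten gap in that second zone.
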
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence
  2020-09-28  2:35 ` [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
@ 2020-09-28  7:51   ` Klaus Jensen
  2020-09-29 15:43     ` Dmitry Fomichev
  0 siblings, 1 reply; 46+ messages in thread
From: Klaus Jensen @ 2020-09-28  7:51 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 3463 bytes --]

On Sep 28 11:35, Dmitry Fomichev wrote:
> A ZNS drive that is emulated by this module is currently initialized
> with all zones Empty upon startup. However, actual ZNS SSDs save the
> state and condition of all zones in their internal NVRAM in the event
> of power loss. When such a drive is powered up again, it closes or
> finishes all zones that were open at the moment of shutdown. Besides
> that, the write pointer position as well as the state and condition
> of all zones is preserved across power-downs.
> 
> This commit adds the capability to have a persistent zone metadata
> to the device. The new optional module property, "zone_file",
> is introduced. If added to the command line, this property specifies
> the name of the file that stores the zone metadata. If "zone_file" is
> omitted, the device will be initialized with all zones empty, the same
> as before.
> 
> If zone metadata is configured to be persistent, then zone descriptor
> extensions also persist across controller shutdowns.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.c    | 341 ++++++++++++++++++++++++++++++++++++++++--
>  hw/block/nvme-ns.h    |  33 ++++
>  hw/block/nvme.c       |   2 +
>  hw/block/trace-events |   1 +
>  4 files changed, 362 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 47751f2d54..a94021da81 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -293,12 +421,180 @@ static void nvme_init_zone_meta(NvmeNamespace *ns)
>              i--;
>          }
>      }
> +
> +    if (ns->params.zone_file) {
> +        nvme_set_zone_meta_dirty(ns);
> +    }
> +}
> +
> +static int nvme_open_zone_file(NvmeNamespace *ns, bool *init_meta,
> +                               Error **errp)
> +{
> +    Object *file_be;
> +    HostMemoryBackend *fb;
> +    struct stat statbuf;
> +    int ret;
> +
> +    ret = stat(ns->params.zone_file, &statbuf);
> +    if (ret && errno == ENOENT) {
> +        *init_meta = true;
> +    } else if (!S_ISREG(statbuf.st_mode)) {
> +        error_setg(errp, "\"%s\" is not a regular file",
> +                   ns->params.zone_file);
> +        return -1;
> +    }
> +
> +    file_be = object_new(TYPE_MEMORY_BACKEND_FILE);
> +    object_property_set_str(file_be, "mem-path", ns->params.zone_file,
> +                            &error_abort);
> +    object_property_set_int(file_be, "size", ns->meta_size, &error_abort);
> +    object_property_set_bool(file_be, "share", true, &error_abort);
> +    object_property_set_bool(file_be, "discard-data", false, &error_abort);
> +    if (!user_creatable_complete(USER_CREATABLE(file_be), errp)) {
> +        object_unref(file_be);
> +        return -1;
> +    }
> +    object_property_add_child(OBJECT(ns), "_fb", file_be);
> +    object_unref(file_be);
> +
> +    fb = MEMORY_BACKEND(file_be);
> +    ns->zone_mr = host_memory_backend_get_memory(fb);
> +
> +    return 0;
> +}
> +
> +static int nvme_map_zone_file(NvmeNamespace *ns, bool *init_meta)
> +{
> +    ns->zone_meta = (void *)memory_region_get_ram_ptr(ns->zone_mr);

I forgot that the HostMemoryBackend doesn't magically make the memory
available to the device, so of course this is still needed.

Anyway.

No reason for me to keep complaining about this. I do not like it, I
will not ACK it and I think I made my reasons pretty clear.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/14] hw/block/nvme: Report actual LBA data shift in LBAF
  2020-09-28  2:35 ` [PATCH v5 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
@ 2020-09-28  8:51   ` Klaus Jensen
  0 siblings, 0 replies; 46+ messages in thread
From: Klaus Jensen @ 2020-09-28  8:51 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 1397 bytes --]

On Sep 28 11:35, Dmitry Fomichev wrote:
> Calculate the data shift value to report based on the set value of
> the logical_block_size device property.
> 
> In the process, use a local variable to calculate the LBA format
> index instead of the hardcoded value 0. This makes the code more
> readable and it will make it easier to add support for multiple LBA
> formats in the future.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme-ns.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 2ba0263dda..bbd7879492 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -47,6 +47,8 @@ static void nvme_ns_init(NvmeNamespace *ns)
>  
>  static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>  {
> +    int lba_index;
> +
>      if (!blkconf_blocksizes(&ns->blkconf, errp)) {
>          return -1;
>      }
> @@ -67,6 +69,9 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>          n->features.vwc = 0x1;
>      }
>  
> +    lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> +    ns->id_ns.lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);

You fix this later in the zoned support patch, but this should use
ns->blkconf.conf.logical_block_size.
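
(For what it's worth, the arithmetic itself is sound either way: with
512-byte blocks, 31 - clz32(512) = 31 - 22 = 9, and with 4096-byte
blocks it yields 12 -- the fix is only about which config struct the
value is read from.)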

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-28  2:35 ` [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
  2020-09-28  6:44   ` Klaus Jensen
@ 2020-09-28 10:42   ` Klaus Jensen
  2020-09-30  5:20     ` Klaus Jensen
  2020-10-05  0:53     ` Dmitry Fomichev
  2020-09-30  5:59   ` Klaus Jensen
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 46+ messages in thread
From: Klaus Jensen @ 2020-09-28 10:42 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 7269 bytes --]

On Sep 28 11:35, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 

I think the fill_pattern feature stands separate, so it would be nice to
extract that to a patch on its own.

> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c         |   2 +-
>  hw/block/nvme-ns.c   | 185 ++++++++-
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 04172f083e..daa13546c4 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -38,7 +38,6 @@ typedef struct NvmeZoneList {
>  
>  typedef struct NvmeNamespaceParams {
>      uint32_t nsid;
> -    uint8_t  csi;
>      bool     attached;
>      QemuUUID uuid;
>  
> @@ -52,6 +51,7 @@ typedef struct NvmeNamespace {
>      DeviceState  parent_obj;
>      BlockConf    blkconf;
>      int32_t      bootindex;
> +    uint8_t      csi;
>      int64_t      size;
>      NvmeIdNs     id_ns;

This should be squashed into the namespace types patch.

> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 63ad03d6d6..38e25a4d1f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -54,6 +54,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qemu/error-report.h"
> +#include "crypto/random.h"

I think this is not used until the offline/read-only zones injection
patch, right?

> +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> +                                      bool failed)
> +{
> +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +    NvmeZone *zone;
> +    uint64_t slba, start_wp = req->cqe.result64;
> +    uint32_t nlb, zone_idx;
> +    uint8_t zs;
> +
> +    if (rw->opcode != NVME_CMD_WRITE &&
> +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> +        return false;
> +    }
> +
> +    slba = le64_to_cpu(rw->slba);
> +    nlb = le16_to_cpu(rw->nlb) + 1;
> +    zone_idx = nvme_zone_idx(ns, slba);
> +    assert(zone_idx < ns->num_zones);
> +    zone = &ns->zone_array[zone_idx];
> +
> +    if (!failed && zone->w_ptr < start_wp + nlb) {
> +        /*
> +         * A preceding queued write to the zone has failed,
> +         * now this write is not at the WP, fail it too.
> +         */
> +        failed = true;
> +    }
> +
> +    if (failed) {
> +        if (zone->w_ptr > start_wp) {
> +            zone->w_ptr = start_wp;
> +        }

It is possible (though unlikely) that you already posted the CQE for the
write that moved the WP to w_ptr - and now you are reverting it.  This
looks like a recipe for data corruption to me.

Take this example. I use append, because if you have multiple regular
writes in queue you're screwed anyway.

  w_ptr = 0, d.wp = 0
  append 1 lba  -> w_ptr = 1, start_wp = 0, issues aio A
  append 2 lbas -> w_ptr = 3, start_wp = 1, issues aio B

  aio B success -> d.wp = 2 (since you are adding nlb),

Now, I totally do the same. Even though the zone descriptor write
pointer gets "out of sync", it will be reconciled in the absence of
failures, and it's fair to define that the host cannot expect a
consistent view of the write pointer without quiescing I/O.

The problem is if a write then fails:

  aio A fails   -> w_ptr > start_wp (3 > 1), so you revert to w_ptr = 1

That looks bad to me. I don't think this is ever reconciled? If another
append then comes in:

  append 1 lba -> w_ptr = 2, start_wp = 1, issues aio C and overwrites
                                           the second append from before.
  aio C success -> d.wp = 3 (but it should be 2)

> @@ -1513,11 +2267,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    NvmeIdCtrlZoned id = {};
>  
>      trace_pci_nvme_identify_ctrl_csi(c->csi);
>  
>      if (c->csi == NVME_CSI_NVM) {
>          return nvme_rpt_empty_id_struct(n, req);
> +    } else if (c->csi == NVME_CSI_ZONED) {
> +        id.zasl = n->zasl;

I don't think it should overwrite the zasl value specified by the user.
If the user specified 0, then it should return 0 for zasl here.

> @@ -2310,16 +3086,28 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>              continue;
>          }
>          ns->params.attached = false;
> -        switch (ns->params.csi) {
> +        switch (ns->csi) {
>          case NVME_CSI_NVM:
>              if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
>                  NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
>                  ns->params.attached = true;
>              }
>              break;
> +        case NVME_CSI_ZONED:
> +            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> +                ns->params.attached = true;
> +            }
> +            break;
>          }
>      }
>  
> +    if (!n->zasl_bs) {
> +        assert(n->params.mdts);

A value of 0 for MDTS is perfectly valid.
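
A more tolerant fallback might be (a sketch; per the spec, a ZASL of 0
means the Zone Append limit simply follows MDTS, which may itself be 0,
i.e. unlimited):

    if (!n->zasl_bs) {
        /*
         * Fall back to MDTS. Both may legitimately be 0 ("no limit"),
         * so the append size check would have to treat 0 specially.
         */
        n->zasl = n->params.mdts;
    }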

> @@ -2382,10 +3170,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>                  case CSS_NVM_ONLY:
>                      trace_pci_nvme_css_nvm_cset_selected_by_host(data &
>                                                                   0xffffffff);
> -                    break;
> +                break;

Spurious misaligned break here.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence
  2020-09-28  7:51   ` Klaus Jensen
@ 2020-09-29 15:43     ` Dmitry Fomichev
  2020-09-29 16:46       ` Klaus Jensen
  0 siblings, 1 reply; 46+ messages in thread
From: Dmitry Fomichev @ 2020-09-29 15:43 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling



> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Monday, September 28, 2020 3:52 AM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for
> persistence
> 
> On Sep 28 11:35, Dmitry Fomichev wrote:
> > A ZNS drive that is emulated by this module is currently initialized
> > with all zones Empty upon startup. However, actual ZNS SSDs save the
> > state and condition of all zones in their internal NVRAM in the event
> > of power loss. When such a drive is powered up again, it closes or
> > finishes all zones that were open at the moment of shutdown. Besides
> > that, the write pointer position as well as the state and condition
> > of all zones is preserved across power-downs.
> >
> > This commit adds the capability to have a persistent zone metadata
> > to the device. The new optional module property, "zone_file",
> > is introduced. If added to the command line, this property specifies
> > the name of the file that stores the zone metadata. If "zone_file" is
> > omitted, the device will be initialized with all zones empty, the same
> > as before.
> >
> > If zone metadata is configured to be persistent, then zone descriptor
> > extensions also persist across controller shutdowns.
> >
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme-ns.c    | 341
> ++++++++++++++++++++++++++++++++++++++++--
> >  hw/block/nvme-ns.h    |  33 ++++
> >  hw/block/nvme.c       |   2 +
> >  hw/block/trace-events |   1 +
> >  4 files changed, 362 insertions(+), 15 deletions(-)
> >
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 47751f2d54..a94021da81 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -293,12 +421,180 @@ static void
> nvme_init_zone_meta(NvmeNamespace *ns)
> >              i--;
> >          }
> >      }
> > +
> > +    if (ns->params.zone_file) {
> > +        nvme_set_zone_meta_dirty(ns);
> > +    }
> > +}
> > +
> > +static int nvme_open_zone_file(NvmeNamespace *ns, bool *init_meta,
> > +                               Error **errp)
> > +{
> > +    Object *file_be;
> > +    HostMemoryBackend *fb;
> > +    struct stat statbuf;
> > +    int ret;
> > +
> > +    ret = stat(ns->params.zone_file, &statbuf);
> > +    if (ret && errno == ENOENT) {
> > +        *init_meta = true;
> > +    } else if (!S_ISREG(statbuf.st_mode)) {
> > +        error_setg(errp, "\"%s\" is not a regular file",
> > +                   ns->params.zone_file);
> > +        return -1;
> > +    }
> > +
> > +    file_be = object_new(TYPE_MEMORY_BACKEND_FILE);
> > +    object_property_set_str(file_be, "mem-path", ns->params.zone_file,
> > +                            &error_abort);
> > +    object_property_set_int(file_be, "size", ns->meta_size, &error_abort);
> > +    object_property_set_bool(file_be, "share", true, &error_abort);
> > +    object_property_set_bool(file_be, "discard-data", false, &error_abort);
> > +    if (!user_creatable_complete(USER_CREATABLE(file_be), errp)) {
> > +        object_unref(file_be);
> > +        return -1;
> > +    }
> > +    object_property_add_child(OBJECT(ns), "_fb", file_be);
> > +    object_unref(file_be);
> > +
> > +    fb = MEMORY_BACKEND(file_be);
> > +    ns->zone_mr = host_memory_backend_get_memory(fb);
> > +
> > +    return 0;
> > +}
> > +
> > +static int nvme_map_zone_file(NvmeNamespace *ns, bool *init_meta)
> > +{
> > +    ns->zone_meta = (void *)memory_region_get_ram_ptr(ns->zone_mr);
> 
> I forgot that the HostMemoryBackend doesn't magically make the memory
> available to the device, so of course this is still needed.
> 
> Anyway.
> 
> No reason for me to keep complaining about this. I do not like it, I
> will not ACK it and I think I made my reasons pretty clear.

So, memory_region_msync() is ok, but memory_region_get_ram_ptr() is not??
This is the same API! You are really splitting hairs here to suit your agenda.
Moving goal posts again....

The "I do not like it" part is priceless. It is great that we have mail archives available.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence
  2020-09-29 15:43     ` Dmitry Fomichev
@ 2020-09-29 16:46       ` Klaus Jensen
  0 siblings, 0 replies; 46+ messages in thread
From: Klaus Jensen @ 2020-09-29 16:46 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 5376 bytes --]

On Sep 29 15:43, Dmitry Fomichev wrote:
> 
> 
> > -----Original Message-----
> > From: Klaus Jensen <its@irrelevant.dk>
> > Sent: Monday, September 28, 2020 3:52 AM
> > To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> > Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> > <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> > Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> > <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> > <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> > qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> > <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> > Subject: Re: [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for
> > persistence
> > 
> > On Sep 28 11:35, Dmitry Fomichev wrote:
> > > A ZNS drive that is emulated by this module is currently initialized
> > > with all zones Empty upon startup. However, actual ZNS SSDs save the
> > > state and condition of all zones in their internal NVRAM in the event
> > > of power loss. When such a drive is powered up again, it closes or
> > > finishes all zones that were open at the moment of shutdown. Besides
> > > that, the write pointer position as well as the state and condition
> > > of all zones is preserved across power-downs.
> > >
> > > This commit adds the capability to have a persistent zone metadata
> > > to the device. The new optional module property, "zone_file",
> > > is introduced. If added to the command line, this property specifies
> > > the name of the file that stores the zone metadata. If "zone_file" is
> > > omitted, the device will be initialized with all zones empty, the same
> > > as before.
> > >
> > > If zone metadata is configured to be persistent, then zone descriptor
> > > extensions also persist across controller shutdowns.
> > >
> > > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > ---
> > >  hw/block/nvme-ns.c    | 341
> > ++++++++++++++++++++++++++++++++++++++++--
> > >  hw/block/nvme-ns.h    |  33 ++++
> > >  hw/block/nvme.c       |   2 +
> > >  hw/block/trace-events |   1 +
> > >  4 files changed, 362 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > > index 47751f2d54..a94021da81 100644
> > > --- a/hw/block/nvme-ns.c
> > > +++ b/hw/block/nvme-ns.c
> > > @@ -293,12 +421,180 @@ static void
> > nvme_init_zone_meta(NvmeNamespace *ns)
> > >              i--;
> > >          }
> > >      }
> > > +
> > > +    if (ns->params.zone_file) {
> > > +        nvme_set_zone_meta_dirty(ns);
> > > +    }
> > > +}
> > > +
> > > +static int nvme_open_zone_file(NvmeNamespace *ns, bool *init_meta,
> > > +                               Error **errp)
> > > +{
> > > +    Object *file_be;
> > > +    HostMemoryBackend *fb;
> > > +    struct stat statbuf;
> > > +    int ret;
> > > +
> > > +    ret = stat(ns->params.zone_file, &statbuf);
> > > +    if (ret && errno == ENOENT) {
> > > +        *init_meta = true;
> > > +    } else if (!S_ISREG(statbuf.st_mode)) {
> > > +        error_setg(errp, "\"%s\" is not a regular file",
> > > +                   ns->params.zone_file);
> > > +        return -1;
> > > +    }
> > > +
> > > +    file_be = object_new(TYPE_MEMORY_BACKEND_FILE);
> > > +    object_property_set_str(file_be, "mem-path", ns->params.zone_file,
> > > +                            &error_abort);
> > > +    object_property_set_int(file_be, "size", ns->meta_size, &error_abort);
> > > +    object_property_set_bool(file_be, "share", true, &error_abort);
> > > +    object_property_set_bool(file_be, "discard-data", false, &error_abort);
> > > +    if (!user_creatable_complete(USER_CREATABLE(file_be), errp)) {
> > > +        object_unref(file_be);
> > > +        return -1;
> > > +    }
> > > +    object_property_add_child(OBJECT(ns), "_fb", file_be);
> > > +    object_unref(file_be);
> > > +
> > > +    fb = MEMORY_BACKEND(file_be);
> > > +    ns->zone_mr = host_memory_backend_get_memory(fb);
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int nvme_map_zone_file(NvmeNamespace *ns, bool *init_meta)
> > > +{
> > > +    ns->zone_meta = (void *)memory_region_get_ram_ptr(ns->zone_mr);
> > 
> > I forgot that the HostMemoryBackend doesn't magically make the memory
> > available to the device, so of course this is still needed.
> > 
> > Anyway.
> > 
> > No reason for me to keep complaining about this. I do not like it, I
> > will not ACK it and I think I made my reasons pretty clear.
> 
> So, memory_region_msync() is ok, but memory_region_get_ram_ptr() is not??
> This is the same API! You are really splitting hairs here to suit your agenda.
> Moving goal posts again....
> 
> The "I do not like it" part is priceless. It is great that we have mail archives available.
> 

If you read my review again, it's pretty clear that I am calling out the
abstraction. I was clear that if it *really* had to be mmap based, then
it should use hostmem. Sorry for moving your patchset forward by
suggesting an improvement.

But again, I also made it pretty clear that I did not agree with the
abstraction. And that I very much disliked that it was non-portable. And
had endianness issues. I made it SUPER clear that that was why I "did not
like it".

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-28 10:42   ` Klaus Jensen
@ 2020-09-30  5:20     ` Klaus Jensen
  2020-10-05  0:53     ` Dmitry Fomichev
  1 sibling, 0 replies; 46+ messages in thread
From: Klaus Jensen @ 2020-09-30  5:20 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]

On Sep 28 12:42, Klaus Jensen wrote:
> On Sep 28 11:35, Dmitry Fomichev wrote:
> > The emulation code has been changed to advertise NVM Command Set when
> > "zoned" device property is not set (default) and Zoned Namespace
> > Command Set otherwise.
> > 
> > Handlers for three new NVMe commands introduced in Zoned Namespace
> > Command Set specification are added, namely for Zone Management
> > Receive, Zone Management Send and Zone Append.
> > 
> > Device initialization code has been extended to create a proper
> > configuration for zoned operation using device properties.
> > 
> > Read/Write command handler is modified to only allow writes at the
> > write pointer if the namespace is zoned. For Zone Append command,
> > writes implicitly happen at the write pointer and the starting write
> > pointer value is returned as the result of the command. Write Zeroes
> > handler is modified to add zoned checks that are identical to those
> > done as a part of Write flow.
> > 
> > The code to support Zone Descriptor Extensions is not included in
> > this commit and ZDES 0 is always reported. A later commit in this
> > series will add ZDE support.
> > 
> > This commit doesn't yet include checks for active and open zone
> > limits. It is assumed that there are no limits on either active or
> > open zones.
> > 
> 
> I think the fill_pattern feature stands separate, so it would be nice to
> extract that to a patch on its own.
> 

Please disregard this.

Since the fill_pattern feature is tightly bound to reading in zones, it
doesn't really make sense to extract it.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-28  2:35 ` [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
  2020-09-28  6:44   ` Klaus Jensen
  2020-09-28 10:42   ` Klaus Jensen
@ 2020-09-30  5:59   ` Klaus Jensen
  2020-10-04 23:48     ` Dmitry Fomichev
  2020-09-30 14:50   ` Niklas Cassel
  2020-09-30 15:12   ` Niklas Cassel
  4 siblings, 1 reply; 46+ messages in thread
From: Klaus Jensen @ 2020-09-30  5:59 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 4734 bytes --]

On Sep 28 11:35, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c         |   2 +-
>  hw/block/nvme-ns.c   | 185 ++++++++-
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> +static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
> +{
> +    uint64_t zone_size, zone_cap;
> +    uint32_t nz, lbasz = ns->blkconf.logical_block_size;
> +
> +    if (ns->params.zone_size_mb) {
> +        zone_size = ns->params.zone_size_mb;
> +    } else {
> +        zone_size = NVME_DEFAULT_ZONE_SIZE;
> +    }
> +    if (ns->params.zone_capacity_mb) {
> +        zone_cap = ns->params.zone_capacity_mb;
> +    } else {
> +        zone_cap = zone_size;
> +    }

I think a check that zone_capacity_mb is less than or equal to
zone_size_mb is missing earlier?
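
Something along these lines, right after the defaults are applied
(a sketch; the error wording is made up):

    if (zone_cap > zone_size) {
        error_setg(errp, "zone capacity %"PRIu64"M exceeds zone size "
                   "%"PRIu64"M", zone_cap, zone_size);
        return -1;
    }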

> +static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
> +                              Error **errp)
> +{
> +    NvmeIdNsZoned *id_ns_z;
> +
> +    if (n->params.fill_pattern == 0) {
> +        ns->id_ns.dlfeat |= 0x01;
> +    } else if (n->params.fill_pattern == 0xff) {
> +        ns->id_ns.dlfeat |= 0x02;
> +    }
> +
> +    if (nvme_calc_zone_geometry(ns, errp) != 0) {
> +        return -1;
> +    }
> +
> +    nvme_init_zone_meta(ns);
> +
> +    id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
> +
> +    /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
> +    id_ns_z->mar = 0xffffffff;
> +    id_ns_z->mor = 0xffffffff;
> +    id_ns_z->zoc = 0;
> +    id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
> +
> +    id_ns_z->lbafe[lba_index].zsze = cpu_to_le64(ns->zone_size);
> +    id_ns_z->lbafe[lba_index].zdes = 0; /* FIXME make helper */
> +
> +    ns->csi = NVME_CSI_ZONED;
> +    ns->id_ns.ncap = cpu_to_le64(ns->zone_capacity * ns->num_zones);
> +    ns->id_ns.nuse = ns->id_ns.ncap;
> +    ns->id_ns.nsze = ns->id_ns.ncap;
> +

NSZE should be in terms of ZSZE. We *can* report NCAP < NSZE if zcap !=
zsze, but that requires bit 1 set in NSFEAT and proper reporting of
NUSE.
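
Concretely, something like this (a sketch; as noted, reporting
NCAP < NSZE would additionally require setting NSFEAT bit 1 and
accounting for NUSE properly):

    ns->id_ns.nsze = cpu_to_le64(ns->num_zones * ns->zone_size);
    ns->id_ns.ncap = cpu_to_le64(ns->num_zones * ns->zone_capacity);
    ns->id_ns.nuse = ns->id_ns.ncap;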

> @@ -133,6 +304,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
>  static Property nvme_ns_props[] = {
>      DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
>      DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
> +
> +    DEFINE_PROP_BOOL("zoned", NvmeNamespace, params.zoned, false),
> +    DEFINE_PROP_UINT64("zone_size", NvmeNamespace, params.zone_size_mb,
> +                       NVME_DEFAULT_ZONE_SIZE),
> +    DEFINE_PROP_UINT64("zone_capacity", NvmeNamespace,
> +                       params.zone_capacity_mb, 0),

There is a nice DEFINE_PROP_SIZE that handles sizes in a nice way (i.e.
1G, 1M).
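
E.g. (a sketch; params.zone_size_bs is an invented field name -- the
value would then be in bytes, so the geometry calculation and the
default would need adjusting accordingly):

    DEFINE_PROP_SIZE("zone_size", NvmeNamespace, params.zone_size_bs,
                     NVME_DEFAULT_ZONE_SIZE),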


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 03/14] hw/block/nvme: Introduce the Namespace Types definitions
  2020-09-28  2:35 ` [PATCH v5 03/14] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
@ 2020-09-30  8:08   ` Klaus Jensen
  2020-09-30 15:21   ` Keith Busch
  1 sibling, 0 replies; 46+ messages in thread
From: Klaus Jensen @ 2020-09-30  8:08 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 981 bytes --]

On Sep 28 11:35, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Define the structures and constants required to implement
> Namespace Types support.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.h   |  2 ++
>  hw/block/nvme.c      |  2 +-
>  include/block/nvme.h | 74 +++++++++++++++++++++++++++++++++++---------
>  3 files changed, 63 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 83734f4606..cca23bc0b3 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -21,6 +21,8 @@
>  
>  typedef struct NvmeNamespaceParams {
>      uint32_t nsid;
> +    uint8_t  csi;
> +    QemuUUID uuid;
>  } NvmeNamespaceParams;

The motivation behind the NvmeNamespaceParams was to only keep user
visible parameters in this struct.

Can we move csi/uuid to the NvmeNamespace struct?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-09-28  2:35 ` [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
@ 2020-09-30  8:15   ` Klaus Jensen
  2020-09-30 12:47   ` Niklas Cassel
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Klaus Jensen @ 2020-09-30  8:15 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 1737 bytes --]

On Sep 28 11:35, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set and enable
> detection for the NVM Command Set.
> 
> The new workflows for identify commands rely heavily on zero-filled
> identify structs. E.g., certain CNS commands are defined to return
> a zero-filled identify struct when an inactive namespace NSID
> is supplied.
> 
> Add a helper function in order to avoid code duplication when
> reporting zero-filled identify structures.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.c |   3 +
>  hw/block/nvme.c    | 210 +++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 175 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index bbd7879492..31b7f986c3 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -40,6 +40,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
>  
>      id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
>  
> +    ns->params.csi = NVME_CSI_NVM;
> +    qemu_uuid_generate(&ns->params.uuid); /* TODO make UUIDs persistent */
> +

It is straightforward to put this into a 'uuid' nvme-ns parameter using
DEFINE_PROP_UUID. That will default to 'auto', which will generate a
UUID for each invocation, but if the user requires it to be
"persistent", it can be specified explicitly.

If you choose to do this, please extract to separate patch. Or I can
post it on top of nvme-next if you like.
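
For reference, that boils down to a single property line (assuming
params.uuid stays a QemuUUID):

    DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),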

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-09-28  2:35 ` [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
  2020-09-30  8:15   ` Klaus Jensen
@ 2020-09-30 12:47   ` Niklas Cassel
  2020-10-01 11:22   ` Niklas Cassel
  2020-10-01 22:15   ` Klaus Jensen
  3 siblings, 0 replies; 46+ messages in thread
From: Niklas Cassel @ 2020-09-30 12:47 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Klaus Jensen,
	qemu-devel, Maxim Levitsky, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set and enable
> detection for the NVM Command Set.
> 
> The new workflows for identify commands rely heavily on zero-filled
> identify structs. E.g., certain CNS commands are defined to return
> a zero-filled identify struct when an inactive namespace NSID
> is supplied.
> 
> Add a helper function in order to avoid code duplication when
> reporting zero-filled identify structures.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.c |   3 +
>  hw/block/nvme.c    | 210 +++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 175 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index bbd7879492..31b7f986c3 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c

(snip)

> @@ -1597,12 +1667,31 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>       * Namespace Identification Descriptor. Add a very basic Namespace UUID
>       * here.
>       */
> -    ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
> -    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
> -    stl_be_p(&ns_descrs->uuid.v, nsid);
> +    desc = list_ptr;
> +    desc->nidt = NVME_NIDT_UUID;
> +    desc->nidl = NVME_NIDL_UUID;
> +    list_ptr += sizeof(*desc);
> +    memcpy(list_ptr, ns->params.uuid.data, NVME_NIDL_UUID);
> +    list_ptr += NVME_NIDL_UUID;
>  
> -    return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
> -                    DMA_DIRECTION_FROM_DEVICE, req);
> +    desc = list_ptr;
> +    desc->nidt = NVME_NIDT_CSI;
> +    desc->nidl = NVME_NIDL_CSI;
> +    list_ptr += sizeof(*desc);
> +    *(uint8_t *)list_ptr = NVME_CSI_NVM;

I think that we should use ns->csi/ns->params.csi here rather than
NVME_CSI_NVM.
You do this change in a later patch, but I think it is more correct
to do it here already. (No reason not to, since ns->csi/ns->params.csi
should be set to NVME_CSI_NVM for an NVM namespace already in this patch.)

> +
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> +}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-09-28  2:35 ` [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces Dmitry Fomichev
@ 2020-09-30 13:50   ` Niklas Cassel
  2020-10-04 23:54     ` Dmitry Fomichev
  0 siblings, 1 reply; 46+ messages in thread
From: Niklas Cassel @ 2020-09-30 13:50 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Klaus Jensen,
	qemu-devel, Maxim Levitsky, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Sep 28, 2020 at 11:35:20AM +0900, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> In NVMe, a namespace is active if it exists and is attached to the
> controller.
> 
> CAP.CSS (together with the I/O Command Set data structure) defines what
> command sets are supported by the controller.
> 
> CC.CSS (together with Set Profile) can be set to enable a subset of the
> available command sets. The namespaces belonging to a disabled command set
> will not be able to attach to the controller, and will thus be inactive.
> 
> E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> marked as inactive.
> 
> The identify namespace, the identify namespace CSI specific, and the namespace
> list commands have two different versions, one that only shows active
> namespaces, and the other version that shows existing namespaces, regardless
> of whether the namespace is attached or not.
> 
> Add an attached member to struct NvmeNamespace, and implement the missing CNS
> commands.
> 
> The added functionality will also simplify the implementation of namespace
> management in the future, since namespace management can also attach and
> detach namespaces.

Following my previous discussion with Klaus,
I think we need to rewrite this commit message completely:

Subject: hw/block/nvme: Add support for allocated CNS command variants

Many CNS commands have "allocated" command variants.
These include a namespace as long as it is allocated
(i.e. a namespace is included regardless of whether it is active
(attached) or not).

While these commands are optional (they are mandatory for controllers
supporting the namespace attachment command), our QEMU implementation
is more complete by actually providing support for these CNS values.

However, since our QEMU model currently does not support the namespace
attachment command, these new allocated CNS commands will return the same
result as the active CNS command variants.

In NVMe, a namespace is active if it exists and is attached to the
controller.

CAP.CSS (together with the I/O Command Set data structure) defines what
command sets are supported by the controller.

CC.CSS (together with Set Profile) can be set to enable a subset of the
available command sets.

Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
will still be attached (and thus marked as active).
Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
will still be attached (and thus marked as active).

However, any operation from a disabled command set will result in an
Invalid Command Opcode.

Add an attached struct member for struct NvmeNamespace,
so that we lay the foundation for namespace attachment
support. Also implement logic in the new CNS values to
include/exclude namespaces based on this new struct member.
The only thing missing is hooking up the actual Namespace Attachment
command opcode, which allows a user to toggle the attached
variable per namespace. The reason for not hooking up this
command completely is that the NVMe specification
requires that the namespace management command is supported
if the namespace attachment command is supported.


> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.h   |  1 +
>  hw/block/nvme.c      | 60 ++++++++++++++++++++++++++++++++++++++------
>  include/block/nvme.h | 20 +++++++++------
>  3 files changed, 65 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index cca23bc0b3..acdb76f058 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -22,6 +22,7 @@
>  typedef struct NvmeNamespaceParams {
>      uint32_t nsid;
>      uint8_t  csi;
> +    bool     attached;
>      QemuUUID uuid;
>  } NvmeNamespaceParams;
>  
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 4ec1ddc90a..63ad03d6d6 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c

We need to add an additional check in nvme_io_cmd()
that returns Invalid Command Opcode when CC.CSS == Admin only.
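
Something like this at the top of nvme_io_cmd(), as a sketch (assuming
the series names the 111b CC.CSS value CSS_ADMIN_ONLY):

    if (NVME_CC_CSS(n->bar.cc) == CSS_ADMIN_ONLY) {
        return NVME_INVALID_OPCODE | NVME_DNR;
    }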

> @@ -1523,7 +1523,8 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
>      return NVME_INVALID_FIELD | NVME_DNR;
>  }
>  
> -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
> +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
> +                                 bool only_active)
>  {
>      NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> @@ -1540,11 +1541,16 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
>          return nvme_rpt_empty_id_struct(n, req);
>      }
>  
> +    if (only_active && !ns->params.attached) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
> +
>      return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
>                      DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
> +static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
> +                                     bool only_active)
>  {
>      NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> @@ -1561,6 +1567,10 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
>          return nvme_rpt_empty_id_struct(n, req);
>      }
>  
> +    if (only_active && !ns->params.attached) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
> +
>      if (c->csi == NVME_CSI_NVM) {
>          return nvme_rpt_empty_id_struct(n, req);
>      }
> @@ -1568,7 +1578,8 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
>      return NVME_INVALID_FIELD | NVME_DNR;
>  }
>  
> -static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
> +static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req,
> +                                     bool only_active)
>  {
>      NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> @@ -1598,6 +1609,9 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
>          if (ns->params.nsid < min_nsid) {
>              continue;
>          }
> +        if (only_active && !ns->params.attached) {
> +            continue;
> +        }
>          list_ptr[j++] = cpu_to_le32(ns->params.nsid);
>          if (j == data_len / sizeof(uint32_t)) {
>              break;
> @@ -1607,7 +1621,8 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
>      return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
> +static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
> +                                         bool only_active)
>  {
>      NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> @@ -1631,6 +1646,9 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
>          if (ns->params.nsid < min_nsid) {
>              continue;
>          }
> +        if (only_active && !ns->params.attached) {
> +            continue;
> +        }
>          list_ptr[j++] = cpu_to_le32(ns->params.nsid);
>          if (j == data_len / sizeof(uint32_t)) {
>              break;
> @@ -1700,17 +1718,25 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
>  
>      switch (le32_to_cpu(c->cns)) {
>      case NVME_ID_CNS_NS:
> -        return nvme_identify_ns(n, req);
> +        return nvme_identify_ns(n, req, true);
>      case NVME_ID_CNS_CS_NS:
> -        return nvme_identify_ns_csi(n, req);
> +        return nvme_identify_ns_csi(n, req, true);
> +    case NVME_ID_CNS_NS_PRESENT:
> +        return nvme_identify_ns(n, req, false);
> +    case NVME_ID_CNS_CS_NS_PRESENT:
> +        return nvme_identify_ns_csi(n, req, false);
>      case NVME_ID_CNS_CTRL:
>          return nvme_identify_ctrl(n, req);
>      case NVME_ID_CNS_CS_CTRL:
>          return nvme_identify_ctrl_csi(n, req);
>      case NVME_ID_CNS_NS_ACTIVE_LIST:
> -        return nvme_identify_nslist(n, req);
> +        return nvme_identify_nslist(n, req, true);
>      case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
> -        return nvme_identify_nslist_csi(n, req);
> +        return nvme_identify_nslist_csi(n, req, true);
> +    case NVME_ID_CNS_NS_PRESENT_LIST:
> +        return nvme_identify_nslist(n, req, false);
> +    case NVME_ID_CNS_CS_NS_PRESENT_LIST:
> +        return nvme_identify_nslist_csi(n, req, false);
>      case NVME_ID_CNS_NS_DESCR_LIST:
>          return nvme_identify_ns_descr_list(n, req);
>      case NVME_ID_CNS_IO_COMMAND_SET:
> @@ -2188,8 +2214,10 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>  
>  static int nvme_start_ctrl(NvmeCtrl *n)
>  {
> +    NvmeNamespace *ns;
>      uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
>      uint32_t page_size = 1 << page_bits;
> +    int i;
>  
>      if (unlikely(n->cq[0])) {
>          trace_pci_nvme_err_startfail_cq();
> @@ -2276,6 +2304,22 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>      nvme_init_sq(&n->admin_sq, n, n->bar.asq, 0, 0,
>                   NVME_AQA_ASQS(n->bar.aqa) + 1);
>  
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
> +        ns->params.attached = false;
> +        switch (ns->params.csi) {
> +        case NVME_CSI_NVM:
> +            if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
> +                NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> +                ns->params.attached = true;
> +            }
> +            break;
> +        }
> +    }
> +

Considering that the controller doesn't attach/detach
namespaces belonging to command sets that it doesn't
support, I think a nicer way is to remove this for-loop
and instead always set attached = true in
nvme_ns_setup() or nvme_ns_init(). (We currently don't
support the Namespace Attachment command.)

The person who implements the last piece of namespace
management and namespace attachment will have to deal
with reading "attached" from some kind of persistent state
and setting it accordingly.
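
Something like this at the end of nvme_ns_init() would do (an
untested sketch; it assumes the "attached" flag stays in
NvmeNamespaceParams, as in the current patch):

/*
 * No Namespace Attachment command support yet, so a configured
 * namespace is always considered attached.
 */
ns->params.attached = true;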

>      nvme_set_timestamp(n, 0ULL);
>  
>      QTAILQ_INIT(&n->aer_queue);
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 4587311783..b182fe40b2 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -804,14 +804,18 @@ typedef struct QEMU_PACKED NvmePSD {
>  #define NVME_IDENTIFY_DATA_SIZE 4096
>  
>  enum NvmeIdCns {
> -    NVME_ID_CNS_NS                = 0x00,
> -    NVME_ID_CNS_CTRL              = 0x01,
> -    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x02,
> -    NVME_ID_CNS_NS_DESCR_LIST     = 0x03,
> -    NVME_ID_CNS_CS_NS             = 0x05,
> -    NVME_ID_CNS_CS_CTRL           = 0x06,
> -    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
> -    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
> +    NVME_ID_CNS_NS                    = 0x00,
> +    NVME_ID_CNS_CTRL                  = 0x01,
> +    NVME_ID_CNS_NS_ACTIVE_LIST        = 0x02,
> +    NVME_ID_CNS_NS_DESCR_LIST         = 0x03,
> +    NVME_ID_CNS_CS_NS                 = 0x05,
> +    NVME_ID_CNS_CS_CTRL               = 0x06,
> +    NVME_ID_CNS_CS_NS_ACTIVE_LIST     = 0x07,
> +    NVME_ID_CNS_NS_PRESENT_LIST       = 0x10,
> +    NVME_ID_CNS_NS_PRESENT            = 0x11,
> +    NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
> +    NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
> +    NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
>  };
>  
>  typedef struct QEMU_PACKED NvmeIdCtrl {
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-28  2:35 ` [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
                     ` (2 preceding siblings ...)
  2020-09-30  5:59   ` Klaus Jensen
@ 2020-09-30 14:50   ` Niklas Cassel
  2020-09-30 18:23     ` Klaus Jensen
  2020-10-04 23:57     ` Dmitry Fomichev
  2020-09-30 15:12   ` Niklas Cassel
  4 siblings, 2 replies; 46+ messages in thread
From: Niklas Cassel @ 2020-09-30 14:50 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Klaus Jensen,
	qemu-devel, Maxim Levitsky, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in the Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as part of the Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit, and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c         |   2 +-
>  hw/block/nvme-ns.c   | 185 ++++++++-
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -333,7 +333,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
>  {
>      uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
>      if (status) {
> -        trace_nvme_error(le32_to_cpu(c->result),
> +        trace_nvme_error(le32_to_cpu(c->result32),
>                           le16_to_cpu(c->sq_head),
>                           le16_to_cpu(c->sq_id),
>                           le16_to_cpu(c->cid),
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 31b7f986c3..6d9dc9205b 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -33,14 +33,14 @@ static void nvme_ns_init(NvmeNamespace *ns)
>      NvmeIdNs *id_ns = &ns->id_ns;
>  
>      if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
> -        ns->id_ns.dlfeat = 0x9;
> +        ns->id_ns.dlfeat = 0x8;

You seem to be changing something here that is NVM namespace specific.
Why? (DLFEAT bits 2:0 report the read behavior for deallocated logical
blocks, so going from 0x9 to 0x8 drops the "deallocated blocks read as
all zeros" indication.) If this change is indeed needed, I assume it
should go in a separate patch.

>      }
>  
>      id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
>  
>      id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
>  
> -    ns->params.csi = NVME_CSI_NVM;
> +    ns->csi = NVME_CSI_NVM;
>      qemu_uuid_generate(&ns->params.uuid); /* TODO make UUIDs persistent */
>  
>      /* no thin provisioning */
> @@ -73,7 +73,162 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>      }
>  
>      lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> -    ns->id_ns.lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);
> +    ns->id_ns.lbaf[lba_index].ds = 31 - clz32(ns->blkconf.logical_block_size);
> +
> +    return 0;
> +}
> +
> +/*
> + * Add a zone to the tail of a zone list.
> + */
> +void nvme_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
> +{
> +    uint32_t idx = (uint32_t)(zone - ns->zone_array);
> +
> +    assert(nvme_zone_not_in_list(zone));
> +
> +    if (!zl->size) {
> +        zl->head = zl->tail = idx;
> +        zone->next = zone->prev = NVME_ZONE_LIST_NIL;
> +    } else {
> +        ns->zone_array[zl->tail].next = idx;
> +        zone->prev = zl->tail;
> +        zone->next = NVME_ZONE_LIST_NIL;
> +        zl->tail = idx;
> +    }
> +    zl->size++;
> +}
> +
> +/*
> + * Remove a zone from a zone list. The zone must be linked in the list.
> + */
> +void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
> +{
> +    uint32_t idx = (uint32_t)(zone - ns->zone_array);
> +
> +    assert(!nvme_zone_not_in_list(zone));
> +
> +    --zl->size;
> +    if (zl->size == 0) {
> +        zl->head = NVME_ZONE_LIST_NIL;
> +        zl->tail = NVME_ZONE_LIST_NIL;
> +    } else if (idx == zl->head) {
> +        zl->head = zone->next;
> +        ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
> +    } else if (idx == zl->tail) {
> +        zl->tail = zone->prev;
> +        ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
> +    } else {
> +        ns->zone_array[zone->next].prev = zone->prev;
> +        ns->zone_array[zone->prev].next = zone->next;
> +    }
> +
> +    zone->prev = zone->next = 0;
> +}
> +
> +static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
> +{
> +    uint64_t zone_size, zone_cap;
> +    uint32_t nz, lbasz = ns->blkconf.logical_block_size;
> +
> +    if (ns->params.zone_size_mb) {
> +        zone_size = ns->params.zone_size_mb;
> +    } else {
> +        zone_size = NVME_DEFAULT_ZONE_SIZE;
> +    }
> +    if (ns->params.zone_capacity_mb) {
> +        zone_cap = ns->params.zone_capacity_mb;
> +    } else {
> +        zone_cap = zone_size;
> +    }
> +    ns->zone_size = zone_size * MiB / lbasz;
> +    ns->zone_capacity = zone_cap * MiB / lbasz;
> +    if (ns->zone_capacity > ns->zone_size) {
> +        error_setg(errp, "zone capacity exceeds zone size");
> +        return -1;
> +    }
> +
> +    nz = DIV_ROUND_UP(ns->size / lbasz, ns->zone_size);
> +    ns->num_zones = nz;
> +    ns->zone_array_size = sizeof(NvmeZone) * nz;
> +    ns->zone_size_log2 = 0;
> +    if (is_power_of_2(ns->zone_size)) {
> +        ns->zone_size_log2 = 63 - clz64(ns->zone_size);
> +    }
> +
> +    return 0;
> +}
> +
> +static void nvme_init_zone_meta(NvmeNamespace *ns)
> +{
> +    uint64_t start = 0, zone_size = ns->zone_size;
> +    uint64_t capacity = ns->num_zones * zone_size;
> +    NvmeZone *zone;
> +    int i;
> +
> +    ns->zone_array = g_malloc0(ns->zone_array_size);
> +    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
> +    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
> +    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
> +    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
> +
> +    nvme_init_zone_list(ns->exp_open_zones);
> +    nvme_init_zone_list(ns->imp_open_zones);
> +    nvme_init_zone_list(ns->closed_zones);
> +    nvme_init_zone_list(ns->full_zones);
> +
> +    zone = ns->zone_array;
> +    for (i = 0; i < ns->num_zones; i++, zone++) {
> +        if (start + zone_size > capacity) {
> +            zone_size = capacity - start;
> +        }
> +        zone->d.zt = NVME_ZONE_TYPE_SEQ_WRITE;
> +        nvme_set_zone_state(zone, NVME_ZONE_STATE_EMPTY);
> +        zone->d.za = 0;
> +        zone->d.zcap = ns->zone_capacity;
> +        zone->d.zslba = start;
> +        zone->d.wp = start;
> +        zone->w_ptr = start;
> +        zone->prev = 0;
> +        zone->next = 0;
> +        start += zone_size;
> +    }
> +}
> +
> +static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
> +                              Error **errp)
> +{
> +    NvmeIdNsZoned *id_ns_z;
> +
> +    if (n->params.fill_pattern == 0) {
> +        ns->id_ns.dlfeat |= 0x01;
> +    } else if (n->params.fill_pattern == 0xff) {
> +        ns->id_ns.dlfeat |= 0x02;
> +    }
> +
> +    if (nvme_calc_zone_geometry(ns, errp) != 0) {
> +        return -1;
> +    }
> +
> +    nvme_init_zone_meta(ns);
> +
> +    id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
> +
> +    /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
> +    id_ns_z->mar = 0xffffffff;
> +    id_ns_z->mor = 0xffffffff;
> +    id_ns_z->zoc = 0;
> +    id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
> +
> +    id_ns_z->lbafe[lba_index].zsze = cpu_to_le64(ns->zone_size);
> +    id_ns_z->lbafe[lba_index].zdes = 0; /* FIXME make helper */
> +
> +    ns->csi = NVME_CSI_ZONED;
> +    ns->id_ns.ncap = cpu_to_le64(ns->zone_capacity * ns->num_zones);
> +    ns->id_ns.nuse = ns->id_ns.ncap;
> +    ns->id_ns.nsze = ns->id_ns.ncap;
> +
> +    ns->id_ns_zoned = id_ns_z;
>  
>      return 0;
>  }
> @@ -103,6 +258,12 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>          return -1;
>      }
>  
> +    if (ns->params.zoned) {
> +        if (nvme_zoned_init_ns(n, ns, 0, errp) != 0) {
> +            return -1;
> +        }
> +    }
> +
>      return 0;
>  }
>  
> @@ -116,6 +277,16 @@ void nvme_ns_flush(NvmeNamespace *ns)
>      blk_flush(ns->blkconf.blk);
>  }
>  
> +void nvme_ns_cleanup(NvmeNamespace *ns)
> +{
> +    g_free(ns->id_ns_zoned);
> +    g_free(ns->zone_array);
> +    g_free(ns->exp_open_zones);
> +    g_free(ns->imp_open_zones);
> +    g_free(ns->closed_zones);
> +    g_free(ns->full_zones);
> +}
> +
>  static void nvme_ns_realize(DeviceState *dev, Error **errp)
>  {
>      NvmeNamespace *ns = NVME_NS(dev);
> @@ -133,6 +304,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
>  static Property nvme_ns_props[] = {
>      DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
>      DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
> +
> +    DEFINE_PROP_BOOL("zoned", NvmeNamespace, params.zoned, false),
> +    DEFINE_PROP_UINT64("zone_size", NvmeNamespace, params.zone_size_mb,
> +                       NVME_DEFAULT_ZONE_SIZE),
> +    DEFINE_PROP_UINT64("zone_capacity", NvmeNamespace,
> +                       params.zone_capacity_mb, 0),
> +    DEFINE_PROP_BOOL("cross_zone_read", NvmeNamespace,
> +                      params.cross_zone_read, false),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 04172f083e..daa13546c4 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -38,7 +38,6 @@ typedef struct NvmeZoneList {
>  
>  typedef struct NvmeNamespaceParams {
>      uint32_t nsid;
> -    uint8_t  csi;
>      bool     attached;
>      QemuUUID uuid;
>  
> @@ -52,6 +51,7 @@ typedef struct NvmeNamespace {
>      DeviceState  parent_obj;
>      BlockConf    blkconf;
>      int32_t      bootindex;
> +    uint8_t      csi;
>      int64_t      size;
>      NvmeIdNs     id_ns;
>  
> @@ -107,6 +107,7 @@ typedef struct NvmeCtrl NvmeCtrl;
>  int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
>  void nvme_ns_drain(NvmeNamespace *ns);
>  void nvme_ns_flush(NvmeNamespace *ns);
> +void nvme_ns_cleanup(NvmeNamespace *ns);
>  
>  static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
>  {
> @@ -188,4 +189,7 @@ static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
>      return &ns->zone_array[z->next];
>  }
>  
> +void nvme_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone);
> +void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone);
> +
>  #endif /* NVME_NS_H */
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 63ad03d6d6..38e25a4d1f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -54,6 +54,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qemu/error-report.h"
> +#include "crypto/random.h"
>  #include "hw/block/block.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pci.h"
> @@ -127,6 +128,46 @@ static uint16_t nvme_sqid(NvmeRequest *req)
>      return le16_to_cpu(req->sq->sqid);
>  }
>  
> +static void nvme_assign_zone_state(NvmeNamespace *ns, NvmeZone *zone,
> +                                   uint8_t state)
> +{
> +    if (!nvme_zone_not_in_list(zone)) {
> +        switch (nvme_get_zone_state(zone)) {
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +            nvme_remove_zone(ns, ns->exp_open_zones, zone);
> +            break;
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +            nvme_remove_zone(ns, ns->imp_open_zones, zone);
> +            break;
> +        case NVME_ZONE_STATE_CLOSED:
> +            nvme_remove_zone(ns, ns->closed_zones, zone);
> +            break;
> +        case NVME_ZONE_STATE_FULL:
> +            nvme_remove_zone(ns, ns->full_zones, zone);
> +        }
> +   }
> +
> +    nvme_set_zone_state(zone, state);
> +
> +    switch (state) {
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        nvme_add_zone_tail(ns, ns->exp_open_zones, zone);
> +        break;
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_add_zone_tail(ns, ns->imp_open_zones, zone);
> +        break;
> +    case NVME_ZONE_STATE_CLOSED:
> +        nvme_add_zone_tail(ns, ns->closed_zones, zone);
> +        break;
> +    case NVME_ZONE_STATE_FULL:
> +        nvme_add_zone_tail(ns, ns->full_zones, zone);
> +    case NVME_ZONE_STATE_READ_ONLY:
> +        break;
> +    default:
> +        zone->d.za = 0;
> +    }
> +}
> +
>  static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
>  {
>      hwaddr low = n->ctrl_mem.addr;
> @@ -813,7 +854,7 @@ static void nvme_process_aers(void *opaque)
>  
>          req = n->aer_reqs[n->outstanding_aers];
>  
> -        result = (NvmeAerResult *) &req->cqe.result;
> +        result = (NvmeAerResult *) &req->cqe.result32;
>          result->event_type = event->result.event_type;
>          result->event_info = event->result.event_info;
>          result->log_page = event->result.log_page;
> @@ -882,6 +923,200 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
>      return NVME_SUCCESS;
>  }
>  
> +static void nvme_fill_data(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t offset,
> +                           uint32_t max_len, uint8_t pattern)
> +{
> +    ScatterGatherEntry *entry;
> +    uint32_t len, ent_len;
> +
> +    if (qsg->nsg > 0) {
> +        entry = qsg->sg;
> +        len = qsg->size;
> +        if (max_len) {
> +            len = MIN(len, max_len);
> +        }
> +        for (; len > 0; len -= ent_len) {
> +            ent_len = MIN(len, entry->len);
> +            if (offset > ent_len) {
> +                offset -= ent_len;
> +            } else if (offset != 0) {
> +                dma_memory_set(qsg->as, entry->base + offset,
> +                               pattern, ent_len - offset);
> +                offset = 0;
> +            } else {
> +                dma_memory_set(qsg->as, entry->base, pattern, ent_len);
> +            }
> +            entry++;
> +        }
> +    } else if (iov->iov) {
> +        len = iov_size(iov->iov, iov->niov);
> +        if (max_len) {
> +            len = MIN(len, max_len);
> +        }
> +        qemu_iovec_memset(iov, offset, pattern, len - offset);
> +    }
> +}
> +
> +static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
> +                                      uint32_t nlb)
> +{
> +    uint16_t status;
> +
> +    if (unlikely((slba + nlb) > nvme_zone_wr_boundary(zone))) {
> +        return NVME_ZONE_BOUNDARY_ERROR;
> +    }
> +
> +    switch (nvme_get_zone_state(zone)) {
> +    case NVME_ZONE_STATE_EMPTY:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_CLOSED:
> +        status = NVME_SUCCESS;
> +        break;
> +    case NVME_ZONE_STATE_FULL:
> +        status = NVME_ZONE_FULL;
> +        break;
> +    case NVME_ZONE_STATE_OFFLINE:
> +        status = NVME_ZONE_OFFLINE;
> +        break;
> +    case NVME_ZONE_STATE_READ_ONLY:
> +        status = NVME_ZONE_READ_ONLY;
> +        break;
> +    default:
> +        assert(false);
> +    }
> +    return status;
> +}
> +
> +static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
> +                                     uint64_t slba, uint32_t nlb)
> +{
> +    uint64_t lba = slba, count;
> +    uint16_t status;
> +    uint8_t zs;
> +
> +    do {
> +        if (!ns->params.cross_zone_read &&
> +            (lba + nlb > nvme_zone_rd_boundary(ns, zone))) {
> +            return NVME_ZONE_BOUNDARY_ERROR | NVME_DNR;
> +        }
> +
> +        zs = nvme_get_zone_state(zone);
> +        switch (zs) {
> +        case NVME_ZONE_STATE_EMPTY:
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_FULL:
> +        case NVME_ZONE_STATE_CLOSED:
> +        case NVME_ZONE_STATE_READ_ONLY:
> +            status = NVME_SUCCESS;
> +            break;
> +        case NVME_ZONE_STATE_OFFLINE:
> +            status = NVME_ZONE_OFFLINE | NVME_DNR;
> +            break;
> +        default:
> +            assert(false);
> +        }
> +        if (status != NVME_SUCCESS) {
> +            break;
> +        }
> +
> +        if (lba + nlb > nvme_zone_rd_boundary(ns, zone)) {
> +            count = nvme_zone_rd_boundary(ns, zone) - lba;
> +        } else {
> +            count = nlb;
> +        }
> +
> +        lba += count;
> +        nlb -= count;
> +        zone++;
> +    } while (nlb);
> +
> +    return status;
> +}
> +
> +static inline uint32_t nvme_zone_idx(NvmeNamespace *ns, uint64_t slba)
> +{
> +    return ns->zone_size_log2 > 0 ? slba >> ns->zone_size_log2 :
> +                                    slba / ns->zone_size;
> +}
> +
> +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> +                                      bool failed)
> +{
> +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +    NvmeZone *zone;
> +    uint64_t slba, start_wp = req->cqe.result64;
> +    uint32_t nlb, zone_idx;
> +    uint8_t zs;
> +
> +    if (rw->opcode != NVME_CMD_WRITE &&
> +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> +        return false;
> +    }
> +
> +    slba = le64_to_cpu(rw->slba);
> +    nlb = le16_to_cpu(rw->nlb) + 1;
> +    zone_idx = nvme_zone_idx(ns, slba);
> +    assert(zone_idx < ns->num_zones);
> +    zone = &ns->zone_array[zone_idx];
> +
> +    if (!failed && zone->w_ptr < start_wp + nlb) {
> +        /*
> +         * A preceding queued write to the zone has failed,
> +         * now this write is not at the WP, fail it too.
> +         */
> +        failed = true;
> +    }
> +
> +    if (failed) {
> +        if (zone->w_ptr > start_wp) {
> +            zone->w_ptr = start_wp;
> +        }
> +        req->cqe.result64 = 0;
> +    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
> +        zs = nvme_get_zone_state(zone);
> +        switch (zs) {
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_CLOSED:
> +        case NVME_ZONE_STATE_EMPTY:
> +            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
> +            /* fall through */
> +        case NVME_ZONE_STATE_FULL:
> +            break;
> +        default:
> +            assert(false);
> +        }
> +        zone->d.wp = zone->w_ptr;
> +    } else {
> +        zone->d.wp += nlb;
> +    }
> +
> +    return failed;
> +}
> +
> +static uint64_t nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
> +                                     uint32_t nlb)
> +{
> +    uint64_t result = zone->w_ptr;
> +    uint8_t zs;
> +
> +    zone->w_ptr += nlb;
> +
> +    if (zone->w_ptr < nvme_zone_wr_boundary(zone)) {
> +        zs = nvme_get_zone_state(zone);
> +        switch (zs) {
> +        case NVME_ZONE_STATE_EMPTY:
> +        case NVME_ZONE_STATE_CLOSED:
> +            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_IMPLICITLY_OPEN);
> +        }
> +    }
> +
> +    return result;
> +}
> +
>  static void nvme_rw_cb(void *opaque, int ret)
>  {
>      NvmeRequest *req = opaque;
> @@ -896,10 +1131,27 @@ static void nvme_rw_cb(void *opaque, int ret)
>      trace_pci_nvme_rw_cb(nvme_cid(req), blk_name(blk));
>  
>      if (!ret) {
> -        block_acct_done(stats, acct);
> +        if (ns->params.zoned) {
> +            if (nvme_finalize_zoned_write(ns, req, false)) {
> +                ret = EIO;
> +                block_acct_failed(stats, acct);
> +                req->status = NVME_ZONE_INVALID_WRITE;
> +            } else if (req->fill_ofs >= 0) {
> +                nvme_fill_data(&req->qsg, &req->iov, req->fill_ofs,
> +                               req->fill_len,
> +                               nvme_ctrl(req)->params.fill_pattern);
> +            }
> +        }
> +        if (!ret) {
> +            block_acct_done(stats, acct);
> +        }
>      } else {
>          uint16_t status;
>  
> +        if (ns->params.zoned) {
> +            nvme_finalize_zoned_write(ns, req, true);
> +        }
> +
>          block_acct_failed(stats, acct);
>  
>          switch (req->cmd.opcode) {
> @@ -953,6 +1205,7 @@ static uint16_t nvme_do_aio(BlockBackend *blk, int64_t offset, size_t len,
>          break;
>  
>      case NVME_CMD_WRITE:
> +    case NVME_CMD_ZONE_APPEND:
>          is_write = true;
>  
>          /* fallthrough */
> @@ -997,8 +1250,10 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
>      NvmeNamespace *ns = req->ns;
>      uint64_t slba = le64_to_cpu(rw->slba);
>      uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
> +    NvmeZone *zone = NULL;
>      uint64_t offset = nvme_l2b(ns, slba);
>      uint32_t count = nvme_l2b(ns, nlb);
> +    uint32_t zone_idx;
>      uint16_t status;
>  
>      trace_pci_nvme_write_zeroes(nvme_cid(req), nvme_nsid(ns), slba, nlb);
> @@ -1009,20 +1264,43 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
>          return status;
>      }
>  
> +    if (ns->params.zoned) {
> +        zone_idx = nvme_zone_idx(ns, slba);
> +        assert(zone_idx < ns->num_zones);
> +        zone = &ns->zone_array[zone_idx];
> +
> +        status = nvme_check_zone_write(zone, slba, nlb);
> +        if (status != NVME_SUCCESS) {
> +            trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
> +            return status | NVME_DNR;
> +        }
> +
> +        assert(nvme_wp_is_valid(zone));
> +        if (unlikely(slba != zone->w_ptr)) {
> +            trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
> +                                               zone->w_ptr);
> +            return NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +        }
> +
> +        req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
> +    }
> +
>      return nvme_do_aio(ns->blkconf.blk, offset, count, req);
>  }
>  
> -static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
> +static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
>  {
>      NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
>      NvmeNamespace *ns = req->ns;
> -    uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
> +    uint32_t nlb  = le32_to_cpu(rw->nlb) + 1;
>      uint64_t slba = le64_to_cpu(rw->slba);
> -
>      uint64_t data_size = nvme_l2b(ns, nlb);
> -    uint64_t data_offset = nvme_l2b(ns, slba);
> -    enum BlockAcctType acct = req->cmd.opcode == NVME_CMD_WRITE ?
> -        BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
> +    uint64_t data_offset;
> +
> +    NvmeZone *zone = NULL;
> +    uint32_t zone_idx = 0;
> +    bool is_write = rw->opcode == NVME_CMD_WRITE || append;
> +    enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
>      uint16_t status;
>  
>      trace_pci_nvme_rw(nvme_cid(req), nvme_io_opc_str(rw->opcode),
> @@ -1040,18 +1318,468 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
>          goto invalid;
>      }
>  
> +    if (ns->params.zoned) {
> +        zone_idx = nvme_zone_idx(ns, slba);
> +        assert(zone_idx < ns->num_zones);
> +        zone = &ns->zone_array[zone_idx];
> +
> +        if (is_write) {
> +            status = nvme_check_zone_write(zone, slba, nlb);
> +            if (status != NVME_SUCCESS) {
> +                trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
> +                goto invalid;
> +            }
> +
> +            assert(nvme_wp_is_valid(zone));
> +            if (append) {
> +                if (unlikely(slba != zone->d.zslba)) {
> +                    trace_pci_nvme_err_append_not_at_start(slba, zone->d.zslba);
> +                    status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +                    goto invalid;
> +                }
> +                if (data_size > (n->page_size << n->zasl)) {
> +                    trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
> +                    status = NVME_INVALID_FIELD | NVME_DNR;
> +                    goto invalid;
> +                }
> +                slba = zone->w_ptr;
> +            } else if (unlikely(slba != zone->w_ptr)) {
> +                trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
> +                                                   zone->w_ptr);
> +                status = NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +                goto invalid;
> +            }
> +            req->fill_ofs = -1LL;
> +        } else {
> +            status = nvme_check_zone_read(ns, zone, slba, nlb);
> +            if (status != NVME_SUCCESS) {
> +                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
> +                goto invalid;
> +            }
> +
> +            if (slba + nlb > zone->w_ptr) {
> +                /*
> +                 * All or some data is read above the WP. Need to
> +                 * fill out the buffer area that has no backing data
> +                 * with a predefined data pattern (zeros by default)
> +                 */
> +                if (slba >= zone->w_ptr) {
> +                    req->fill_ofs = 0;
> +                } else {
> +                    req->fill_ofs = nvme_l2b(ns, zone->w_ptr - slba);
> +                }
> +                req->fill_len = nvme_l2b(ns,
> +                    nvme_zone_rd_boundary(ns, zone) - slba);
> +            } else {
> +                req->fill_ofs = -1LL;
> +            }
> +        }
> +    } else if (append) {
> +        trace_pci_nvme_err_invalid_opc(rw->opcode);
> +        status = NVME_INVALID_OPCODE | NVME_DNR;
> +        goto invalid;
> +    }
> +
>      status = nvme_map_dptr(n, data_size, req);
>      if (status) {
>          goto invalid;
>      }
>  
> +    if (ns->params.zoned) {
> +        if (unlikely(req->fill_ofs == 0 &&
> +            slba + nlb <= nvme_zone_rd_boundary(ns, zone))) {
> +            /* No backend I/O necessary, only need to fill the buffer */
> +            nvme_fill_data(&req->qsg, &req->iov, 0, 0, n->params.fill_pattern);
> +            req->status = NVME_SUCCESS;
> +            return NVME_SUCCESS;
> +        }
> +        if (is_write) {
> +            req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
> +        }
> +    }
> +
> +    data_offset = nvme_l2b(ns, slba);
> +
>      return nvme_do_aio(ns->blkconf.blk, data_offset, data_size, req);
>  
>  invalid:
>      block_acct_invalid(blk_get_stats(ns->blkconf.blk), acct);
> +    return status | NVME_DNR;
> +}
> +
> +static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeNamespace *ns, NvmeCmd *c,
> +                                            uint64_t *slba, uint32_t *zone_idx)
> +{
> +    uint32_t dw10 = le32_to_cpu(c->cdw10);
> +    uint32_t dw11 = le32_to_cpu(c->cdw11);
> +
> +    if (!ns->params.zoned) {
> +        trace_pci_nvme_err_invalid_opc(c->opcode);
> +        return NVME_INVALID_OPCODE | NVME_DNR;
> +    }
> +
> +    *slba = ((uint64_t)dw11) << 32 | dw10;
> +    if (unlikely(*slba >= ns->id_ns.nsze)) {
> +        trace_pci_nvme_err_invalid_lba_range(*slba, 0, ns->id_ns.nsze);
> +        *slba = 0;
> +        return NVME_LBA_RANGE | NVME_DNR;
> +    }
> +
> +    *zone_idx = nvme_zone_idx(ns, *slba);
> +    assert(*zone_idx < ns->num_zones);
> +
> +    return NVME_SUCCESS;
> +}
> +
> +static uint16_t nvme_open_zone(NvmeNamespace *ns, NvmeZone *zone,
> +                               uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_EMPTY:
> +    case NVME_ZONE_STATE_CLOSED:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
> +        /* fall through */
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_open_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_CLOSED;
> +}
> +
> +static uint16_t nvme_close_zone(NvmeNamespace *ns, NvmeZone *zone,
> +                                uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
> +        /* fall through */
> +    case NVME_ZONE_STATE_CLOSED:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_close_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_EXPLICITLY_OPEN;
> +}
> +
> +static uint16_t nvme_finish_zone(NvmeNamespace *ns, NvmeZone *zone,
> +                                 uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_CLOSED:
> +    case NVME_ZONE_STATE_EMPTY:
> +        zone->w_ptr = nvme_zone_wr_boundary(zone);
> +        zone->d.wp = zone->w_ptr;
> +        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
> +        /* fall through */
> +    case NVME_ZONE_STATE_FULL:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_finish_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_CLOSED;
> +}
> +
> +static uint16_t nvme_reset_zone(NvmeNamespace *ns, NvmeZone *zone,
> +                                uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_CLOSED:
> +    case NVME_ZONE_STATE_FULL:
> +        zone->w_ptr = zone->d.zslba;
> +        zone->d.wp = zone->w_ptr;
> +        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_EMPTY);
> +        /* fall through */
> +    case NVME_ZONE_STATE_EMPTY:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_reset_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_CLOSED ||
> +           state == NVME_ZONE_STATE_FULL;
> +}
> +
> +static uint16_t nvme_offline_zone(NvmeNamespace *ns, NvmeZone *zone,
> +                                  uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_READ_ONLY:
> +        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_OFFLINE);
> +        /* fall through */
> +    case NVME_ZONE_STATE_OFFLINE:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_offline_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_READ_ONLY;
> +}
> +
> +typedef uint16_t (*op_handler_t)(NvmeNamespace *, NvmeZone *,
> +                                 uint8_t);
> +typedef bool (*need_to_proc_zone_t)(uint8_t);
> +
> +static uint16_t name_do_zone_op(NvmeNamespace *ns, NvmeZone *zone,
> +                                uint8_t state, bool all,
> +                                op_handler_t op_hndlr,
> +                                need_to_proc_zone_t proc_zone)
> +{
> +    int i;
> +    uint16_t status = 0;
> +
> +    if (!all) {
> +        status = op_hndlr(ns, zone, state);
> +    } else {
> +        for (i = 0; i < ns->num_zones; i++, zone++) {
> +            state = nvme_get_zone_state(zone);
> +            if (proc_zone(state)) {
> +                status = op_hndlr(ns, zone, state);
> +                if (status != NVME_SUCCESS) {
> +                    break;
> +                }
> +            }
> +        }
> +    }
> +
>      return status;
>  }
>  
> +static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
> +    NvmeNamespace *ns = req->ns;
> +    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint64_t slba = 0;
> +    uint32_t zone_idx = 0;
> +    uint16_t status;
> +    uint8_t action, state;
> +    bool all;
> +    NvmeZone *zone;
> +
> +    action = dw13 & 0xff;
> +    all = dw13 & 0x100;
> +
> +    req->status = NVME_SUCCESS;
> +
> +    if (!all) {
> +        status = nvme_get_mgmt_zone_slba_idx(ns, cmd, &slba, &zone_idx);
> +        if (status) {
> +            return status;
> +        }
> +    }
> +
> +    zone = &ns->zone_array[zone_idx];
> +    if (slba != zone->d.zslba) {
> +        trace_pci_nvme_err_unaligned_zone_cmd(action, slba, zone->d.zslba);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +    state = nvme_get_zone_state(zone);
> +
> +    switch (action) {
> +
> +    case NVME_ZONE_ACTION_OPEN:
> +        trace_pci_nvme_open_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(ns, zone, state, all,
> +                                 nvme_open_zone, nvme_cond_open_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_CLOSE:
> +        trace_pci_nvme_close_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(ns, zone, state, all,
> +                                 nvme_close_zone, nvme_cond_close_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_FINISH:
> +        trace_pci_nvme_finish_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(ns, zone, state, all,
> +                                 nvme_finish_zone, nvme_cond_finish_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_RESET:
> +        trace_pci_nvme_reset_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(ns, zone, state, all,
> +                                 nvme_reset_zone, nvme_cond_reset_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_OFFLINE:
> +        trace_pci_nvme_offline_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(ns, zone, state, all,
> +                                 nvme_offline_zone, nvme_cond_offline_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_SET_ZD_EXT:
> +        trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +        break;
> +
> +    default:
> +        trace_pci_nvme_err_invalid_mgmt_action(action);
> +        status = NVME_INVALID_FIELD;
> +    }
> +
> +    if (status == NVME_ZONE_INVAL_TRANSITION) {
> +        trace_pci_nvme_err_invalid_zone_state_transition(state, action, slba,
> +                                                         zone->d.za);
> +    }
> +    if (status) {
> +        status |= NVME_DNR;
> +    }
> +
> +    return status;
> +}
> +
> +static bool nvme_zone_matches_filter(uint32_t zafs, NvmeZone *zl)
> +{
> +    int zs = nvme_get_zone_state(zl);
> +
> +    switch (zafs) {
> +    case NVME_ZONE_REPORT_ALL:
> +        return true;
> +    case NVME_ZONE_REPORT_EMPTY:
> +        return zs == NVME_ZONE_STATE_EMPTY;
> +    case NVME_ZONE_REPORT_IMPLICITLY_OPEN:
> +        return zs == NVME_ZONE_STATE_IMPLICITLY_OPEN;
> +    case NVME_ZONE_REPORT_EXPLICITLY_OPEN:
> +        return zs == NVME_ZONE_STATE_EXPLICITLY_OPEN;
> +    case NVME_ZONE_REPORT_CLOSED:
> +        return zs == NVME_ZONE_STATE_CLOSED;
> +    case NVME_ZONE_REPORT_FULL:
> +        return zs == NVME_ZONE_STATE_FULL;
> +    case NVME_ZONE_REPORT_READ_ONLY:
> +        return zs == NVME_ZONE_STATE_READ_ONLY;
> +    case NVME_ZONE_REPORT_OFFLINE:
> +        return zs == NVME_ZONE_STATE_OFFLINE;
> +    default:
> +        return false;
> +    }
> +}
> +
> +static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
> +    NvmeNamespace *ns = req->ns;
> +    /* cdw12 is zero-based number of dwords to return. Convert to bytes */
> +    uint32_t len = (le32_to_cpu(cmd->cdw12) + 1) << 2;
> +    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint32_t zone_idx, zra, zrasf, partial;
> +    uint64_t max_zones, nr_zones = 0;
> +    uint16_t ret;
> +    uint64_t slba;
> +    NvmeZoneDescr *z;
> +    NvmeZone *zs;
> +    NvmeZoneReportHeader *header;
> +    void *buf, *buf_p;
> +    size_t zone_entry_sz;
> +
> +    req->status = NVME_SUCCESS;
> +
> +    ret = nvme_get_mgmt_zone_slba_idx(ns, cmd, &slba, &zone_idx);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    if (len < sizeof(NvmeZoneReportHeader)) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    zra = dw13 & 0xff;
> +    if (!(zra == NVME_ZONE_REPORT || zra == NVME_ZONE_REPORT_EXTENDED)) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    if (zra == NVME_ZONE_REPORT_EXTENDED) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    zrasf = (dw13 >> 8) & 0xff;
> +    if (zrasf > NVME_ZONE_REPORT_OFFLINE) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    partial = (dw13 >> 16) & 0x01;
> +
> +    zone_entry_sz = sizeof(NvmeZoneDescr);
> +
> +    max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
> +    buf = g_malloc0(len);
> +
> +    header = (NvmeZoneReportHeader *)buf;
> +    buf_p = buf + sizeof(NvmeZoneReportHeader);
> +
> +    while (zone_idx < ns->num_zones && nr_zones < max_zones) {
> +        zs = &ns->zone_array[zone_idx];
> +
> +        if (!nvme_zone_matches_filter(zrasf, zs)) {
> +            zone_idx++;
> +            continue;
> +        }
> +
> +        z = (NvmeZoneDescr *)buf_p;
> +        buf_p += sizeof(NvmeZoneDescr);
> +        nr_zones++;
> +
> +        z->zt = zs->d.zt;
> +        z->zs = zs->d.zs;
> +        z->zcap = cpu_to_le64(zs->d.zcap);
> +        z->zslba = cpu_to_le64(zs->d.zslba);
> +        z->za = zs->d.za;
> +
> +        if (nvme_wp_is_valid(zs)) {
> +            z->wp = cpu_to_le64(zs->d.wp);
> +        } else {
> +            z->wp = cpu_to_le64(~0ULL);
> +        }
> +
> +        zone_idx++;
> +    }
> +
> +    if (!partial) {
> +        for (; zone_idx < ns->num_zones; zone_idx++) {
> +            zs = &ns->zone_array[zone_idx];
> +            if (nvme_zone_matches_filter(zrasf, zs)) {
> +                nr_zones++;
> +            }
> +        }
> +    }
> +    header->nr_zones = cpu_to_le64(nr_zones);
> +
> +    ret = nvme_dma(n, (uint8_t *)buf, len, DMA_DIRECTION_FROM_DEVICE, req);
> +
> +    g_free(buf);
> +
> +    return ret;
> +}
> +
>  static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>  {
>      uint32_t nsid = le32_to_cpu(req->cmd.nsid);
> @@ -1073,9 +1801,15 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)

While you did make sure that we don't expose Zone Management Send,
Zone Management Receive and Zone Append in the cmd_effects log when
CC.CSS != CSS_CSI, we also need to make sure that we return Invalid
Command Opcode for any of those three commands if a user tries to
use them anyway (while CC.CSS != CSS_CSI).
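
A minimal guard at the top of nvme_io_cmd() could look like this
(an untested sketch; trace_pci_nvme_err_invalid_opc() is the helper
the function already uses in its default case):

switch (req->cmd.opcode) {
case NVME_CMD_ZONE_MGMT_SEND:
case NVME_CMD_ZONE_MGMT_RECV:
case NVME_CMD_ZONE_APPEND:
    /* Zoned opcodes are only valid with all command sets enabled */
    if (NVME_CC_CSS(n->bar.cc) != CSS_CSI) {
        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
        return NVME_INVALID_OPCODE | NVME_DNR;
    }
    break;
default:
    break;
}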

>          return nvme_flush(n, req);
>      case NVME_CMD_WRITE_ZEROES:
>          return nvme_write_zeroes(n, req);
> +    case NVME_CMD_ZONE_APPEND:
> +        return nvme_rw(n, req, true);
>      case NVME_CMD_WRITE:
>      case NVME_CMD_READ:
> -        return nvme_rw(n, req);
> +        return nvme_rw(n, req, false);
> +    case NVME_CMD_ZONE_MGMT_SEND:
> +        return nvme_zone_mgmt_send(n, req);
> +    case NVME_CMD_ZONE_MGMT_RECV:
> +        return nvme_zone_mgmt_recv(n, req);
>      default:
>          trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
> @@ -1301,7 +2035,7 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
>                      DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
> +static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t csi, uint32_t buf_len,
>                                   uint64_t off, NvmeRequest *req)
>  {
>      NvmeEffectsLog cmd_eff_log = {};
> @@ -1326,11 +2060,20 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
>      acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
>      acs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
>  
> -    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
> -                                  NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +    if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
> +        iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
> +                                      NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +    }
> +
> +    if (csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> +        iocs[NVME_CMD_ZONE_APPEND] = NVME_CMD_EFFECTS_CSUPP |
> +                                     NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFFECTS_CSUPP;
> +        iocs[NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFFECTS_CSUPP;
> +    }
>  
>      trans_len = MIN(sizeof(cmd_eff_log) - off, buf_len);
>  
> @@ -1349,6 +2092,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>      uint8_t  lid = dw10 & 0xff;
>      uint8_t  lsp = (dw10 >> 8) & 0xf;
>      uint8_t  rae = (dw10 >> 15) & 0x1;
> +    uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
>      uint32_t numdl, numdu;
>      uint64_t off, lpol, lpou;
>      size_t   len;
> @@ -1382,7 +2126,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>      case NVME_LOG_FW_SLOT_INFO:
>          return nvme_fw_log_info(n, len, off, req);
>      case NVME_LOG_CMD_EFFECTS:
> -        return nvme_cmd_effects(n, len, off, req);
> +        return nvme_cmd_effects(n, csi, len, off, req);
>      default:
>          trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1502,6 +2246,16 @@ static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
>      return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static inline bool nvme_csi_has_nvm_support(NvmeNamespace *ns)
> +{
> +    switch (ns->csi) {
> +    case NVME_CSI_NVM:
> +    case NVME_CSI_ZONED:
> +        return true;
> +    }
> +    return false;
> +}
> +
>  static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  {
>      trace_pci_nvme_identify_ctrl();
> @@ -1513,11 +2267,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    NvmeIdCtrlZoned id = {};
>  
>      trace_pci_nvme_identify_ctrl_csi(c->csi);
>  
>      if (c->csi == NVME_CSI_NVM) {
>          return nvme_rpt_empty_id_struct(n, req);
> +    } else if (c->csi == NVME_CSI_ZONED) {
> +        id.zasl = n->zasl;
> +        return nvme_dma(n, (uint8_t *)&id, sizeof(id),
> +                        DMA_DIRECTION_FROM_DEVICE, req);

Please read my comment on nvme_identify_nslist_csi() before reading
this comment.

At least for this function, the specification is clear:

"If the host requests a data structure for an I/O Command Set that the
controller does not support, the controller shall abort the command with
a status of Invalid Field in Command."

The controller supports an I/O command set if and only if the
corresponding Command Set bit is set in the data structure returned
by nvme_identify_cmd_set(), so here we should do something like:

} else if (c->csi == NVME_CSI_ZONED && ctrl_has_zns_namespaces()) {
	...
}
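
With the body from the quoted hunk filled in, that would be roughly
(untested; assuming the proposed helper takes the NvmeCtrl pointer):

} else if (c->csi == NVME_CSI_ZONED && ctrl_has_zns_namespaces(n)) {
    id.zasl = n->zasl;
    return nvme_dma(n, (uint8_t *)&id, sizeof(id),
                    DMA_DIRECTION_FROM_DEVICE, req);
}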

>      }
>  
>      return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1545,8 +2304,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
>          return nvme_rpt_empty_id_struct(n, req);
>      }
>  
> -    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
> -                    DMA_DIRECTION_FROM_DEVICE, req);
> +    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
> +        return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
> +                        DMA_DIRECTION_FROM_DEVICE, req);
> +    }
> +
> +    return NVME_INVALID_CMD_SET | NVME_DNR;
>  }
>  
>  static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
> @@ -1571,8 +2334,11 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
>          return nvme_rpt_empty_id_struct(n, req);
>      }
>  
> -    if (c->csi == NVME_CSI_NVM) {
> +    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
>          return nvme_rpt_empty_id_struct(n, req);
> +    } else if (c->csi == NVME_CSI_ZONED && ns->csi == NVME_CSI_ZONED) {
> +        return nvme_dma(n, (uint8_t *)ns->id_ns_zoned, sizeof(NvmeIdNsZoned),
> +                        DMA_DIRECTION_FROM_DEVICE, req);
>      }
>  
>      return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1634,7 +2400,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
>  
>      trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
>  
> -    if (c->csi != NVME_CSI_NVM) {
> +    if (c->csi != NVME_CSI_NVM && c->csi != NVME_CSI_ZONED) {

When reading the specification for CNS 07h, I think it is not clear
how this should behave...

I'm thinking of the case where c->csi == NVME_CSI_ZONED
but our QEMU model only has NVM namespaces.

Either we should return an empty list (1),
or we should return Invalid Field in Command (2).

If we decide to go with (2),
then we should probably take the code you have written in nvme_identify_cmd_set():

+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (ns && ns->params.zoned) {
+            NVME_SET_CSI(*list, NVME_CSI_ZONED);
+            break;
+        }
+    }

And move it into a ctrl_has_zns_namespaces() helper function,
and then do something like:
if (!(c->csi == NVME_CSI_NVM ||
      (ctrl_has_zns_namespaces() && c->csi == NVME_CSI_ZONED))) {
	return NVME_INVALID_FIELD | NVME_DNR;
}
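
A minimal version of that helper could be (an untested sketch; it
takes the NvmeCtrl pointer and reuses the existing nvme_ns()
accessor):

static bool ctrl_has_zns_namespaces(NvmeCtrl *n)
{
    int i;

    /* NSIDs are 1-based; scan for at least one zoned namespace */
    for (i = 1; i <= n->num_namespaces; i++) {
        NvmeNamespace *ns = nvme_ns(n, i);

        if (ns && ns->params.zoned) {
            return true;
        }
    }

    return false;
}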


>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> @@ -1643,7 +2409,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
>          if (!ns) {
>              continue;
>          }
> -        if (ns->params.nsid < min_nsid) {
> +        if (ns->params.nsid < min_nsid || c->csi != ns->csi) {
>              continue;
>          }
>          if (only_active && !ns->params.attached) {
> @@ -1696,19 +2462,29 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>      desc->nidt = NVME_NIDT_CSI;
>      desc->nidl = NVME_NIDL_CSI;
>      list_ptr += sizeof(*desc);
> -    *(uint8_t *)list_ptr = NVME_CSI_NVM;
> +    *(uint8_t *)list_ptr = ns->csi;
>  
>      return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
>  {
> +    NvmeNamespace *ns;
>      uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
>      static const int data_len = sizeof(list);
> +    int i;
>  
>      trace_pci_nvme_identify_cmd_set();
>  
>      NVME_SET_CSI(*list, NVME_CSI_NVM);
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (ns && ns->params.zoned) {
> +            NVME_SET_CSI(*list, NVME_CSI_ZONED);
> +            break;
> +        }
> +    }
> +
>      return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> @@ -1751,7 +2527,7 @@ static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
>  {
>      uint16_t sqid = le32_to_cpu(req->cmd.cdw10) & 0xffff;
>  
> -    req->cqe.result = 1;
> +    req->cqe.result32 = 1;
>      if (nvme_check_sqid(n, sqid)) {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
> @@ -1932,7 +2708,7 @@ defaults:
>      }
>  
>  out:
> -    req->cqe.result = cpu_to_le32(result);
> +    req->cqe.result32 = cpu_to_le32(result);
>      return NVME_SUCCESS;
>  }
>  
> @@ -2057,8 +2833,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
>                                      ((dw11 >> 16) & 0xFFFF) + 1,
>                                      n->params.max_ioqpairs,
>                                      n->params.max_ioqpairs);
> -        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
> -                                      ((n->params.max_ioqpairs - 1) << 16));
> +        req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
> +                                        ((n->params.max_ioqpairs - 1) << 16));
>          break;
>      case NVME_ASYNCHRONOUS_EVENT_CONF:
>          n->features.async_config = dw11;
> @@ -2310,16 +3086,28 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>              continue;
>          }
>          ns->params.attached = false;
> -        switch (ns->params.csi) {
> +        switch (ns->csi) {
>          case NVME_CSI_NVM:
>              if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
>                  NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
>                  ns->params.attached = true;
>              }
>              break;
> +        case NVME_CSI_ZONED:
> +            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> +                ns->params.attached = true;
> +            }
> +            break;
>          }
>      }

Like I wrote in my review comment on the patch that added support for the
new allocated CNS values, I would prefer if we removed this for-loop
completely and simply set attached = true in nvme_ns_setup()/nvme_ns_init()
instead.

(I was considering whether we should set attached = true in
nvme_zoned_init_ns(), but because nvme_ns_setup()/nvme_ns_init() is called
for all namespaces, including ZNS namespaces, I don't think any additional
code in nvme_zoned_init_ns() is warranted.)

>  
> +    if (!n->zasl_bs) {
> +        assert(n->params.mdts);
> +        n->zasl = n->params.mdts;
> +    } else {
> +        n->zasl = 31 - clz32(n->zasl_bs / n->page_size);
> +    }
> +
>      nvme_set_timestamp(n, 0ULL);
>  
>      QTAILQ_INIT(&n->aer_queue);
> @@ -2382,10 +3170,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>                  case CSS_NVM_ONLY:
>                      trace_pci_nvme_css_nvm_cset_selected_by_host(data &
>                                                                   0xffffffff);
> -                    break;
> +                break;
>                  case CSS_CSI:
>                      NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
> -                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
> +                    trace_pci_nvme_css_all_csets_sel_by_host(data &
> +                                                             0xffffffff);
>                      break;
>                  case CSS_ADMIN_ONLY:
>                      break;
> @@ -2780,6 +3569,12 @@ static void nvme_init_state(NvmeCtrl *n)
>      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
>      n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
>      n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
> +
> +    if (!n->params.zasl_kb) {
> +        n->zasl_bs = n->params.mdts ? 0 : NVME_DEFAULT_MAX_ZA_SIZE * KiB;
> +    } else {
> +        n->zasl_bs = n->params.zasl_kb * KiB;
> +    }
>  }
>  
>  int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> @@ -2985,8 +3780,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      NVME_CAP_SET_CQR(n->bar.cap, 1);
>      NVME_CAP_SET_TO(n->bar.cap, 0xf);
>      /*
> -     * The device now always supports NS Types, but all commands
> -     * that support CSI field will only handle NVM Command Set.
> +     * The device now always supports NS Types, even when "zoned" property
> +     * is set to zero. If this is the case, all commands that support CSI
> +     * field only handle NVM Command Set.
>       */
>      NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
>      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
> @@ -3033,9 +3829,21 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>  static void nvme_exit(PCIDevice *pci_dev)
>  {
>      NvmeCtrl *n = NVME(pci_dev);
> +    NvmeNamespace *ns;
> +    int i;
>  
>      nvme_clear_ctrl(n);
> +
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
> +
> +        nvme_ns_cleanup(ns);
> +    }
>      g_free(n->namespaces);
> +
>      g_free(n->cq);
>      g_free(n->sq);
>      g_free(n->aer_reqs);
> @@ -3063,6 +3871,8 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64),
>      DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
>      DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
> +    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
> +    DEFINE_PROP_UINT32("zone_append_size_limit", NvmeCtrl, params.zasl_kb, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index a7126e123f..628c665728 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -651,8 +651,10 @@ typedef struct QEMU_PACKED NvmeAerResult {
>  } NvmeAerResult;
>  
>  typedef struct QEMU_PACKED NvmeCqe {
> -    uint32_t    result;
> -    uint32_t    rsvd;
> +    union {
> +        uint64_t     result64;
> +        uint32_t     result32;
> +    };
>      uint16_t    sq_head;
>      uint16_t    sq_id;
>      uint16_t    cid;
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-28  2:35 ` [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
                     ` (3 preceding siblings ...)
  2020-09-30 14:50   ` Niklas Cassel
@ 2020-09-30 15:12   ` Niklas Cassel
  4 siblings, 0 replies; 46+ messages in thread
From: Niklas Cassel @ 2020-09-30 15:12 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Klaus Jensen,
	qemu-devel, Maxim Levitsky, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c         |   2 +-
>  hw/block/nvme-ns.c   | 185 ++++++++-
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c

(snip)

> @@ -1326,11 +2060,20 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
>      acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
>      acs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
>  
> -    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
> -                                  NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +    if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
> +        iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
> +                                      NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +    }
> +
> +    if (csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_CSI) {

Actually, instead of naming the helper function ctrl_has_zns_namespaces(),
a better name might be ctrl_has_zns_support(), since this is what is used
to set the bit in nvme_identify_cmd_set().

Then, I think that this should be:

if (ctrl_has_zns_support() && csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
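
A helper along those lines could simply scan the configured namespaces.
As a rough, untested sketch (the helper name is an assumption; nvme_ns()
and the ns->csi field are taken from this series, not from the patch as
posted):

    static bool nvme_ctrl_has_zns_support(NvmeCtrl *n)
    {
        int i;

        for (i = 1; i <= n->num_namespaces; i++) {
            NvmeNamespace *ns = nvme_ns(n, i);

            /* the controller has ZNS support if any namespace is zoned */
            if (ns && ns->csi == NVME_CSI_ZONED) {
                return true;
            }
        }

        return false;
    }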


> +        iocs[NVME_CMD_ZONE_APPEND] = NVME_CMD_EFFECTS_CSUPP |
> +                                     NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFFECTS_CSUPP;
> +        iocs[NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFFECTS_CSUPP;
> +    }
>  
>      trans_len = MIN(sizeof(cmd_eff_log) - off, buf_len);
>  




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 03/14] hw/block/nvme: Introduce the Namespace Types definitions
  2020-09-28  2:35 ` [PATCH v5 03/14] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
  2020-09-30  8:08   ` Klaus Jensen
@ 2020-09-30 15:21   ` Keith Busch
  1 sibling, 0 replies; 46+ messages in thread
From: Keith Busch @ 2020-09-30 15:21 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Sep 28, 2020 at 11:35:17AM +0900, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Define the structures and constants required to implement
> Namespace Types support.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.h   |  2 ++
>  hw/block/nvme.c      |  2 +-
>  include/block/nvme.h | 74 +++++++++++++++++++++++++++++++++++---------
>  3 files changed, 63 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 83734f4606..cca23bc0b3 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -21,6 +21,8 @@
>  
>  typedef struct NvmeNamespaceParams {
>      uint32_t nsid;
> +    uint8_t  csi;
> +    QemuUUID uuid;

Neither of these new params are used anywhere in this patch.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-30 14:50   ` Niklas Cassel
@ 2020-09-30 18:23     ` Klaus Jensen
  2020-10-04 23:57     ` Dmitry Fomichev
  1 sibling, 0 replies; 46+ messages in thread
From: Klaus Jensen @ 2020-09-30 18:23 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 30 14:50, Niklas Cassel wrote:
> On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> > The emulation code has been changed to advertise NVM Command Set when
> > "zoned" device property is not set (default) and Zoned Namespace
> > Command Set otherwise.
> > 
> > Handlers for three new NVMe commands introduced in Zoned Namespace
> > Command Set specification are added, namely for Zone Management
> > Receive, Zone Management Send and Zone Append.
> > 
> > Device initialization code has been extended to create a proper
> > configuration for zoned operation using device properties.
> > 
> > Read/Write command handler is modified to only allow writes at the
> > write pointer if the namespace is zoned. For Zone Append command,
> > writes implicitly happen at the write pointer and the starting write
> > pointer value is returned as the result of the command. Write Zeroes
> > handler is modified to add zoned checks that are identical to those
> > done as a part of Write flow.
> > 
> > The code to support Zone Descriptor Extensions is not included in
> > this commit and ZDES 0 is always reported. A later commit in this
> > series will add ZDE support.
> > 
> > This commit doesn't yet include checks for active and open zone
> > limits. It is assumed that there are no limits on either active or
> > open zones.
> > 
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> > Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> > Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> > Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> > Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  block/nvme.c         |   2 +-
> >  hw/block/nvme-ns.c   | 185 ++++++++-
> >  hw/block/nvme-ns.h   |   6 +-
> >  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
> >  include/block/nvme.h |   6 +-
> >  5 files changed, 1033 insertions(+), 38 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 05485fdd11..7a513c9a17 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -333,7 +333,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
> >  {
> >      uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
> >      if (status) {
> > -        trace_nvme_error(le32_to_cpu(c->result),
> > +        trace_nvme_error(le32_to_cpu(c->result32),
> >                           le16_to_cpu(c->sq_head),
> >                           le16_to_cpu(c->sq_id),
> >                           le16_to_cpu(c->cid),
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 31b7f986c3..6d9dc9205b 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -33,14 +33,14 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >      NvmeIdNs *id_ns = &ns->id_ns;
> >  
> >      if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
> > -        ns->id_ns.dlfeat = 0x9;
> > +        ns->id_ns.dlfeat = 0x8;
> 
> You seem to change something that is NVM namespace specific here, why?
> If it is indeed needed, I assume that this change should be in a separate
> patch.
> 

Stood out to me as well - and I thought it was sound enough, but now I'm
not so sure.

DLFEAT is set to 0x8, which only signifies that Deallocate in Write
Zeroes is supported. Previously, it would also signify that returned
values would be 0x00 (DLFEAT 0x8 | 0x1). But since Dmitry added the
fill_pattern parameter...


> > +static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
> > +                              Error **errp)
> > +{
> > +    NvmeIdNsZoned *id_ns_z;
> > +
> > +    if (n->params.fill_pattern == 0) {
> > +        ns->id_ns.dlfeat |= 0x01;
> > +    } else if (n->params.fill_pattern == 0xff) {
> > +        ns->id_ns.dlfeat |= 0x02;
> > +    }

... then, when initialized, we look at the fill_pattern and set DLFEAT
accordingly instead.

But since fill_pattern only works for ZNS namespaces, I think dlfeat
should still be 0x9 for NVM namespaces. For NVM namespaces, since
neither DULBE nor DSM is supported, there is really only Write Zeroes
that can explicitly "deallocate" a block, and since that *will* write
zeroes whether or not DEAC is set, a 0x00 pattern is guaranteed.
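
To make the distinction concrete, the initialization could be split
roughly like this (an illustrative sketch only; the helper name is made
up, and the bit values follow the DLFEAT encoding discussed above):

    static void nvme_ns_init_dlfeat(NvmeCtrl *n, NvmeNamespace *ns)
    {
        if (!ns->params.zoned) {
            /* Write Zeroes deallocates; deallocated blocks read 0x00 */
            ns->id_ns.dlfeat = 0x9;
            return;
        }

        /* zoned: Deallocate in Write Zeroes is supported (bit 3)... */
        ns->id_ns.dlfeat = 0x8;

        /* ...and the read value depends on the fill pattern */
        if (n->params.fill_pattern == 0) {
            ns->id_ns.dlfeat |= 0x01;   /* deallocated blocks read 0x00 */
        } else if (n->params.fill_pattern == 0xff) {
            ns->id_ns.dlfeat |= 0x02;   /* deallocated blocks read 0xFF */
        }
    }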


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-09-28  2:35 ` [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
  2020-09-30  8:15   ` Klaus Jensen
  2020-09-30 12:47   ` Niklas Cassel
@ 2020-10-01 11:22   ` Niklas Cassel
  2020-10-01 15:29     ` Keith Busch
  2020-10-01 22:15   ` Klaus Jensen
  3 siblings, 1 reply; 46+ messages in thread
From: Niklas Cassel @ 2020-10-01 11:22 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Klaus Jensen,
	qemu-devel, Maxim Levitsky, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set and enable
> detection for the NVM Command Set.
> 
> The new workflows for identify commands rely heavily on zero-filled
> identify structs. E.g., certain CNS commands are defined to return
> a zero-filled identify struct when an inactive namespace NSID
> is supplied.
> 
> Add a helper function in order to avoid code duplication when
> reporting zero-filled identify structures.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.c |   3 +
>  hw/block/nvme.c    | 210 +++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 175 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index bbd7879492..31b7f986c3 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -40,6 +40,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
>  
>      id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
>  
> +    ns->params.csi = NVME_CSI_NVM;
> +    qemu_uuid_generate(&ns->params.uuid); /* TODO make UUIDs persistent */
> +
>      /* no thin provisioning */
>      id_ns->ncap = id_ns->nsze;
>      id_ns->nuse = id_ns->ncap;
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 29fa005fa2..4ec1ddc90a 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1495,6 +1495,13 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    uint8_t id[NVME_IDENTIFY_DATA_SIZE] = {};
> +
> +    return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
>  static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  {
>      trace_pci_nvme_identify_ctrl();
> @@ -1503,11 +1510,23 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>                      DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +
> +    trace_pci_nvme_identify_ctrl_csi(c->csi);
> +
> +    if (c->csi == NVME_CSI_NVM) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
> +
> +    return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
>  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> -    NvmeIdNs *id_ns, inactive = { 0 };
>      uint32_t nsid = le32_to_cpu(c->nsid);
>  
>      trace_pci_nvme_identify_ns(nsid);
> @@ -1518,23 +1537,46 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
>  
>      ns = nvme_ns(n, nsid);
>      if (unlikely(!ns)) {
> -        id_ns = &inactive;
> -    } else {
> -        id_ns = &ns->id_ns;
> +        return nvme_rpt_empty_id_struct(n, req);
>      }
>  
> -    return nvme_dma(n, (uint8_t *)id_ns, sizeof(NvmeIdNs),
> +    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
>                      DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeNamespace *ns;
> +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    uint32_t nsid = le32_to_cpu(c->nsid);
> +
> +    trace_pci_nvme_identify_ns_csi(nsid, c->csi);
> +
> +    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> +        return NVME_INVALID_NSID | NVME_DNR;
> +    }
> +
> +    ns = nvme_ns(n, nsid);
> +    if (unlikely(!ns)) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
> +
> +    if (c->csi == NVME_CSI_NVM) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
> +
> +    return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
>  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
>  {
> +    NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> -    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
>      uint32_t min_nsid = le32_to_cpu(c->nsid);
> -    uint32_t *list;
> -    uint16_t ret;
> -    int j = 0;
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> +    static const int data_len = sizeof(list);
> +    uint32_t *list_ptr = (uint32_t *)list;
> +    int i, j = 0;
>  
>      trace_pci_nvme_identify_nslist(min_nsid);
>  
> @@ -1548,48 +1590,76 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
>          return NVME_INVALID_NSID | NVME_DNR;
>      }
>  
> -    list = g_malloc0(data_len);
> -    for (int i = 1; i <= n->num_namespaces; i++) {
> -        if (i <= min_nsid || !nvme_ns(n, i)) {
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
>              continue;
>          }
> -        list[j++] = cpu_to_le32(i);
> +        if (ns->params.nsid < min_nsid) {
> +            continue;
> +        }
> +        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
>          if (j == data_len / sizeof(uint32_t)) {
>              break;
>          }
>      }
> -    ret = nvme_dma(n, (uint8_t *)list, data_len, DMA_DIRECTION_FROM_DEVICE,
> -                   req);
> -    g_free(list);
> -    return ret;
> +
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
> +static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeNamespace *ns;
> +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    uint32_t min_nsid = le32_to_cpu(c->nsid);
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> +    static const int data_len = sizeof(list);
> +    uint32_t *list_ptr = (uint32_t *)list;
> +    int i, j = 0;
> +
> +    trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
> +
> +    if (c->csi != NVME_CSI_NVM) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
> +        if (ns->params.nsid < min_nsid) {
> +            continue;
> +        }
> +        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
> +        if (j == data_len / sizeof(uint32_t)) {
> +            break;
> +        }
> +    }
> +
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    NvmeNamespace *ns;
>      uint32_t nsid = le32_to_cpu(c->nsid);
> -    uint8_t list[NVME_IDENTIFY_DATA_SIZE];
> -
> -    struct data {
> -        struct {
> -            NvmeIdNsDescr hdr;
> -            uint8_t v[16];
> -        } uuid;
> -    };
> -
> -    struct data *ns_descrs = (struct data *)list;
> +    NvmeIdNsDescr *desc;
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> +    static const int data_len = sizeof(list);
> +    void *list_ptr = list;
>  
>      trace_pci_nvme_identify_ns_descr_list(nsid);
>  
> -    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> -        return NVME_INVALID_NSID | NVME_DNR;
> -    }
> -
>      if (unlikely(!nvme_ns(n, nsid))) {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    memset(list, 0x0, sizeof(list));
> +    ns = nvme_ns(n, nsid);
> +    if (unlikely(!ns)) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
>  
>      /*
>       * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
> @@ -1597,12 +1667,31 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>       * Namespace Identification Descriptor. Add a very basic Namespace UUID
>       * here.
>       */
> -    ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
> -    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
> -    stl_be_p(&ns_descrs->uuid.v, nsid);
> +    desc = list_ptr;
> +    desc->nidt = NVME_NIDT_UUID;
> +    desc->nidl = NVME_NIDL_UUID;
> +    list_ptr += sizeof(*desc);
> +    memcpy(list_ptr, ns->params.uuid.data, NVME_NIDL_UUID);
> +    list_ptr += NVME_NIDL_UUID;
>  
> -    return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
> -                    DMA_DIRECTION_FROM_DEVICE, req);
> +    desc = list_ptr;
> +    desc->nidt = NVME_NIDT_CSI;
> +    desc->nidl = NVME_NIDL_CSI;
> +    list_ptr += sizeof(*desc);
> +    *(uint8_t *)list_ptr = NVME_CSI_NVM;
> +
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
> +static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> +    static const int data_len = sizeof(list);
> +
> +    trace_pci_nvme_identify_cmd_set();
> +
> +    NVME_SET_CSI(*list, NVME_CSI_NVM);
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
> @@ -1612,12 +1701,20 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
>      switch (le32_to_cpu(c->cns)) {
>      case NVME_ID_CNS_NS:
>          return nvme_identify_ns(n, req);
> +    case NVME_ID_CNS_CS_NS:
> +        return nvme_identify_ns_csi(n, req);
>      case NVME_ID_CNS_CTRL:
>          return nvme_identify_ctrl(n, req);
> +    case NVME_ID_CNS_CS_CTRL:
> +        return nvme_identify_ctrl_csi(n, req);
>      case NVME_ID_CNS_NS_ACTIVE_LIST:
>          return nvme_identify_nslist(n, req);
> +    case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
> +        return nvme_identify_nslist_csi(n, req);
>      case NVME_ID_CNS_NS_DESCR_LIST:
>          return nvme_identify_ns_descr_list(n, req);
> +    case NVME_ID_CNS_IO_COMMAND_SET:
> +        return nvme_identify_cmd_set(n, req);
>      default:
>          trace_pci_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1799,6 +1896,9 @@ defaults:
>              result |= NVME_INTVC_NOCOALESCING;
>          }
>  
> +        break;
> +    case NVME_COMMAND_SET_PROFILE:
> +        result = 0;
>          break;
>      default:
>          result = nvme_feature_default[fid];
> @@ -1939,6 +2039,12 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_set_feature_timestamp(n, req);
> +    case NVME_COMMAND_SET_PROFILE:
> +        if (dw11 & 0x1ff) {
> +            trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
> +            return NVME_CMD_SET_CMB_REJECTED | NVME_DNR;
> +        }
> +        break;
>      default:
>          return NVME_FEAT_NOT_CHANGEABLE | NVME_DNR;
>      }
> @@ -2222,6 +2328,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>          break;
>      case 0x14:  /* CC */
>          trace_pci_nvme_mmio_cfg(data & 0xffffffff);
> +
> +        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
> +            if (NVME_CC_EN(n->bar.cc)) {

I just saw this print when doing controller reset on a live system.

Added a debug print:
nvme_write_bar WRITING: 0x0 previous: 0x464061

so the second if-statement has to be:

    if (NVME_CC_EN(n->bar.cc) && NVME_CC_EN(data)) {

Sorry for introducing the bug in the first place.


Kind regards,
Niklas

> +                NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
> +                               "changing selected command set when enabled");
> +            } else {
> +                switch (NVME_CC_CSS(data)) {
> +                case CSS_NVM_ONLY:
> +                    trace_pci_nvme_css_nvm_cset_selected_by_host(data &
> +                                                                 0xffffffff);
> +                    break;
> +                case CSS_CSI:
> +                    NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
> +                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
> +                    break;
> +                case CSS_ADMIN_ONLY:
> +                    break;
> +                default:
> +                    NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
> +                                   "unknown value in CC.CSS field");
> +                }
> +            }
> +        }
> +
>          /* Windows first sends data, then sends enable bit */
>          if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
>              !NVME_CC_SHN(data) && !NVME_CC_SHN(n->bar.cc))
> @@ -2810,7 +2940,11 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
>      NVME_CAP_SET_CQR(n->bar.cap, 1);
>      NVME_CAP_SET_TO(n->bar.cap, 0xf);
> -    NVME_CAP_SET_CSS(n->bar.cap, 1);
> +    /*
> +     * The device now always supports NS Types, but all commands
> +     * that support CSI field will only handle NVM Command Set.
> +     */
> +    NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
>      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
>  
>      n->bar.vs = NVME_SPEC_VER;
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-10-01 11:22   ` Niklas Cassel
@ 2020-10-01 15:29     ` Keith Busch
  2020-10-01 15:50       ` Niklas Cassel
  0 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2020-10-01 15:29 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Philippe Mathieu-Daudé,
	Matias Bjorling

On Thu, Oct 01, 2020 at 11:22:46AM +0000, Niklas Cassel wrote:
> On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> > @@ -2222,6 +2328,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
> >          break;
> >      case 0x14:  /* CC */
> >          trace_pci_nvme_mmio_cfg(data & 0xffffffff);
> > +
> > +        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
> > +            if (NVME_CC_EN(n->bar.cc)) {
> 
> I just saw this print when doing controller reset on a live system.
> 
> Added a debug print:
> nvme_write_bar WRITING: 0x0 previous: 0x464061
> 
> so the second if-statement has to be:
> 
>     if (NVME_CC_EN(n->bar.cc) && NVME_CC_EN(data)) {
> 
> Sorry for introducing the bug in the first place.

No worries.

I don't think the check should be here at all, really. The only check for valid
CSS should be in nvme_start_ctrl(), which I posted yesterday.
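
For readers following along, the idea is to validate CC.CSS once, when
the controller is started. Something like the following sketch (the
trace event name is illustrative; this is not the actual posted patch):

    /* in nvme_start_ctrl(), before enabling the controller */
    switch (NVME_CC_CSS(n->bar.cc)) {
    case CSS_NVM_ONLY:
    case CSS_CSI:
    case CSS_ADMIN_ONLY:
        break;
    default:
        trace_pci_nvme_err_startfail_css(NVME_CC_CSS(n->bar.cc));
        return -1;    /* refuse to enable the controller */
    }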
 
> > +                NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
> > +                               "changing selected command set when enabled");
> > +            } else {
> > +                switch (NVME_CC_CSS(data)) {
> > +                case CSS_NVM_ONLY:
> > +                    trace_pci_nvme_css_nvm_cset_selected_by_host(data &
> > +                                                                 0xffffffff);
> > +                    break;
> > +                case CSS_CSI:
> > +                    NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
> > +                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
> > +                    break;
> > +                case CSS_ADMIN_ONLY:
> > +                    break;
> > +                default:
> > +                    NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
> > +                                   "unknown value in CC.CSS field");
> > +                }
> > +            }
> > +        }


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-10-01 15:29     ` Keith Busch
@ 2020-10-01 15:50       ` Niklas Cassel
  2020-10-01 15:59         ` Keith Busch
  0 siblings, 1 reply; 46+ messages in thread
From: Niklas Cassel @ 2020-10-01 15:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Philippe Mathieu-Daudé,
	Matias Bjorling

On Thu, Oct 01, 2020 at 09:29:22AM -0600, Keith Busch wrote:
> On Thu, Oct 01, 2020 at 11:22:46AM +0000, Niklas Cassel wrote:
> > On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> > > From: Niklas Cassel <niklas.cassel@wdc.com>
> > > @@ -2222,6 +2328,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
> > >          break;
> > >      case 0x14:  /* CC */
> > >          trace_pci_nvme_mmio_cfg(data & 0xffffffff);
> > > +
> > > +        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
> > > +            if (NVME_CC_EN(n->bar.cc)) {
> > 
> > I just saw this print when doing controller reset on a live system.
> > 
> > Added a debug print:
> > nvme_write_bar WRITING: 0x0 previous: 0x464061
> > 
> > so the second if-statement has to be:
> > 
> >     if (NVME_CC_EN(n->bar.cc) && NVME_CC_EN(data)) {
> > 
> > Sorry for introducing the bug in the first place.
> 
> No worries.
> 
> I don't think the check should be here at all, really. The only check for valid
> CSS should be in nvme_start_ctrl(), which I posted yesterday.

The reasoning for this additional check is this:

From the CC.CSS register description:

"This field shall only be changed when the controller
is disabled (CC.EN is cleared to ‘0’)."

In the QEMU model, we have functions, e.g. nvme_cmd_effects(),
that use n->bar.cc "at runtime".

So I don't think that simply checking for valid CSS in
nvme_start_ctrl() is sufficient.

Thoughts?


Kind regards,
Niklas

>  
> > > +                NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
> > > +                               "changing selected command set when enabled");
> > > +            } else {
> > > +                switch (NVME_CC_CSS(data)) {
> > > +                case CSS_NVM_ONLY:
> > > +                    trace_pci_nvme_css_nvm_cset_selected_by_host(data &
> > > +                                                                 0xffffffff);
> > > +                    break;
> > > +                case CSS_CSI:
> > > +                    NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
> > > +                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
> > > +                    break;
> > > +                case CSS_ADMIN_ONLY:
> > > +                    break;
> > > +                default:
> > > +                    NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
> > > +                                   "unknown value in CC.CSS field");
> > > +                }
> > > +            }
> > > +        }

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-10-01 15:50       ` Niklas Cassel
@ 2020-10-01 15:59         ` Keith Busch
  2020-10-01 16:23           ` Niklas Cassel
  0 siblings, 1 reply; 46+ messages in thread
From: Keith Busch @ 2020-10-01 15:59 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Philippe Mathieu-Daudé,
	Matias Bjorling

On Thu, Oct 01, 2020 at 03:50:35PM +0000, Niklas Cassel wrote:
> On Thu, Oct 01, 2020 at 09:29:22AM -0600, Keith Busch wrote:
> > On Thu, Oct 01, 2020 at 11:22:46AM +0000, Niklas Cassel wrote:
> > > On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> > > > From: Niklas Cassel <niklas.cassel@wdc.com>
> > > > @@ -2222,6 +2328,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
> > > >          break;
> > > >      case 0x14:  /* CC */
> > > >          trace_pci_nvme_mmio_cfg(data & 0xffffffff);
> > > > +
> > > > +        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
> > > > +            if (NVME_CC_EN(n->bar.cc)) {
> > > 
> > > I just saw this print when doing controller reset on a live system.
> > > 
> > > Added a debug print:
> > > nvme_write_bar WRITING: 0x0 previous: 0x464061
> > > 
> > > so the second if-statement has to be:
> > > 
> > >     if (NVME_CC_EN(n->bar.cc) && NVME_CC_EN(data)) {
> > > 
> > > Sorry for introducing the bug in the first place.
> > 
> > No worries.
> > 
> > I don't think the check should be here at all, really. The only check for valid
> > CSS should be in nvme_start_ctrl(), which I posted yesterday.
> 
> The reasoning for this additional check is this:
> 
> From the CC.CSS register description:
> 
> "This field shall only be changed when the controller
> is disabled (CC.EN is cleared to ‘0’)."
> 
> In the QEMU model, we have functions, e.g. nvme_cmd_effects(),
> that use n->bar.cc "at runtime".
> 
> So I don't think that simply checking for valid CSS in
> nvme_start_ctrl() is sufficient.
> 
> Thoughts?

The QEMU controller accepts host register writes only for valid enable
and shutdown bit transitions. Or at least it should. If not, then we
need to fix that, but that's not specific to the CSS bits.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-10-01 15:59         ` Keith Busch
@ 2020-10-01 16:23           ` Niklas Cassel
  2020-10-01 17:08             ` Keith Busch
  0 siblings, 1 reply; 46+ messages in thread
From: Niklas Cassel @ 2020-10-01 16:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Philippe Mathieu-Daudé,
	Matias Bjorling

On Thu, Oct 01, 2020 at 08:59:31AM -0700, Keith Busch wrote:
> On Thu, Oct 01, 2020 at 03:50:35PM +0000, Niklas Cassel wrote:
> > On Thu, Oct 01, 2020 at 09:29:22AM -0600, Keith Busch wrote:
> > > On Thu, Oct 01, 2020 at 11:22:46AM +0000, Niklas Cassel wrote:
> > > > On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> > > > > From: Niklas Cassel <niklas.cassel@wdc.com>
> > > > > @@ -2222,6 +2328,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
> > > > >          break;
> > > > >      case 0x14:  /* CC */
> > > > >          trace_pci_nvme_mmio_cfg(data & 0xffffffff);
> > > > > +
> > > > > +        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
> > > > > +            if (NVME_CC_EN(n->bar.cc)) {
> > > > 
> > > > I just saw this print when doing controller reset on a live system.
> > > > 
> > > > Added a debug print:
> > > > nvme_write_bar WRITING: 0x0 previous: 0x464061
> > > > 
> > > > so the second if-statement has to be:
> > > > 
> > > >     if (NVME_CC_EN(n->bar.cc) && NVME_CC_EN(data)) {
> > > > 
> > > > Sorry for introducing the bug in the first place.
> > > 
> > > No worries.
> > > 
> > > I don't think the check should be here at all, really. The only check for valid
> > > CSS should be in nvme_start_ctrl(), which I posted yesterday.
> > 
> > The reasoning for this additional check is this:
> > 
> > From the CC.CSS register description:
> > 
> > "This field shall only be changed when the controller
> > is disabled (CC.EN is cleared to ‘0’)."
> > 
> > In the QEMU model, we have functions, e.g. nvme_cmd_effects(),
> > that use n->bar.cc "at runtime".
> > 
> > So I don't think that simply checking for valid CSS in
> > nvme_start_ctrl() is sufficient.
> > 
> > Thoughts?
> 
> The QEMU controller accepts host register writes only for valid enable
> and shutdown bit transitions. Or at least it should. If not, then we
> need to fix that, but that's not specific to the CSS bits.

I simply added the second if-statement (if (NVME_CC_EN(n->bar.cc)));
the rest of the NVME_CC_CSS handling was written by someone else.

But I see your point, all of this code:

        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
            if (NVME_CC_EN(n->bar.cc)) {
                NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
                               "changing selected command set when enabled");
            } else {
                switch (NVME_CC_CSS(data)) {
                case CSS_NVM_ONLY:
                    trace_pci_nvme_css_nvm_cset_selected_by_host(data &
                                                                 0xffffffff);
                break;
                case CSS_CSI:
                    NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
                    trace_pci_nvme_css_all_csets_sel_by_host(data &
                                                             0xffffffff);
                    break;
                case CSS_ADMIN_ONLY:
                    break;
                default:
                    NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
                                   "unknown value in CC.CSS field");
                }
            }
        }

should simply be dropped.

No need to call NVME_SET_CC_CSS() explicitly.

The CC.CSS field will be set further down in this function anyway:

        if (NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc)) {
            n->bar.cc = data;


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-10-01 16:23           ` Niklas Cassel
@ 2020-10-01 17:08             ` Keith Busch
  0 siblings, 0 replies; 46+ messages in thread
From: Keith Busch @ 2020-10-01 17:08 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Philippe Mathieu-Daudé,
	Matias Bjorling

On Thu, Oct 01, 2020 at 04:23:56PM +0000, Niklas Cassel wrote:
> But I see your point, all of this code:
> 
>         if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
>             if (NVME_CC_EN(n->bar.cc)) {
>                 NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
>                                "changing selected command set when enabled");
>             } else {
>                 switch (NVME_CC_CSS(data)) {
>                 case CSS_NVM_ONLY:
>                     trace_pci_nvme_css_nvm_cset_selected_by_host(data &
>                                                                  0xffffffff);
>                 break;
>                 case CSS_CSI:
>                     NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
>                     trace_pci_nvme_css_all_csets_sel_by_host(data &
>                                                              0xffffffff);
>                     break;
>                 case CSS_ADMIN_ONLY:
>                     break;
>                 default:
>                     NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
>                                    "unknown value in CC.CSS field");
>                 }
>             }
>         }
> 
> should simply be dropped.
> 
> No need to call NVME_SET_CC_CSS() explicitly.
> 
> The CC.CSS field will be set further down in this function anyway:
> 
>         if (NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc)) {
>             n->bar.cc = data;

Yep, that's how I saw it too. I folded it all into a rebase here for
this particular patch:

  http://git.infradead.org/qemu-nvme.git/commitdiff/45157cab2e700155b05f0bd28533f73d7e399ab8?hp=2015774a010011a9e8d2ab5291fd8d747f60471e

It depends on the prep patches I sent yesterday, which seem pretty
straightforward. I'll just wait another day before committing that
stuff and other fixes to the nvme-next branch. But if you want to get a
head start on the ZNS enabling parts, what I have in mind is in the
branch from the above link.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-09-28  2:35 ` [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
                     ` (2 preceding siblings ...)
  2020-10-01 11:22   ` Niklas Cassel
@ 2020-10-01 22:15   ` Klaus Jensen
  2020-10-01 22:30     ` Dmitry Fomichev
  3 siblings, 1 reply; 46+ messages in thread
From: Klaus Jensen @ 2020-10-01 22:15 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling


On Sep 28 11:35, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set and enable
> detection for the NVM Command Set.
> 
> The new workflows for identify commands rely heavily on zero-filled
> identify structs. E.g., certain CNS commands are defined to return
> a zero-filled identify struct when an inactive namespace NSID
> is supplied.
> 
> Add a helper function in order to avoid code duplication when
> reporting zero-filled identify structures.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.c |   3 +
>  hw/block/nvme.c    | 210 +++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 175 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index bbd7879492..31b7f986c3 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c

The following looks like a rebase gone wrong.

There are some redundant checks and wrong return values.

>  static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    NvmeNamespace *ns;
>      uint32_t nsid = le32_to_cpu(c->nsid);
> -    uint8_t list[NVME_IDENTIFY_DATA_SIZE];
> -
> -    struct data {
> -        struct {
> -            NvmeIdNsDescr hdr;
> -            uint8_t v[16];
> -        } uuid;
> -    };
> -
> -    struct data *ns_descrs = (struct data *)list;
> +    NvmeIdNsDescr *desc;
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> +    static const int data_len = sizeof(list);
> +    void *list_ptr = list;

Oh maaan, please do not replace my nicely cleaned up code with pointer
arithmetic :(

>  
>      trace_pci_nvme_identify_ns_descr_list(nsid);
>  
> -    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> -        return NVME_INVALID_NSID | NVME_DNR;
> -    }
> -

This removal looks wrong.

>      if (unlikely(!nvme_ns(n, nsid))) {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    memset(list, 0x0, sizeof(list));
> +    ns = nvme_ns(n, nsid);
> +    if (unlikely(!ns)) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
>  

And this doesn't look like it belongs (it's checked just a few lines
before, and it returns an error status as it should).

>      /*
>       * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
> @@ -1597,12 +1667,31 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>       * Namespace Identification Descriptor. Add a very basic Namespace UUID
>       * here.
>       */
> -    ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
> -    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
> -    stl_be_p(&ns_descrs->uuid.v, nsid);
> +    desc = list_ptr;
> +    desc->nidt = NVME_NIDT_UUID;
> +    desc->nidl = NVME_NIDL_UUID;
> +    list_ptr += sizeof(*desc);
> +    memcpy(list_ptr, ns->params.uuid.data, NVME_NIDL_UUID);
> +    list_ptr += NVME_NIDL_UUID;
>  
> -    return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
> -                    DMA_DIRECTION_FROM_DEVICE, req);
> +    desc = list_ptr;
> +    desc->nidt = NVME_NIDT_CSI;
> +    desc->nidl = NVME_NIDL_CSI;
> +    list_ptr += sizeof(*desc);
> +    *(uint8_t *)list_ptr = NVME_CSI_NVM;
> +
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +



^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types
  2020-10-01 22:15   ` Klaus Jensen
@ 2020-10-01 22:30     ` Dmitry Fomichev
  0 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-10-01 22:30 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Thursday, October 1, 2020 6:15 PM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace
> Types
> 
> On Sep 28 11:35, Dmitry Fomichev wrote:
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> >
> > Namespace Types introduce a new command set, "I/O Command Sets",
> > that allows the host to retrieve the command sets associated with
> > a namespace. Introduce support for the command set and enable
> > detection for the NVM Command Set.
> >
> > The new workflows for identify commands rely heavily on zero-filled
> > identify structs. E.g., certain CNS commands are defined to return
> > a zero-filled identify struct when an inactive namespace NSID
> > is supplied.
> >
> > Add a helper function in order to avoid code duplication when
> > reporting zero-filled identify structures.
> >
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme-ns.c |   3 +
> >  hw/block/nvme.c    | 210 +++++++++++++++++++++++++++++++++++++-
> -------
> >  2 files changed, 175 insertions(+), 38 deletions(-)
> >
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index bbd7879492..31b7f986c3 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> 
> The following looks like a rebase gone wrong.
> 
> There are some redundant checks and wrong return values.
> 
> >  static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest
> *req)
> >  {
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > +    NvmeNamespace *ns;
> >      uint32_t nsid = le32_to_cpu(c->nsid);
> > -    uint8_t list[NVME_IDENTIFY_DATA_SIZE];
> > -
> > -    struct data {
> > -        struct {
> > -            NvmeIdNsDescr hdr;
> > -            uint8_t v[16];
> > -        } uuid;
> > -    };
> > -
> > -    struct data *ns_descrs = (struct data *)list;
> > +    NvmeIdNsDescr *desc;
> > +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> > +    static const int data_len = sizeof(list);
> > +    void *list_ptr = list;
> 
> Oh maaan, please do not replace my nicely cleaned up code with pointer
> arithmetic :(
> 
> >
> >      trace_pci_nvme_identify_ns_descr_list(nsid);
> >
> > -    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> > -        return NVME_INVALID_NSID | NVME_DNR;
> > -    }
> > -
> 
> This removal looks wrong.
> 
> >      if (unlikely(!nvme_ns(n, nsid))) {
> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> >
> > -    memset(list, 0x0, sizeof(list));
> > +    ns = nvme_ns(n, nsid);
> > +    if (unlikely(!ns)) {
> > +        return nvme_rpt_empty_id_struct(n, req);
> > +    }
> >
> 
> And this doesn't look like it belongs (it's checked just a few lines
> before, and it returns an error status as it should).
> 

This and the above look like a merge error. I am correcting this along
with moving the UUID calculation to a separate commit.

> >      /*
> >       * Because the NGUID and EUI64 fields are 0 in the Identify Namespace
> data
> > @@ -1597,12 +1667,31 @@ static uint16_t
> nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
> >       * Namespace Identification Descriptor. Add a very basic Namespace
> UUID
> >       * here.
> >       */
> > -    ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
> > -    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
> > -    stl_be_p(&ns_descrs->uuid.v, nsid);
> > +    desc = list_ptr;
> > +    desc->nidt = NVME_NIDT_UUID;
> > +    desc->nidl = NVME_NIDL_UUID;
> > +    list_ptr += sizeof(*desc);
> > +    memcpy(list_ptr, ns->params.uuid.data, NVME_NIDL_UUID);
> > +    list_ptr += NVME_NIDL_UUID;
> >
> > -    return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
> > -                    DMA_DIRECTION_FROM_DEVICE, req);
> > +    desc = list_ptr;
> > +    desc->nidt = NVME_NIDT_CSI;
> > +    desc->nidl = NVME_NIDL_CSI;
> > +    list_ptr += sizeof(*desc);
> > +    *(uint8_t *)list_ptr = NVME_CSI_NVM;
> > +
> > +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE,
> req);
> > +}
> > +


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-30  5:59   ` Klaus Jensen
@ 2020-10-04 23:48     ` Dmitry Fomichev
  0 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-10-04 23:48 UTC (permalink / raw)
  To: its
  Cc: fam, Niklas Cassel, Damien Le Moal, qemu-block, k.jensen,
	qemu-devel, mlevitsk, Alistair Francis, kbusch, kwolf, philmd,
	Matias Bjorling

On Wed, 2020-09-30 at 07:59 +0200, Klaus Jensen wrote:
> On Sep 28 11:35, Dmitry Fomichev wrote:
> > The emulation code has been changed to advertise NVM Command Set when
> > "zoned" device property is not set (default) and Zoned Namespace
> > Command Set otherwise.
> > 
> > Handlers for three new NVMe commands introduced in Zoned Namespace
> > Command Set specification are added, namely for Zone Management
> > Receive, Zone Management Send and Zone Append.
> > 
> > Device initialization code has been extended to create a proper
> > configuration for zoned operation using device properties.
> > 
> > Read/Write command handler is modified to only allow writes at the
> > write pointer if the namespace is zoned. For Zone Append command,
> > writes implicitly happen at the write pointer and the starting write
> > pointer value is returned as the result of the command. Write Zeroes
> > handler is modified to add zoned checks that are identical to those
> > done as a part of Write flow.
> > 
> > The code to support Zone Descriptor Extensions is not included in
> > this commit and ZDES 0 is always reported. A later commit in this
> > series will add ZDE support.
> > 
> > This commit doesn't yet include checks for active and open zone
> > limits. It is assumed that there are no limits on either active or
> > open zones.
> > 
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> > Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> > Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> > Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> > Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  block/nvme.c         |   2 +-
> >  hw/block/nvme-ns.c   | 185 ++++++++-
> >  hw/block/nvme-ns.h   |   6 +-
> >  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
> >  include/block/nvme.h |   6 +-
> >  5 files changed, 1033 insertions(+), 38 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 05485fdd11..7a513c9a17 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > +static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
> > +{
> > +    uint64_t zone_size, zone_cap;
> > +    uint32_t nz, lbasz = ns->blkconf.logical_block_size;
> > +
> > +    if (ns->params.zone_size_mb) {
> > +        zone_size = ns->params.zone_size_mb;
> > +    } else {
> > +        zone_size = NVME_DEFAULT_ZONE_SIZE;
> > +    }
> > +    if (ns->params.zone_capacity_mb) {
> > +        zone_cap = ns->params.zone_capacity_mb;
> > +    } else {
> > +        zone_cap = zone_size;
> > +    }
> 
> I think a check that zone_capacity_mb is less than or equal to
> zone_size_mb is missing earlier?

The check is right below, but I now think it is better to
compare byte sizes rather than numbers of LBAs. There are also
checks missing for zone_size >= lbasz and zone_cap >= lbasz that
I am adding.
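
Roughly, the added byte-size checks could look like the following
(a sketch that assumes zone_size and zone_cap have already been
converted from the megabyte parameters to byte values; the error
messages are illustrative):

    uint32_t lbasz = ns->blkconf.logical_block_size;

    if (zone_cap > zone_size) {
        error_setg(errp, "zone capacity %" PRIu64 "B exceeds zone size %" PRIu64 "B",
                   zone_cap, zone_size);
        return -1;
    }
    if (zone_size < lbasz) {
        error_setg(errp, "zone size %" PRIu64 "B too small, must be at least %" PRIu32 "B",
                   zone_size, lbasz);
        return -1;
    }
    if (zone_cap < lbasz) {
        error_setg(errp, "zone capacity %" PRIu64 "B too small, must be at least %" PRIu32 "B",
                   zone_cap, lbasz);
        return -1;
    }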

> 
> > +static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
> > +                              Error **errp)
> > +{
> > +    NvmeIdNsZoned *id_ns_z;
> > +
> > +    if (n->params.fill_pattern == 0) {
> > +        ns->id_ns.dlfeat |= 0x01;
> > +    } else if (n->params.fill_pattern == 0xff) {
> > +        ns->id_ns.dlfeat |= 0x02;
> > +    }
> > +
> > +    if (nvme_calc_zone_geometry(ns, errp) != 0) {
> > +        return -1;
> > +    }
> > +
> > +    nvme_init_zone_meta(ns);
> > +
> > +    id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
> > +
> > +    /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
> > +    id_ns_z->mar = 0xffffffff;
> > +    id_ns_z->mor = 0xffffffff;
> > +    id_ns_z->zoc = 0;
> > +    id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
> > +
> > +    id_ns_z->lbafe[lba_index].zsze = cpu_to_le64(ns->zone_size);
> > +    id_ns_z->lbafe[lba_index].zdes = 0; /* FIXME make helper */
> > +
> > +    ns->csi = NVME_CSI_ZONED;
> > +    ns->id_ns.ncap = cpu_to_le64(ns->zone_capacity * ns->num_zones);
> > +    ns->id_ns.nuse = ns->id_ns.ncap;
> > +    ns->id_ns.nsze = ns->id_ns.ncap;
> > +
> 
> NSZE should be in terms of ZSZE. We *can* report NCAP < NSZE if zcap !=
> zsze, but that requires bit 1 set in NSFEAT and proper reporting of
> NUSE.

Ok, will correct. I think it used to be this way, but got messed up
during multiple transformations of this code.
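
For the record, the corrected sizing would presumably be along these
lines (a sketch; NUSE accounting is still simplified here, and the
field math is my assumption, not the updated patch):

    ns->id_ns.nsze = cpu_to_le64(ns->num_zones * ns->zone_size);
    ns->id_ns.ncap = cpu_to_le64(ns->num_zones * ns->zone_capacity);
    ns->id_ns.nuse = ns->id_ns.ncap;    /* no per-zone usage tracking yet */

    if (ns->zone_capacity < ns->zone_size) {
        /* NSFEAT bit 1, as mentioned above: NCAP may be less than NSZE */
        ns->id_ns.nsfeat |= 0x2;
    }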

> 
> > @@ -133,6 +304,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
> >  static Property nvme_ns_props[] = {
> >      DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
> >      DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
> > +
> > +    DEFINE_PROP_BOOL("zoned", NvmeNamespace, params.zoned, false),
> > +    DEFINE_PROP_UINT64("zone_size", NvmeNamespace, params.zone_size_mb,
> > +                       NVME_DEFAULT_ZONE_SIZE),
> > +    DEFINE_PROP_UINT64("zone_capacity", NvmeNamespace,
> > +                       params.zone_capacity_mb, 0),
> 
> There is a nice DEFINE_PROP_SIZE that handles sizes in a nice way (e.g.
> 1G, 1M).

It would be nice to use these types, I will add them. _SIZE32 sounds like
a good candidate to use for ZASL.
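
As a sketch of what the properties might become with DEFINE_PROP_SIZE
(the byte-based field names are assumptions, not from the posted series,
and NVME_DEFAULT_ZONE_SIZE would need to be redefined in bytes):

    DEFINE_PROP_SIZE("zone_size", NvmeNamespace, params.zone_size_bs,
                     NVME_DEFAULT_ZONE_SIZE),
    DEFINE_PROP_SIZE("zone_capacity", NvmeNamespace, params.zone_cap_bs, 0),

This would let a user specify e.g. zone_size=128M on the command line
directly.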

Thank you for your valuable feedback!
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-09-30 13:50   ` Niklas Cassel
@ 2020-10-04 23:54     ` Dmitry Fomichev
  2020-10-05 11:26       ` Niklas Cassel
  0 siblings, 1 reply; 46+ messages in thread
From: Dmitry Fomichev @ 2020-10-04 23:54 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: fam, kwolf, Damien Le Moal, qemu-block, k.jensen, qemu-devel,
	mlevitsk, Alistair Francis, kbusch, philmd, Matias Bjorling

On Wed, 2020-09-30 at 13:50 +0000, Niklas Cassel wrote:
> On Mon, Sep 28, 2020 at 11:35:20AM +0900, Dmitry Fomichev wrote:
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> > 
> > In NVMe, a namespace is active if it exists and is attached to the
> > controller.
> > 
> > CAP.CSS (together with the I/O Command Set data structure) defines what
> > command sets are supported by the controller.
> > 
> > CC.CSS (together with Set Profile) can be set to enable a subset of the
> > available command sets. The namespaces belonging to a disabled command set
> > will not be able to attach to the controller, and will thus be inactive.
> > 
> > E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> > marked as inactive.
> > 
> > The identify namespace, the identify namespace CSI specific, and the namespace
> > list commands have two different versions, one that only shows active
> > namespaces, and the other version that shows existing namespaces, regardless
> > of whether the namespace is attached or not.
> > 
> > Add an attached member to struct NvmeNamespace, and implement the missing CNS
> > commands.
> > 
> > The added functionality will also simplify the implementation of namespace
> > management in the future, since namespace management can also attach and
> > detach namespaces.
> 
> Following my previous discussion with Klaus,
> I think we need to rewrite this commit message completely:
> 
> Subject: hw/block/nvme: Add support for allocated CNS command variants
> 
> Many CNS commands have "allocated" command variants.
> These include a namespace as long as it is allocated
> (i.e. a namespace is included regardless of whether it is active
> (attached) or not).
> 
> While these commands are optional (they are mandatory only for controllers
> supporting the namespace attachment command), our QEMU implementation
> becomes more complete by actually providing support for these CNS values.
> 
> However, since our QEMU model currently does not support the namespace
> attachment command, these new allocated CNS commands will return the same
> result as the active CNS command variants.
> 
> In NVMe, a namespace is active if it exists and is attached to the
> controller.
> 
> CAP.CSS (together with the I/O Command Set data structure) defines what
> command sets are supported by the controller.
> 
> CC.CSS (together with Set Profile) can be set to enable a subset of the
> available command sets.
> 
> Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
> will still be attached (and thus marked as active).
> Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
> will still be attached (and thus marked as active).
> 
> However, any operation from a disabled command set will result in an
> Invalid Command Opcode.
> 
> Add an attached struct member for struct NvmeNamespace,
> so that we lay the foundation for namespace attachment
> support. Also implement logic in the new CNS values to
> include/exclude namespaces based on this new struct member.
> The only thing missing is hooking up the actual Namespace Attachment
> command opcode, which allows a user to toggle the attached
> variable per namespace. The reason for not hooking up this
> command completely is that the NVMe specification
> requires that the namespace management command is supported
> if the namespace attachment command is supported.
> 

OK, putting this in.

> 
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme-ns.h   |  1 +
> >  hw/block/nvme.c      | 60 ++++++++++++++++++++++++++++++++++++++------
> >  include/block/nvme.h | 20 +++++++++------
> >  3 files changed, 65 insertions(+), 16 deletions(-)
> > 
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index cca23bc0b3..acdb76f058 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -22,6 +22,7 @@
> >  typedef struct NvmeNamespaceParams {
> >      uint32_t nsid;
> >      uint8_t  csi;
> > +    bool     attached;
> >      QemuUUID uuid;
> >  } NvmeNamespaceParams;
> >  
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 4ec1ddc90a..63ad03d6d6 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> 
> We need to add an additional check in nvme_io_cmd()
> that returns Invalid Command Opcode when CC.CSS == Admin only.
> 

I think Keith has this addition already queued. 

> > @@ -1523,7 +1523,8 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
> >      return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> >  
> > -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
> > +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
> > +                                 bool only_active)
> >  {
> >      NvmeNamespace *ns;
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > @@ -1540,11 +1541,16 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
> >          return nvme_rpt_empty_id_struct(n, req);
> >      }
> >  
> > +    if (only_active && !ns->params.attached) {
> > +        return nvme_rpt_empty_id_struct(n, req);
> > +    }
> > +
> >      return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
> >                      DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > -static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
> > +static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
> > +                                     bool only_active)
> >  {
> >      NvmeNamespace *ns;
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > @@ -1561,6 +1567,10 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
> >          return nvme_rpt_empty_id_struct(n, req);
> >      }
> >  
> > +    if (only_active && !ns->params.attached) {
> > +        return nvme_rpt_empty_id_struct(n, req);
> > +    }
> > +
> >      if (c->csi == NVME_CSI_NVM) {
> >          return nvme_rpt_empty_id_struct(n, req);
> >      }
> > @@ -1568,7 +1578,8 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
> >      return NVME_INVALID_FIELD | NVME_DNR;
> >  }
> >  
> > -static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
> > +static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req,
> > +                                     bool only_active)
> >  {
> >      NvmeNamespace *ns;
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > @@ -1598,6 +1609,9 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
> >          if (ns->params.nsid < min_nsid) {
> >              continue;
> >          }
> > +        if (only_active && !ns->params.attached) {
> > +            continue;
> > +        }
> >          list_ptr[j++] = cpu_to_le32(ns->params.nsid);
> >          if (j == data_len / sizeof(uint32_t)) {
> >              break;
> > @@ -1607,7 +1621,8 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
> >      return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > -static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
> > +static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
> > +                                         bool only_active)
> >  {
> >      NvmeNamespace *ns;
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > @@ -1631,6 +1646,9 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
> >          if (ns->params.nsid < min_nsid) {
> >              continue;
> >          }
> > +        if (only_active && !ns->params.attached) {
> > +            continue;
> > +        }
> >          list_ptr[j++] = cpu_to_le32(ns->params.nsid);
> >          if (j == data_len / sizeof(uint32_t)) {
> >              break;
> > @@ -1700,17 +1718,25 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
> >  
> >      switch (le32_to_cpu(c->cns)) {
> >      case NVME_ID_CNS_NS:
> > -        return nvme_identify_ns(n, req);
> > +        return nvme_identify_ns(n, req, true);
> >      case NVME_ID_CNS_CS_NS:
> > -        return nvme_identify_ns_csi(n, req);
> > +        return nvme_identify_ns_csi(n, req, true);
> > +    case NVME_ID_CNS_NS_PRESENT:
> > +        return nvme_identify_ns(n, req, false);
> > +    case NVME_ID_CNS_CS_NS_PRESENT:
> > +        return nvme_identify_ns_csi(n, req, false);
> >      case NVME_ID_CNS_CTRL:
> >          return nvme_identify_ctrl(n, req);
> >      case NVME_ID_CNS_CS_CTRL:
> >          return nvme_identify_ctrl_csi(n, req);
> >      case NVME_ID_CNS_NS_ACTIVE_LIST:
> > -        return nvme_identify_nslist(n, req);
> > +        return nvme_identify_nslist(n, req, true);
> >      case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
> > -        return nvme_identify_nslist_csi(n, req);
> > +        return nvme_identify_nslist_csi(n, req, true);
> > +    case NVME_ID_CNS_NS_PRESENT_LIST:
> > +        return nvme_identify_nslist(n, req, false);
> > +    case NVME_ID_CNS_CS_NS_PRESENT_LIST:
> > +        return nvme_identify_nslist_csi(n, req, false);
> >      case NVME_ID_CNS_NS_DESCR_LIST:
> >          return nvme_identify_ns_descr_list(n, req);
> >      case NVME_ID_CNS_IO_COMMAND_SET:
> > @@ -2188,8 +2214,10 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
> >  
> >  static int nvme_start_ctrl(NvmeCtrl *n)
> >  {
> > +    NvmeNamespace *ns;
> >      uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
> >      uint32_t page_size = 1 << page_bits;
> > +    int i;
> >  
> >      if (unlikely(n->cq[0])) {
> >          trace_pci_nvme_err_startfail_cq();
> > @@ -2276,6 +2304,22 @@ static int nvme_start_ctrl(NvmeCtrl *n)
> >      nvme_init_sq(&n->admin_sq, n, n->bar.asq, 0, 0,
> >                   NVME_AQA_ASQS(n->bar.aqa) + 1);
> >  
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        ns = nvme_ns(n, i);
> > +        if (!ns) {
> > +            continue;
> > +        }
> > +        ns->params.attached = false;
> > +        switch (ns->params.csi) {
> > +        case NVME_CSI_NVM:
> > +            if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
> > +                NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> > +                ns->params.attached = true;
> > +            }
> > +            break;
> > +        }
> > +    }
> > +
> 
> Considering that the controller doesn't attach/detach
> namespaces belonging to command sets that it doesn't
> support, I think a nicer way is to remove this for-loop,
> and instead, in nvme_ns_setup() or nvme_ns_init(),
> always set attached = true. (Since we currently don't
> support namespace attachment command).
> 
> The person that implements the last piece of namespace
> management and namespace attachment will have to deal
> with reading "attached" from some kind of persistent state


I did some spec reading on this topic and it seems that
this logic is necessary precisely because there is no
attach/detach command available. Such a command would
prevent attachment of a zoned namespace if CC.CSS is
NVM_ONLY, right? But since we have a static config, we
need to do this IMO.

Also, 6.1.5 of the spec says that any operation that uses
an inactive NSID shall fail with Invalid Field. I am
adding a few bits to fail all i/o commands and set/get
features attempted on inactive namespaces.
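
Roughly along these lines in nvme_io_cmd() (a sketch, the exact placement
and checks may differ in the final patch):

    ns = nvme_ns(n, nsid);
    if (unlikely(!ns)) {
        return NVME_INVALID_FIELD | NVME_DNR;
    }

    /* NVMe spec 6.1.5: a command that uses an inactive NSID shall
     * fail with Invalid Field in Command */
    if (!ns->params.attached) {
        return NVME_INVALID_FIELD | NVME_DNR;
    }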

> and setting it accordingly.
> 
> >      nvme_set_timestamp(n, 0ULL);
> >  
> >      QTAILQ_INIT(&n->aer_queue);
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 4587311783..b182fe40b2 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -804,14 +804,18 @@ typedef struct QEMU_PACKED NvmePSD {
> >  #define NVME_IDENTIFY_DATA_SIZE 4096
> >  
> >  enum NvmeIdCns {
> > -    NVME_ID_CNS_NS                = 0x00,
> > -    NVME_ID_CNS_CTRL              = 0x01,
> > -    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x02,
> > -    NVME_ID_CNS_NS_DESCR_LIST     = 0x03,
> > -    NVME_ID_CNS_CS_NS             = 0x05,
> > -    NVME_ID_CNS_CS_CTRL           = 0x06,
> > -    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
> > -    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
> > +    NVME_ID_CNS_NS                    = 0x00,
> > +    NVME_ID_CNS_CTRL                  = 0x01,
> > +    NVME_ID_CNS_NS_ACTIVE_LIST        = 0x02,
> > +    NVME_ID_CNS_NS_DESCR_LIST         = 0x03,
> > +    NVME_ID_CNS_CS_NS                 = 0x05,
> > +    NVME_ID_CNS_CS_CTRL               = 0x06,
> > +    NVME_ID_CNS_CS_NS_ACTIVE_LIST     = 0x07,
> > +    NVME_ID_CNS_NS_PRESENT_LIST       = 0x10,
> > +    NVME_ID_CNS_NS_PRESENT            = 0x11,
> > +    NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
> > +    NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
> > +    NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
> >  };
> >  
> >  typedef struct QEMU_PACKED NvmeIdCtrl {
> > -- 
> > 2.21.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-30 14:50   ` Niklas Cassel
  2020-09-30 18:23     ` Klaus Jensen
@ 2020-10-04 23:57     ` Dmitry Fomichev
  2020-10-05 11:41       ` Niklas Cassel
  1 sibling, 1 reply; 46+ messages in thread
From: Dmitry Fomichev @ 2020-10-04 23:57 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: fam, kwolf, Damien Le Moal, qemu-block, k.jensen, qemu-devel,
	mlevitsk, Alistair Francis, kbusch, philmd, Matias Bjorling

On Wed, 2020-09-30 at 14:50 +0000, Niklas Cassel wrote:
> On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> > The emulation code has been changed to advertise NVM Command Set when
> > "zoned" device property is not set (default) and Zoned Namespace
> > Command Set otherwise.
> > 
> > Handlers for three new NVMe commands introduced in Zoned Namespace
> > Command Set specification are added, namely for Zone Management
> > Receive, Zone Management Send and Zone Append.
> > 
> > Device initialization code has been extended to create a proper
> > configuration for zoned operation using device properties.
> > 
> > Read/Write command handler is modified to only allow writes at the
> > write pointer if the namespace is zoned. For Zone Append command,
> > writes implicitly happen at the write pointer and the starting write
> > pointer value is returned as the result of the command. Write Zeroes
> > handler is modified to add zoned checks that are identical to those
> > done as a part of Write flow.
> > 
> > The code to support for Zone Descriptor Extensions is not included in
> > this commit and ZDES 0 is always reported. A later commit in this
> > series will add ZDE support.
> > 
> > This commit doesn't yet include checks for active and open zone
> > limits. It is assumed that there are no limits on either active or
> > open zones.
> > 
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> > Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> > Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> > Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> > Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  block/nvme.c         |   2 +-
> >  hw/block/nvme-ns.c   | 185 ++++++++-
> >  hw/block/nvme-ns.h   |   6 +-
> >  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
> >  include/block/nvme.h |   6 +-
> >  5 files changed, 1033 insertions(+), 38 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 05485fdd11..7a513c9a17 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -333,7 +333,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
> >  {
> >      uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
> >      if (status) {
> > -        trace_nvme_error(le32_to_cpu(c->result),
> > +        trace_nvme_error(le32_to_cpu(c->result32),
> >                           le16_to_cpu(c->sq_head),
> >                           le16_to_cpu(c->sq_id),
> >                           le16_to_cpu(c->cid),
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 31b7f986c3..6d9dc9205b 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -33,14 +33,14 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >      NvmeIdNs *id_ns = &ns->id_ns;
> >  
> >      if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
> > -        ns->id_ns.dlfeat = 0x9;
> > +        ns->id_ns.dlfeat = 0x8;
> 
> You seem to change something that is NVM namespace specific here, why?
> If it is indeed needed, I assume that this change should be in a separate
> patch.
> 

OK, this needs to be done in nvme_zoned_init_ns(). Thanks

> >      }
> >  
> >      id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> >      uint16_t status;

<snip>

> >  
> > +    header->nr_zones = cpu_to_le64(nr_zones);
> > +
> > +    ret = nvme_dma(n, (uint8_t *)buf, len, DMA_DIRECTION_FROM_DEVICE, req);
> > +
> > +    g_free(buf);
> > +
> > +    return ret;
> > +}
> > +
> >  static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      uint32_t nsid = le32_to_cpu(req->cmd.nsid);
> > @@ -1073,9 +1801,15 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
> 
> While you did make sure that we don't expose zone mgmt send/recv/zone append
> in the cmd_effects log when CC.CSS != CSS_CSI, we also need to make sure we
> return Invalid Command Opcode for any of those three commands, if a user tries
> to use them anyway (while CC.CSS != CSI).
> 

Yes, good catch. Only the commands that are marked as supported in Commands
Supported and Effects log page are allowed to be executed. I am making some
changes to ensure this.
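
Something like this at the top of nvme_io_cmd() should cover the zoned
opcodes (just a sketch):

    switch (req->cmd.opcode) {
    case NVME_CMD_ZONE_MGMT_SEND:
    case NVME_CMD_ZONE_MGMT_RECV:
    case NVME_CMD_ZONE_APPEND:
        /* these opcodes are only marked as supported in the CSE log
         * when all command sets are enabled, i.e. CC.CSS == CSS_CSI */
        if (NVME_CC_CSS(n->bar.cc) != CSS_CSI) {
            return NVME_INVALID_OPCODE | NVME_DNR;
        }
        break;
    }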

> >          return nvme_flush(n, req);
> >      case NVME_CMD_WRITE_ZEROES:
> >          return nvme_write_zeroes(n, req);
> > +    case NVME_CMD_ZONE_APPEND:
> > +        return nvme_rw(n, req, true);
> >      case NVME_CMD_WRITE:
> >      case NVME_CMD_READ:
> > -        return nvme_rw(n, req);
> > +        return nvme_rw(n, req, false);
> > +    case NVME_CMD_ZONE_MGMT_SEND:
> > +        return nvme_zone_mgmt_send(n, req);
> > +    case NVME_CMD_ZONE_MGMT_RECV:
> > +        return nvme_zone_mgmt_recv(n, req);
> >      default:
> >          trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
> >          return NVME_INVALID_OPCODE | NVME_DNR;
> > @@ -1301,7 +2035,7 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
> >                      DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > -static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
> > +static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t csi, uint32_t buf_len,
> >                                   uint64_t off, NvmeRequest *req)
> >  {
> >      NvmeEffectsLog cmd_eff_log = {};
> > @@ -1326,11 +2060,20 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
> >      acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
> >      acs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
> >  
> > -    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> > -    iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
> > -                                  NVME_CMD_EFFECTS_LBCC;
> > -    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> > -    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> > +    if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
> > +        iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> > +        iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
> > +                                      NVME_CMD_EFFECTS_LBCC;
> > +        iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> > +        iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> > +    }
> > +
> > +    if (csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> > +        iocs[NVME_CMD_ZONE_APPEND] = NVME_CMD_EFFECTS_CSUPP |
> > +                                     NVME_CMD_EFFECTS_LBCC;
> > +        iocs[NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFFECTS_CSUPP;
> > +        iocs[NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFFECTS_CSUPP;
> > +    }

I think the above needs to be changed to only allow admin commands if this
log request arrives with an unrecognized CSI. Some command sets may not
support some, or any, of the NVM i/o commands.
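
I.e., something like this (a sketch; the two helpers are hypothetical
and would simply contain the iocs[] assignments from the hunk above):

    if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
        switch (csi) {
        case NVME_CSI_NVM:
            nvme_set_nvm_cmd_effects(iocs);
            break;
        case NVME_CSI_ZONED:
            nvme_set_nvm_cmd_effects(iocs);
            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
                nvme_set_zoned_cmd_effects(iocs);
            }
            break;
        default:
            /* unrecognized CSI - report admin commands only */
            break;
        }
    }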

> >  
> >      trans_len = MIN(sizeof(cmd_eff_log) - off, buf_len);
> >  
> > @@ -1349,6 +2092,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
> >      uint8_t  lid = dw10 & 0xff;
> >      uint8_t  lsp = (dw10 >> 8) & 0xf;
> >      uint8_t  rae = (dw10 >> 15) & 0x1;
> > +    uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
> >      uint32_t numdl, numdu;
> >      uint64_t off, lpol, lpou;
> >      size_t   len;
> > @@ -1382,7 +2126,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
> >      case NVME_LOG_FW_SLOT_INFO:
> >          return nvme_fw_log_info(n, len, off, req);
> >      case NVME_LOG_CMD_EFFECTS:
> > -        return nvme_cmd_effects(n, len, off, req);
> > +        return nvme_cmd_effects(n, csi, len, off, req);
> >      default:
> >          trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1502,6 +2246,16 @@ static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
> >      return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > +static inline bool nvme_csi_has_nvm_support(NvmeNamespace *ns)
> > +{
> > +    switch (ns->csi) {
> > +    case NVME_CSI_NVM:
> > +    case NVME_CSI_ZONED:
> > +        return true;
> > +    }
> > +    return false;
> > +}
> > +
> >  static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      trace_pci_nvme_identify_ctrl();
> > @@ -1513,11 +2267,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
> >  static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > +    NvmeIdCtrlZoned id = {};
> >  
> >      trace_pci_nvme_identify_ctrl_csi(c->csi);
> >  
> >      if (c->csi == NVME_CSI_NVM) {
> >          return nvme_rpt_empty_id_struct(n, req);
> > +    } else if (c->csi == NVME_CSI_ZONED) {
> > +        id.zasl = n->zasl;
> > +        return nvme_dma(n, (uint8_t *)&id, sizeof(id),
> > +                        DMA_DIRECTION_FROM_DEVICE, req);
> 
> Please read my comment on nvme_identify_nslist_csi() before reading
> this comment.
> 
> At least for this function, the specification is clear:
> 
> "If the host requests a data structure for an I/O Command Set that the
> controller does not support, the controller shall abort the command with
> a status of Invalid Field in Command."
> 
> If the controller supports the I/O command set == if the Command Set bit
> is set in the data struct returned by the nvme_identify_cmd_set(),
> so here we should do something like:
> 
> } else if (c->csi == NVME_CSI_ZONED && ctrl_has_zns_namespaces()) {
> 	...
> }
> 

With this commit, the controller supports the ZNS command set regardless of
the number of attached ZNS namespaces. It could be zero, but the controller
still supports it. I think it would be better not to change the behavior
of this command to depend on whether there are any ZNS namespaces added
or not.

> >      }
> >  
> >      return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1545,8 +2304,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
> >          return nvme_rpt_empty_id_struct(n, req);
> >      }
> >  
> > -    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
> > -                    DMA_DIRECTION_FROM_DEVICE, req);
> > +    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
> > +        return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
> > +                        DMA_DIRECTION_FROM_DEVICE, req);
> > +    }
> > +
> > +    return NVME_INVALID_CMD_SET | NVME_DNR;
> >  }
> >  
> >  static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
> > @@ -1571,8 +2334,11 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
> >          return nvme_rpt_empty_id_struct(n, req);
> >      }
> >  
> > -    if (c->csi == NVME_CSI_NVM) {
> > +    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
> >          return nvme_rpt_empty_id_struct(n, req);
> > +    } else if (c->csi == NVME_CSI_ZONED && ns->csi == NVME_CSI_ZONED) {
> > +        return nvme_dma(n, (uint8_t *)ns->id_ns_zoned, sizeof(NvmeIdNsZoned),
> > +                        DMA_DIRECTION_FROM_DEVICE, req);
> >      }
> >  
> >      return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1634,7 +2400,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
> >  
> >      trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
> >  
> > -    if (c->csi != NVME_CSI_NVM) {
> > +    if (c->csi != NVME_CSI_NVM && c->csi != NVME_CSI_ZONED) {
> 
> When reading the specification for CNS 07h, I think that it is not clear
> how this should behave...
> 
> I'm thinking of the case when c->csi == NVME_CSI_ZONED
> when our QEMU model only has NVMe namespaces.
> 

I think simply returning an empty list is fine in this case. The loop
that follows will not add any nsids to the list and this is what the
host is going to receive.

> Either we should return an empty list (1),
> or we should return Invalid Field in Command (2).
> 
> If we decide to go with (2),
> then we should probably take the code you have written in nvme_identify_cmd_set():
> 
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (ns && ns->params.zoned) {
> +            NVME_SET_CSI(*list, NVME_CSI_ZONED);
> +            break;
> +        }
> +    }
> 
> And move it into a ctrl_has_zns_namespaces() helper function,
> and then do something like:
> if (!(c->csi == NVME_CSI_NVM || (ctrl_has_zns_namespaces() && c->csi == NVME_CSI_ZONED)))
> 	return NVME_INVALID_FIELD | NVME_DNR;
> 
> 


> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> >  
> > @@ -1643,7 +2409,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
> >          if (!ns) {
> >              continue;
> >          }
> > -        if (ns->params.nsid < min_nsid) {
> > +        if (ns->params.nsid < min_nsid || c->csi != ns->csi) {
> >              continue;
> >          }
> >          if (only_active && !ns->params.attached) {
> > @@ -1696,19 +2462,29 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
> >      desc->nidt = NVME_NIDT_CSI;
> >      desc->nidl = NVME_NIDL_CSI;
> >      list_ptr += sizeof(*desc);
> > -    *(uint8_t *)list_ptr = NVME_CSI_NVM;
> > +    *(uint8_t *)list_ptr = ns->csi;
> >  
> >      return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> >  static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
> >  {
> > +    NvmeNamespace *ns;
> >      uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> >      static const int data_len = sizeof(list);
> > +    int i;
> >  
> >      trace_pci_nvme_identify_cmd_set();
> >  
> >      NVME_SET_CSI(*list, NVME_CSI_NVM);
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        ns = nvme_ns(n, i);
> > +        if (ns && ns->params.zoned) {
> > +            NVME_SET_CSI(*list, NVME_CSI_ZONED);
> > +            break;
> > +        }
> > +    }
> > +
> >      return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > @@ -1751,7 +2527,7 @@ static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      uint16_t sqid = le32_to_cpu(req->cmd.cdw10) & 0xffff;
> >  
> > -    req->cqe.result = 1;
> > +    req->cqe.result32 = 1;
> >      if (nvme_check_sqid(n, sqid)) {
> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> > @@ -1932,7 +2708,7 @@ defaults:
> >      }
> >  
> >  out:
> > -    req->cqe.result = cpu_to_le32(result);
> > +    req->cqe.result32 = cpu_to_le32(result);
> >      return NVME_SUCCESS;
> >  }
> >  
> > @@ -2057,8 +2833,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
> >                                      ((dw11 >> 16) & 0xFFFF) + 1,
> >                                      n->params.max_ioqpairs,
> >                                      n->params.max_ioqpairs);
> > -        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
> > -                                      ((n->params.max_ioqpairs - 1) << 16));
> > +        req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
> > +                                        ((n->params.max_ioqpairs - 1) << 16));
> >          break;
> >      case NVME_ASYNCHRONOUS_EVENT_CONF:
> >          n->features.async_config = dw11;
> > @@ -2310,16 +3086,28 @@ static int nvme_start_ctrl(NvmeCtrl *n)
> >              continue;
> >          }
> >          ns->params.attached = false;
> > -        switch (ns->params.csi) {
> > +        switch (ns->csi) {
> >          case NVME_CSI_NVM:
> >              if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
> >                  NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> >                  ns->params.attached = true;
> >              }
> >              break;
> > +        case NVME_CSI_ZONED:
> > +            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> > +                ns->params.attached = true;
> > +            }
> > +            break;
> >          }
> >      }
> 
> Like I wrote in my review comment in the patch that added support for the new
> allocated CNS values, I prefer if we remove this for-loop completely, and
> simply set attached = true in nvme_ns_setup()/nvme_ns_init() instead.
> 
> (I was considering if we should set attach = true in nvme_zoned_init_ns(),
> but because nvme_ns_setup()/nvme_ns_init() is called for all namespaces,
> including ZNS namespaces, I don't think that any additional code in
> nvme_zoned_init_ns() is warranted.)

I think the CC.CSS value is not available during namespace setup, and if
we assign the active flag in nvme_zoned_ns_setup(), zoned namespaces may
end up being active even if the NVM Only command set is selected. So
keeping this loop seems like a good idea.

> 
> >  
> > +    if (!n->zasl_bs) {
> > +        assert(n->params.mdts);
> > +        n->zasl = n->params.mdts;
> > +    } else {
> > +        n->zasl = 31 - clz32(n->zasl_bs / n->page_size);
> > +    }
> > +
> >      nvme_set_timestamp(n, 0ULL);
> >  
> >      QTAILQ_INIT(&n->aer_queue);
> > @@ -2382,10 +3170,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
> >                  case CSS_NVM_ONLY:
> >                      trace_pci_nvme_css_nvm_cset_selected_by_host(data &
> >                                                                   0xffffffff);
> > -                    break;
> > +                break;
> >                  case CSS_CSI:
> >                      NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
> > -                    trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
> > +                    trace_pci_nvme_css_all_csets_sel_by_host(data &
> > +                                                             0xffffffff);
> >                      break;
> >                  case CSS_ADMIN_ONLY:
> >                      break;
> > @@ -2780,6 +3569,12 @@ static void nvme_init_state(NvmeCtrl *n)
> >      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> >      n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> >      n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
> > +
> > +    if (!n->params.zasl_kb) {
> > +        n->zasl_bs = n->params.mdts ? 0 : NVME_DEFAULT_MAX_ZA_SIZE * KiB;
> > +    } else {
> > +        n->zasl_bs = n->params.zasl_kb * KiB;
> > +    }
> >  }
> >  
> >  int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > @@ -2985,8 +3780,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
> >      NVME_CAP_SET_CQR(n->bar.cap, 1);
> >      NVME_CAP_SET_TO(n->bar.cap, 0xf);
> >      /*
> > -     * The device now always supports NS Types, but all commands
> > -     * that support CSI field will only handle NVM Command Set.
> > +     * The device now always supports NS Types, even when "zoned" property
> > +     * is set to zero. If this is the case, all commands that support CSI
> > +     * field only handle NVM Command Set.
> >       */
> >      NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
> >      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
> > @@ -3033,9 +3829,21 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >  static void nvme_exit(PCIDevice *pci_dev)
> >  {
> >      NvmeCtrl *n = NVME(pci_dev);
> > +    NvmeNamespace *ns;
> > +    int i;
> >  
> >      nvme_clear_ctrl(n);
> > +
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        ns = nvme_ns(n, i);
> > +        if (!ns) {
> > +            continue;
> > +        }
> > +
> > +        nvme_ns_cleanup(ns);
> > +    }
> >      g_free(n->namespaces);
> > +
> >      g_free(n->cq);
> >      g_free(n->sq);
> >      g_free(n->aer_reqs);
> > @@ -3063,6 +3871,8 @@ static Property nvme_props[] = {
> >      DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64),
> >      DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
> >      DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
> > +    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
> > +    DEFINE_PROP_UINT32("zone_append_size_limit", NvmeCtrl, params.zasl_kb, 0),
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> >  
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index a7126e123f..628c665728 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -651,8 +651,10 @@ typedef struct QEMU_PACKED NvmeAerResult {
> >  } NvmeAerResult;
> >  
> >  typedef struct QEMU_PACKED NvmeCqe {
> > -    uint32_t    result;
> > -    uint32_t    rsvd;
> > +    union {
> > +        uint64_t     result64;
> > +        uint32_t     result32;
> > +    };
> >      uint16_t    sq_head;
> >      uint16_t    sq_id;
> >      uint16_t    cid;
> > -- 
> > 2.21.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-09-28 10:42   ` Klaus Jensen
  2020-09-30  5:20     ` Klaus Jensen
@ 2020-10-05  0:53     ` Dmitry Fomichev
  1 sibling, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-10-05  0:53 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Monday, September 28, 2020 6:43 AM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace
> Command Set
> 
> On Sep 28 11:35, Dmitry Fomichev wrote:
> > The emulation code has been changed to advertise NVM Command Set when
> > "zoned" device property is not set (default) and Zoned Namespace
> > Command Set otherwise.
> >
> > Handlers for three new NVMe commands introduced in Zoned Namespace
> > Command Set specification are added, namely for Zone Management
> > Receive, Zone Management Send and Zone Append.
> >
> > Device initialization code has been extended to create a proper
> > configuration for zoned operation using device properties.
> >
> > Read/Write command handler is modified to only allow writes at the
> > write pointer if the namespace is zoned. For Zone Append command,
> > writes implicitly happen at the write pointer and the starting write
> > pointer value is returned as the result of the command. Write Zeroes
> > handler is modified to add zoned checks that are identical to those
> > done as a part of Write flow.
> >
> > The code to support for Zone Descriptor Extensions is not included in
> > this commit and ZDES 0 is always reported. A later commit in this
> > series will add ZDE support.
> >
> > This commit doesn't yet include checks for active and open zone
> > limits. It is assumed that there are no limits on either active or
> > open zones.
> >
> 
> I think the fill_pattern feature stands separate, so it would be nice to
> extract that to a patch on its own.
> 
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> > Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> > Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> > Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> > Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  block/nvme.c         |   2 +-
> >  hw/block/nvme-ns.c   | 185 ++++++++-
> >  hw/block/nvme-ns.h   |   6 +-
> >  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
> >  include/block/nvme.h |   6 +-
> >  5 files changed, 1033 insertions(+), 38 deletions(-)
> >
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index 04172f083e..daa13546c4 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -38,7 +38,6 @@ typedef struct NvmeZoneList {
> >
> >  typedef struct NvmeNamespaceParams {
> >      uint32_t nsid;
> > -    uint8_t  csi;
> >      bool     attached;
> >      QemuUUID uuid;
> >
> > @@ -52,6 +51,7 @@ typedef struct NvmeNamespace {
> >      DeviceState  parent_obj;
> >      BlockConf    blkconf;
> >      int32_t      bootindex;
> > +    uint8_t      csi;
> >      int64_t      size;
> >      NvmeIdNs     id_ns;
> 
> This should be squashed into the namespace types patch.
> 

Yes, thanks.

> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 63ad03d6d6..38e25a4d1f 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -54,6 +54,7 @@
> >  #include "qemu/osdep.h"
> >  #include "qemu/units.h"
> >  #include "qemu/error-report.h"
> > +#include "crypto/random.h"
> 
> I think this is not used until the offline/read-only zones injection
> patch, right?
> 

Indeed, will move.

> > +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> > +                                      bool failed)
> > +{
> > +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> > +    NvmeZone *zone;
> > +    uint64_t slba, start_wp = req->cqe.result64;
> > +    uint32_t nlb, zone_idx;
> > +    uint8_t zs;
> > +
> > +    if (rw->opcode != NVME_CMD_WRITE &&
> > +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> > +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> > +        return false;
> > +    }
> > +
> > +    slba = le64_to_cpu(rw->slba);
> > +    nlb = le16_to_cpu(rw->nlb) + 1;
> > +    zone_idx = nvme_zone_idx(ns, slba);
> > +    assert(zone_idx < ns->num_zones);
> > +    zone = &ns->zone_array[zone_idx];
> > +
> > +    if (!failed && zone->w_ptr < start_wp + nlb) {
> > +        /*
> > +         * A preceding queued write to the zone has failed,
> > +         * now this write is not at the WP, fail it too.
> > +         */
> > +        failed = true;
> > +    }
> > +
> > +    if (failed) {
> > +        if (zone->w_ptr > start_wp) {
> > +            zone->w_ptr = start_wp;
> > +        }
> 
> It is possible (though unlikely) that you already posted the CQE for the
> write that moved the WP to w_ptr - and now you are reverting it.  This
> looks like a recipe for data corruption to me.
> 
> Take this example. I use append, because if you have multiple regular
> writes in queue you're screwed anyway.
> 
>   w_ptr = 0, d.wp = 0
>   append 1 lba  -> w_ptr = 1, start_wp = 0, issues aio A
>   append 2 lbas -> w_ptr = 3, start_wp = 1, issues aio B
> 
>   aio B success -> d.wp = 2 (since you are adding nlb),
> 
> Now, I totally do the same. Even though the zone descriptor write
> pointer gets "out of sync", it will be reconciled in the absence of
> failures and it's fair to define that the host cannot expect a consistent
> view of the write pointer without quiescing I/O.
> 
> The problem is if a write then fails:
> 
>   aio A fails   -> w_ptr > start_wp (3 > 1), so you revert to w_ptr = 1
> 
> That looks bad to me. I don't think this is ever reconciled? If another
> append then comes in:
> 
>   append 1 lba -> w_ptr = 2, start_wp = 1, issues aio C and overwrites
>                                            the second append from before.
>   aio C success -> d.wp = 3 (but it should be 2)
> 

Right, need to sync w_ptr and d.wp here. Good find!
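
I.e. in the failure path of nvme_finalize_zoned_write(), something like
(sketch):

    if (failed) {
        if (zone->w_ptr > start_wp) {
            zone->w_ptr = start_wp;
        }
        /* bring the zone descriptor write pointer back in sync with
         * w_ptr so that the reverted LBAs can be written again */
        zone->d.wp = zone->w_ptr;
        ...
    }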

> > @@ -1513,11 +2267,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
> >  static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > +    NvmeIdCtrlZoned id = {};
> >
> >      trace_pci_nvme_identify_ctrl_csi(c->csi);
> >
> >      if (c->csi == NVME_CSI_NVM) {
> >          return nvme_rpt_empty_id_struct(n, req);
> > +    } else if (c->csi == NVME_CSI_ZONED) {
> > +        id.zasl = n->zasl;
> 
> I don't think it should overwrite the zasl value specified by the user.
> If the user specified 0, then it should return 0 for zasl here.

Not sure if I get this. The value of n->zasl is calculated based on the
setting given by the user (or by default).

> 
> > @@ -2310,16 +3086,28 @@ static int nvme_start_ctrl(NvmeCtrl *n)
> >              continue;
> >          }
> >          ns->params.attached = false;
> > -        switch (ns->params.csi) {
> > +        switch (ns->csi) {
> >          case NVME_CSI_NVM:
> >              if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
> >                  NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> >                  ns->params.attached = true;
> >              }
> >              break;
> > +        case NVME_CSI_ZONED:
> > +            if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> > +                ns->params.attached = true;
> > +            }
> > +            break;
> >          }
> >      }
> >
> > +    if (!n->zasl_bs) {
> > +        assert(n->params.mdts);
> 
> A value of 0 for MDTS is perfectly valid.

Ok, need to remove this assert.
 
> 
> > @@ -2382,10 +3170,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
> >                  case CSS_NVM_ONLY:
> >                      trace_pci_nvme_css_nvm_cset_selected_by_host(data &
> >                                                                   0xffffffff);
> > -                    break;
> > +                break;
> 
> Spurious misaligned break here.

Nice catch! It's misaligned by 4, so checkpatch doesn't complain about it :)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces
  2020-10-04 23:54     ` Dmitry Fomichev
@ 2020-10-05 11:26       ` Niklas Cassel
  0 siblings, 0 replies; 46+ messages in thread
From: Niklas Cassel @ 2020-10-05 11:26 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: fam, kwolf, Damien Le Moal, qemu-block, k.jensen, qemu-devel,
	mlevitsk, Alistair Francis, kbusch, philmd, Matias Bjorling

On Sun, Oct 04, 2020 at 11:54:13PM +0000, Dmitry Fomichev wrote:
> On Wed, 2020-09-30 at 13:50 +0000, Niklas Cassel wrote:
> > On Mon, Sep 28, 2020 at 11:35:20AM +0900, Dmitry Fomichev wrote:
> > > From: Niklas Cassel <niklas.cassel@wdc.com>
> > > 
> > > In NVMe, a namespace is active if it exists and is attached to the
> > > controller.
> > > 
> > > CAP.CSS (together with the I/O Command Set data structure) defines what
> > > command sets are supported by the controller.
> > > 
> > > CC.CSS (together with Set Profile) can be set to enable a subset of the
> > > available command sets. The namespaces belonging to a disabled command set
> > > will not be able to attach to the controller, and will thus be inactive.
> > > 
> > > E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> > > marked as inactive.
> > > 
> > > The identify namespace, the identify namespace CSI specific, and the namespace
> > > list commands have two different versions, one that only shows active
> > > namespaces, and the other version that shows existing namespaces, regardless
> > > of whether the namespace is attached or not.
> > > 
> > > Add an attached member to struct NvmeNamespace, and implement the missing CNS
> > > commands.
> > > 
> > > The added functionality will also simplify the implementation of namespace
> > > management in the future, since namespace management can also attach and
> > > detach namespaces.
> > 
> > Following my previous discussion with Klaus,
> > I think we need to rewrite this commit message completely:
> > 
> > Subject: hw/block/nvme: Add support for allocated CNS command variants
> > 
> > Many CNS commands have "allocated" command variants.
> > These include a namespace as long as it is allocated
> > (i.e. a namespace is included regardless of whether it is active
> > (attached) or not).
> > 
> > While these commands are optional (they are mandatory for controllers
> > supporting the namespace attachment command), our QEMU implementation
> > is more complete by actually providing support for these CNS values.
> > 
> > However, since our QEMU model currently does not support the namespace
> > attachment command, these new allocated CNS commands will return the same
> > result as the active CNS command variants.
> > 
> > In NVMe, a namespace is active if it exists and is attached to the
> > controller.
> > 
> > CAP.CSS (together with the I/O Command Set data structure) defines what
> > command sets are supported by the controller.
> > 
> > CC.CSS (together with Set Profile) can be set to enable a subset of the
> > available command sets.
> > 
> > Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
> > will still be attached (and thus marked as active).
> > Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
> > will still be attached (and thus marked as active).
> > 
> > However, any operation from a disabled command set will result in an
> > Invalid Command Opcode.
> > 
> > Add an attached struct member for struct NvmeNamespace,
> > so that we lay the foundation for namespace attachment
> > support. Also implement logic in the new CNS values to
> > include/exclude namespaces based on this new struct member.
> > The only thing missing is hooking up the actual Namespace Attachment
> > command opcode, which allows a user to toggle the attached
> > variable per namespace. The reason for not hooking up this
> > command completely is that the NVMe specification
> > requires that the namespace management command is supported
> > if the namespace attachment command is supported.
> > 
> 

(snip)

> > > @@ -2276,6 +2304,22 @@ static int nvme_start_ctrl(NvmeCtrl *n)
> > >      nvme_init_sq(&n->admin_sq, n, n->bar.asq, 0, 0,
> > >                   NVME_AQA_ASQS(n->bar.aqa) + 1);
> > >  
> > > +    for (i = 1; i <= n->num_namespaces; i++) {
> > > +        ns = nvme_ns(n, i);
> > > +        if (!ns) {
> > > +            continue;
> > > +        }
> > > +        ns->params.attached = false;
> > > +        switch (ns->params.csi) {
> > > +        case NVME_CSI_NVM:
> > > +            if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
> > > +                NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> > > +                ns->params.attached = true;
> > > +            }
> > > +            break;
> > > +        }
> > > +    }
> > > +
> > 
> > Considering that the controller doesn't attach/detach
> > namespaces belonging to command sets that it doesn't
> > support, I think a nicer way is to remove this for-loop,
> > and instead, in nvme_ns_setup() or nvme_ns_init(),
> > always set attached = true. (Since we currently don't
> > support namespace attachment command).
> > 
> > The person that implements the last piece of namespace
> > management and namespace attachment will have to deal
> > with reading "attached" from some kind of persistent state
> 
> 
> I did some spec reading on this topic and it seems that
> this logic is necessary precisely because there is no
> attach/detach command available. Such a command would
> prevent attachment of a zoned namespace if CC.CSS is
> NVM_ONLY, right? But since we have a static config, we
> need to do this IMO.

As far as I understand the spec, an NVM Command Set namespace will be
attached to the controller (thus active), regardless of whether you start
the controller with CC.CSS = Admin only or CC.CSS = NVM.

(And as far as I understand, this doesn't depend on if the controller supports
the namespace attachment command or not.)

See the register description for CC.CSS:
"If bit 44 is set to ‘1’ in the Command Sets Supported (CSS) field, then the value
111b indicates that only the Admin Command Set is supported and that no I/O
Command Set or I/O Command Set Specific Admin commands are supported.
When only the Admin Command Set is supported, any command submitted on
an I/O Submission Queue and any I/O Command Set Specific Admin command
submitted on the Admin Submission Queue is completed with status Invalid
Command Opcode."

So I think that no matter what CC.CSS setting you have, no namespace
will ever be detached by the controller. It will still be attached,
but you will get Invalid Command Opcode if sending any command.

I assume that CC.CSS is way older than namespace management, so that is
probably why CC.CSS simply causes "Invalid Command Opcode" rather than
detaching namespaces.

> 
> Also, 6.1.5 of the spec says that any operation that uses
> an inactive NSID shall fail with Invalid Field. I am
> adding a few bits to fail all i/o commands and set/get
> features attempted on inactive namespaces.

Inactive NSID == a NSID that is not attached.
As far as I understand, the controller itself will never detach
a namespace. And since the QEMU model right now does not support
namespace management, neither will the user.
So I don't see that we will have any inactive namespace.

Therefore, I suggested that we remove this for-loop.
(Or drop this patch altogether, but I do think that
it provides value to have the additional CNS commands
implemented, even if they will return the same result
as the existing active CNS commands.)
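
Concretely, what I am suggesting would look roughly like this in
nvme_ns_init() (just a sketch):

    static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
    {
        ...
        /* no namespace attachment support yet, so all configured
         * namespaces are considered attached (and thus active) */
        ns->params.attached = true;
        ...
    }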


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-04 23:57     ` Dmitry Fomichev
@ 2020-10-05 11:41       ` Niklas Cassel
  2020-10-05 23:08         ` Dmitry Fomichev
  0 siblings, 1 reply; 46+ messages in thread
From: Niklas Cassel @ 2020-10-05 11:41 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: fam, kwolf, Damien Le Moal, qemu-block, k.jensen, qemu-devel,
	mlevitsk, Alistair Francis, kbusch, philmd, Matias Bjorling

On Sun, Oct 04, 2020 at 11:57:07PM +0000, Dmitry Fomichev wrote:
> On Wed, 2020-09-30 at 14:50 +0000, Niklas Cassel wrote:
> > On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> > > The emulation code has been changed to advertise NVM Command Set when
> > > "zoned" device property is not set (default) and Zoned Namespace
> > > Command Set otherwise.
> > > 
> > > Handlers for three new NVMe commands introduced in Zoned Namespace
> > > Command Set specification are added, namely for Zone Management
> > > Receive, Zone Management Send and Zone Append.
> > > 
> > > Device initialization code has been extended to create a proper
> > > configuration for zoned operation using device properties.
> > > 
> > > Read/Write command handler is modified to only allow writes at the
> > > write pointer if the namespace is zoned. For Zone Append command,
> > > writes implicitly happen at the write pointer and the starting write
> > > pointer value is returned as the result of the command. Write Zeroes
> > > handler is modified to add zoned checks that are identical to those
> > > done as a part of Write flow.
> > > 
> > > The code to support for Zone Descriptor Extensions is not included in
> > > this commit and ZDES 0 is always reported. A later commit in this
> > > series will add ZDE support.
> > > 
> > > This commit doesn't yet include checks for active and open zone
> > > limits. It is assumed that there are no limits on either active or
> > > open zones.
> > > 
> > > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > > Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> > > Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> > > Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> > > Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> > > Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> > > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > > Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> > > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > ---
> > >  block/nvme.c         |   2 +-
> > >  hw/block/nvme-ns.c   | 185 ++++++++-
> > >  hw/block/nvme-ns.h   |   6 +-
> > >  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
> > >  include/block/nvme.h |   6 +-
> > >  5 files changed, 1033 insertions(+), 38 deletions(-)
> > > 
> > > diff --git a/block/nvme.c b/block/nvme.c
> > > index 05485fdd11..7a513c9a17 100644
> > > --- a/block/nvme.c
> > > +++ b/block/nvme.c

(snip)

> > 
> > Please read my comment on nvme_identify_nslist_csi() before reading
> > this comment.
> > 
> > At least for this function, the specification is clear:
> > 
> > "If the host requests a data structure for an I/O Command Set that the
> > controller does not support, the controller shall abort the command with
> > a status of Invalid Field in Command."
> > 
> > If the controller supports the I/O command set == if the Command Set bit
> > is set in the data struct returned by the nvme_identify_cmd_set(),
> > so here we should do something like:
> > 
> > } else if (c->csi == NVME_CSI_ZONED && ctrl_has_zns_namespaces()) {
> > 	...
> > }
> > 
> 
> With this commit, the controller supports the ZNS command set regardless of
> the number of attached ZNS namespaces. It could be zero, but the controller
> still supports it. I think it would be better not to change the behavior
> of this command to depend on whether there are any ZNS namespaces added
> or not.

Ok, always having ZNS Command Set support, regardless of whether a user
defines a zoned namespace on the QEMU command line, does simplify things.

But then in nvme_identify_cmd_set(), you need to call
NVME_SET_CSI(*list, NVME_CSI_ZONED) unconditionally.

(Right now you loop through all namespaces, and only set the support bit
if you find a zoned namespace.)

> > Like I wrote in my review comment in the patch that added support for the new
> > allocated CNS values, I prefer if we remove this for-loop completely, and
> > simply set attached = true in nvme_ns_setup()/nvme_ns_init() instead.
> > 
> > (I was considering if we should set attach = true in nvme_zoned_init_ns(),
> > but because nvme_ns_setup()/nvme_ns_init() is called for all namespaces,
> > including ZNS namespaces, I don't think that any additional code in
> > nvme_zoned_init_ns() is warranted.)
> 
> I think the CC.CSS value is not available during namespace setup, and if
> we assign the active flag in nvme_zoned_ns_setup(), zoned namespaces may
> end up being active even if the NVM Only command set is selected. So
> keeping this loop seems like a good idea.

It is true that CC.CSS is not yet available during namespace setup,
but since the controller itself will never detach namespaces based on
CC.CSS, why are we dependent on CC.CSS being available?

Sure, once someone implements namespace management, they will need
to read whether a certain namespace is attached or detached from some
persistent state, perhaps in the zone meta-data file, and set the
attached boolean in nvme_ns_init() accordingly, but I still don't see
any dependence on CC.CSS even when namespace management is implemented.



Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-05 11:41       ` Niklas Cassel
@ 2020-10-05 23:08         ` Dmitry Fomichev
  0 siblings, 0 replies; 46+ messages in thread
From: Dmitry Fomichev @ 2020-10-05 23:08 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: fam, kwolf, Damien Le Moal, qemu-block, k.jensen, qemu-devel,
	mlevitsk, Alistair Francis, kbusch, philmd, Matias Bjorling

> -----Original Message-----
> From: Niklas Cassel <Niklas.Cassel@wdc.com>
> Sent: Monday, October 5, 2020 7:41 AM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Alistair Francis <Alistair.Francis@wdc.com>; qemu-devel@nongnu.org;
> Damien Le Moal <Damien.LeMoal@wdc.com>; fam@euphon.net; Matias
> Bjorling <Matias.Bjorling@wdc.com>; qemu-block@nongnu.org;
> kwolf@redhat.com; mlevitsk@redhat.com; k.jensen@samsung.com;
> kbusch@kernel.org; philmd@redhat.com
> Subject: Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace
> Command Set
> 
> On Sun, Oct 04, 2020 at 11:57:07PM +0000, Dmitry Fomichev wrote:
> > On Wed, 2020-09-30 at 14:50 +0000, Niklas Cassel wrote:
> > > On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> > > > The emulation code has been changed to advertise NVM Command Set when
> > > > "zoned" device property is not set (default) and Zoned Namespace
> > > > Command Set otherwise.
> > > >
> > > > Handlers for three new NVMe commands introduced in Zoned Namespace
> > > > Command Set specification are added, namely for Zone Management
> > > > Receive, Zone Management Send and Zone Append.
> > > >
> > > > Device initialization code has been extended to create a proper
> > > > configuration for zoned operation using device properties.
> > > >
> > > > Read/Write command handler is modified to only allow writes at the
> > > > write pointer if the namespace is zoned. For Zone Append command,
> > > > writes implicitly happen at the write pointer and the starting write
> > > > pointer value is returned as the result of the command. Write Zeroes
> > > > handler is modified to add zoned checks that are identical to those
> > > > done as a part of Write flow.
> > > >
> > > > The code to support for Zone Descriptor Extensions is not included in
> > > > this commit and ZDES 0 is always reported. A later commit in this
> > > > series will add ZDE support.
> > > >
> > > > This commit doesn't yet include checks for active and open zone
> > > > limits. It is assumed that there are no limits on either active or
> > > > open zones.
> > > >
> > > > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > > > Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> > > > Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> > > > Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> > > > Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> > > > Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> > > > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > > > Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> > > > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > > ---
> > > >  block/nvme.c         |   2 +-
> > > >  hw/block/nvme-ns.c   | 185 ++++++++-
> > > >  hw/block/nvme-ns.h   |   6 +-
> > > >  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
> > > >  include/block/nvme.h |   6 +-
> > > >  5 files changed, 1033 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/block/nvme.c b/block/nvme.c
> > > > index 05485fdd11..7a513c9a17 100644
> > > > --- a/block/nvme.c
> > > > +++ b/block/nvme.c
> 
> (snip)
> 
> > >
> > > Please read my comment on nvme_identify_nslist_csi() before reading
> > > this comment.
> > >
> > > At least for this function, the specification is clear:
> > >
> > > "If the host requests a data structure for an I/O Command Set that the
> > > controller does not support, the controller shall abort the command with
> > > a status of Invalid Field in Command."
> > >
> > > If the controller supports the I/O command set == if the Command Set bit
> > > is set in the data struct returned by the nvme_identify_cmd_set(),
> > > so here we should do something like:
> > >
> > > } else if (->csi == NVME_CSI_ZONED && ctrl_has_zns_namespaces()) {
> > > 	...
> > > }
> > >
> >
> > With this commit, the controller supports ZNS command set regardless of
> > the number of attached ZNS namespaces. It could be zero, but the controller
> > still supports it. I think it would be better not to change the behavior
> > of this command to depend on whether there are any ZNS namespaces added
> > or not.
> 
> Ok, always having ZNS Command Set support, regardless of whether a user
> defines a zoned namespace on the QEMU command line or not, does simplify
> things.
> 
> But then in nvme_identify_cmd_set(), you need to call
> NVME_SET_CSI(*list, NVME_CSI_ZONED) unconditionally.
> 

Perhaps,
NVME_SET_CSI(*list, NVME_CSI_NVM)
NVME_SET_CSI(*list, NVME_CSI_ZONED)

since this is a vector...
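
For example, a standalone sketch of the bit-vector idea (the macro and
CSI values mirror what this series uses, but treat the exact definitions
here as assumptions):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NVME_CSI_NVM    0x0     /* NVM Command Set */
#define NVME_CSI_ZONED  0x2     /* Zoned Namespace Command Set */

/* Set bit 'csi' in the first byte of the I/O Command Set vector. */
#define NVME_SET_CSI(vec, csi)  ((vec) |= (uint8_t)(1 << (csi)))

int main(void)
{
    uint8_t list[4096];             /* Identify data buffer */

    memset(list, 0, sizeof(list));

    /* Advertise both command sets unconditionally. */
    NVME_SET_CSI(*list, NVME_CSI_NVM);
    NVME_SET_CSI(*list, NVME_CSI_ZONED);

    printf("vector byte 0: 0x%02x\n", list[0]);  /* prints 0x05 */
    return 0;
}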

> (Right now you loop through all namespaces, and only set the support bit
> if you find a zoned namespace.)
> 
> > > Like I wrote in my review comment in the patch that added support for
> > > the new allocated CNS values, I prefer if we remove this for-loop
> > > completely, and simply set attached = true in
> > > nvme_ns_setup()/nvme_ns_init() instead.
> > >
> > > (I was considering if we should set attach = true in nvme_zoned_init_ns(),
> > > but because nvme_ns_setup()/nvme_ns_init() is called for all namespaces,
> > > including ZNS namespaces, I don't think that any additional code in
> > > nvme_zoned_init_ns() is warranted.)
> >
> > I think CC.CSS value is not available during namespace setup and if we
> > assign active flag in nvme_zoned_ns_setup(), zoned namespaces may end up
> > being active even if NVM Only command set is selected. So keeping this
> > loop seems like a good idea.
> 
> It is true that CC.CSS is not yet available during namespace setup,
> but since the controller itself will never detach namespaces based on
> CC.CSS, why are we dependent on CC.CSS being available?
> 
> Sure, once someone implements namespace management, they will need
> to read if a certain namespace is attached or detached from some
> persistent state, perhaps in the zone meta-data file, and set the
> attached boolean in nvme_ns_init() accordingly, but I still don't see
> any dependence on CC.CSS even when namespace management is implemented.
> 

Ok, thanks for the clarification. I think it would be best to add an
"attached" property to the namespace code instead of hardcoding it to true.
The default, of course, will be true, but it will be possible to set it to
false to test how everything works with inactive namespaces. And yes, no
dependence on CC.CSS.
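
Roughly like this (just a sketch; the property name, struct fields, and
placement are assumptions rather than final code):

static Property nvme_ns_props[] = {
    DEFINE_PROP_DRIVE("drive", NvmeNamespace, blkconf.blk),
    DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
    /* New: lets a namespace start inactive for testing; default true. */
    DEFINE_PROP_BOOL("attached", NvmeNamespace, params.attached, true),
    DEFINE_PROP_END_OF_LIST(),
};

So something like "-device nvme-ns,drive=nvm0,attached=false" would then
create a namespace that starts out inactive.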

> 
> 
> Kind regards,
> Niklas



end of thread

Thread overview: 46+ messages
2020-09-28  2:35 [PATCH v5 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 01/14] hw/block/nvme: Report actual LBA data shift in LBAF Dmitry Fomichev
2020-09-28  8:51   ` Klaus Jensen
2020-09-28  2:35 ` [PATCH v5 02/14] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 03/14] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
2020-09-30  8:08   ` Klaus Jensen
2020-09-30 15:21   ` Keith Busch
2020-09-28  2:35 ` [PATCH v5 04/14] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
2020-09-30  8:15   ` Klaus Jensen
2020-09-30 12:47   ` Niklas Cassel
2020-10-01 11:22   ` Niklas Cassel
2020-10-01 15:29     ` Keith Busch
2020-10-01 15:50       ` Niklas Cassel
2020-10-01 15:59         ` Keith Busch
2020-10-01 16:23           ` Niklas Cassel
2020-10-01 17:08             ` Keith Busch
2020-10-01 22:15   ` Klaus Jensen
2020-10-01 22:30     ` Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces Dmitry Fomichev
2020-09-30 13:50   ` Niklas Cassel
2020-10-04 23:54     ` Dmitry Fomichev
2020-10-05 11:26       ` Niklas Cassel
2020-09-28  2:35 ` [PATCH v5 07/14] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 08/14] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
2020-09-28  6:44   ` Klaus Jensen
2020-09-28 10:42   ` Klaus Jensen
2020-09-30  5:20     ` Klaus Jensen
2020-10-05  0:53     ` Dmitry Fomichev
2020-09-30  5:59   ` Klaus Jensen
2020-10-04 23:48     ` Dmitry Fomichev
2020-09-30 14:50   ` Niklas Cassel
2020-09-30 18:23     ` Klaus Jensen
2020-10-04 23:57     ` Dmitry Fomichev
2020-10-05 11:41       ` Niklas Cassel
2020-10-05 23:08         ` Dmitry Fomichev
2020-09-30 15:12   ` Niklas Cassel
2020-09-28  2:35 ` [PATCH v5 10/14] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 11/14] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 12/14] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
2020-09-28  2:35 ` [PATCH v5 13/14] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
2020-09-28  7:51   ` Klaus Jensen
2020-09-29 15:43     ` Dmitry Fomichev
2020-09-29 16:46       ` Klaus Jensen
2020-09-28  2:35 ` [PATCH v5 14/14] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
