All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
@ 2020-10-19  2:17 Dmitry Fomichev
  2020-10-19  2:17 ` [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

v6 -> v7:

 - Introduce ns->iocs initialization function earlier in the series,
   in CSE Log patch.

 - Set NVM iocs for zoned namespaces when CC.CSS is set to
   NVME_CC_CSS_NVM.

 - Clean up code in CSE log handler.
 
v5 -> v6:

 - Remove zoned state persistence code. Replace position-independent
   zone lists with QTAILQs.

 - Close all open zones upon clearing of the controller. This is
   a similar procedure to the one previously performed upon powering
   up with zone persistence. 

 - Squash NS Types and ZNS triplets of commits to keep definitions
   and trace event definitions together with the implementation code.

 - Move namespace UUID generation to a separate patch. Add the new
   "uuid" property as suggested by Klaus.

 - Rework Commands and Effects patch to make sure that the log is
   always in sync with the actual set of commands supported.

 - Add two refactoring commits at the end of the series to
   optimize read and write i/o path.

- Incorporate feedback from Keith, Klaus and Niklas:

  * fix rebase errors in nvme_identify_ns_descr_list()
  * remove unnecessary code from nvme_write_bar()
  * move csi to NvmeNamespace and use it from the beginning in NSTypes
    patch
  * change zone read processing to cover all corner cases with RAZB=1
  * sync w_ptr and d.wp in case of a i/o error at the preceding zone
  * reword the commit message in active/inactive patch with the new
    text from Niklas
  * correct dlfeat reporting depending on the fill pattern set
  * add more checks for "attached" n/s parameter to prevent i/o and
    get/set features on inactive namespaces
  * Use DEFINE_PROP_SIZE and DEFINE_PROP_SIZE32 for zone size/capacity
    and ZASL respectively
  * Improve zone size and capacity validation
  * Correctly report NSZE

v4 -> v5:

 - Rebase to the current qemu-nvme.

 - Use HostMemoryBackendFile as the backing storage for persistent
   zone metadata.

 - Fix the issue with filling the valid data in the next zone if RAZBi
   is enabled.

v3 -> v4:

 - Fix bugs introduced in v2/v3 for QD > 1 operation. Now, all writes
   to a zone happen at the new write pointer variable, zone->w_ptr,
   that is advanced right after submitting the backend i/o. The existing
   zone->d.wp variable is updated upon the successful write completion
   and it is used for zone reporting. Some code has been split from
   nvme_finalize_zoned_write() function to a new function,
   nvme_advance_zone_wp().

 - Make the code compile under mingw. Switch to using QEMU API for
   mmap/msync, i.e. memory_region...(). Since mmap is not available in
   mingw (even though there is mman-win32 library available on Github),
   conditional compilation is added around these calls to avoid
   undefined symbols under mingw. A better fix would be to add stub
   functions to softmmu/memory.c for the case when CONFIG_POSIX is not
   defined, but such change is beyond the scope of this patchset and it
   can be made in a separate patch.

 - Correct permission mask used to open zone metadata file.

 - Fold "Define 64 bit cqe.result" patch into ZNS commit.

 - Use clz64/clz32 instead of defining nvme_ilog2() function.

 - Simplify rpt_empty_id_struct() code, move nvme_fill_data() back
   to ZNS patch.

 - Fix a power-on processing bug.

 - Rename NVME_CMD_ZONE_APND to NVME_CMD_ZONE_APPEND.

 - Make the list of review comments addressed in v2 of the series
   (see below).

v2 -> v3:

 - Moved nvme_fill_data() function to the NSTypes patch as it is
   now used there to output empty namespace identify structs.
 - Fixed typo in Maxim's email address.

v1 -> v2:

 - Rebased on top of qemu-nvme/next branch.
 - Incorporated feedback from Klaus and Alistair.
    * Allow a subset of CSE log to be read, not the entire log
    * Assign admin command entries in CSE log to ACS fields
    * Set LPA bit 1 to indicate support of CSE log page
    * Rename CC.CSS value CSS_ALL_NSTYPES (110b) to CSS_CSI
    * Move the code to assign lbaf.ds to a separate patch
    * Remove the change in firmware revision
    * Change "driver" to "device" in comments and annotations
    * Rename ZAMDS to ZASL
    * Correct a few format expressions and some wording in
      trace event definitions
    * Remove validation code to return NVME_CAP_EXCEEDED error
    * Make ZASL to be equal to MDTS if "zone_append_size_limit"
      module parameter is not set
    * Clean up nvme_zoned_init_ctrl() to make size calculations
      less confusing
    * Avoid changing module parameters, use separate n/s variables
      if additional calculations are necessary to convert parameters
      to running values
    * Use NVME_DEFAULT_ZONE_SIZE to assign the default zone size value
    * Use default 0 for zone capacity meaning that zone capacity will
      be equal to zone size by default
    * Issue warnings if user MAR/MOR values are too large and have
      to be adjusted
    * Use unsigned values for MAR/MOR
 - Dropped "Simulate Zone Active excursions" patch.
   Excursion behavior may depend on the internal controller
   architecture and therefore be vendor-specific.
 - Dropped support for Zone Attributes and zoned AENs for now.
   These features can be added in a future series.
 - NS Types support is extended to handle active/inactive namespaces.
 - Update the write pointer after backing storage I/O completion, not
   before. This makes the emulation to run correctly in case of
   backing device failures.
 - Avoid division in the I/O path if the device zone size is
   a power of two (the most common case). Zone index then can be
   calculated by using bit shift.
 - A few reported bugs have been fixed.
 - Indentation in function definitions has been changed to make it
   the same as the rest of the code.


Zoned Namespace (ZNS) Command Set is a newly introduced command set
published by the NVM Express, Inc. organization as TP 4053. The main
design goals of ZNS are to provide hardware designers the means to
reduce NVMe controller complexity and to allow achieving a better I/O
latency and throughput. SSDs that implement this interface are
commonly known as ZNS SSDs.

This command set is implementing a zoned storage model, similarly to
ZAC/ZBC. As such, there is already support in Linux, allowing one to
perform the majority of tasks needed for managing ZNS SSDs.

The Zoned Namespace Command Set relies on another TP, known as
Namespace Types (NVMe TP 4056), which introduces support for having
multiple command sets per namespace.

Both ZNS and Namespace Types specifications can be downloaded by
visiting the following link -

https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip

This patch series adds Namespace Types support and zoned namespace
emulation capability to the existing NVMe PCI device.

Based-on: <20201013174826.GA1049145@dhcp-10-100-145-180.wdl.wdc.com>

Dmitry Fomichev (9):
  hw/block/nvme: Add Commands Supported and Effects log
  hw/block/nvme: Generate namespace UUIDs
  hw/block/nvme: Support Zoned Namespace Command Set
  hw/block/nvme: Introduce max active and open zone limits
  hw/block/nvme: Support Zone Descriptor Extensions
  hw/block/nvme: Add injection of Offline/Read-Only zones
  hw/block/nvme: Document zoned parameters in usage text
  hw/block/nvme: Separate read and write handlers
  hw/block/nvme: Merge nvme_write_zeroes() with nvme_write()

Niklas Cassel (2):
  hw/block/nvme: Add support for Namespace Types
  hw/block/nvme: Support allocated CNS command variants

 block/nvme.c          |    2 +-
 hw/block/nvme-ns.c    |  295 ++++++++
 hw/block/nvme-ns.h    |  109 +++
 hw/block/nvme.c       | 1550 ++++++++++++++++++++++++++++++++++++++---
 hw/block/nvme.h       |    9 +
 hw/block/trace-events |   36 +-
 include/block/nvme.h  |  201 +++++-
 7 files changed, 2078 insertions(+), 124 deletions(-)

-- 
2.21.0



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19 19:22   ` Keith Busch
  2020-10-19 20:16   ` Klaus Jensen
  2020-10-19  2:17 ` [PATCH v7 02/11] hw/block/nvme: Generate namespace UUIDs Dmitry Fomichev
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

This log page becomes necessary to implement to allow checking for
Zone Append command support in Zoned Namespace Command Set.

This commit adds the code to report this log page for NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.

All incoming admin and i/o commands are now only processed if their
corresponding support bits are set in this log. This provides an
easy way to control what commands to support and what not to
depending on set CC.CSS.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.h    |  1 +
 hw/block/nvme.c       | 98 +++++++++++++++++++++++++++++++++++++++----
 hw/block/trace-events |  2 +
 include/block/nvme.h  | 19 +++++++++
 4 files changed, 111 insertions(+), 9 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606..ea8c2f785d 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -29,6 +29,7 @@ typedef struct NvmeNamespace {
     int32_t      bootindex;
     int64_t      size;
     NvmeIdNs     id_ns;
+    const uint32_t *iocs;
 
     NvmeNamespaceParams params;
 } NvmeNamespace;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 9d30ca69dc..5a9493d89f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -111,6 +111,28 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
     [NVME_TIMESTAMP]                = NVME_FEAT_CAP_CHANGE,
 };
 
+static const uint32_t nvme_cse_acs[256] = {
+    [NVME_ADM_CMD_DELETE_SQ]        = NVME_CMD_EFF_CSUPP,
+    [NVME_ADM_CMD_CREATE_SQ]        = NVME_CMD_EFF_CSUPP,
+    [NVME_ADM_CMD_DELETE_CQ]        = NVME_CMD_EFF_CSUPP,
+    [NVME_ADM_CMD_CREATE_CQ]        = NVME_CMD_EFF_CSUPP,
+    [NVME_ADM_CMD_IDENTIFY]         = NVME_CMD_EFF_CSUPP,
+    [NVME_ADM_CMD_SET_FEATURES]     = NVME_CMD_EFF_CSUPP,
+    [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
+    [NVME_ADM_CMD_GET_LOG_PAGE]     = NVME_CMD_EFF_CSUPP,
+    [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
+};
+
+static const uint32_t nvme_cse_iocs_none[256] = {
+};
+
+static const uint32_t nvme_cse_iocs_nvm[256] = {
+    [NVME_CMD_FLUSH]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+    [NVME_CMD_WRITE_ZEROES]         = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+    [NVME_CMD_WRITE]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+    [NVME_CMD_READ]                 = NVME_CMD_EFF_CSUPP,
+};
+
 static void nvme_process_sq(void *opaque);
 
 static uint16_t nvme_cid(NvmeRequest *req)
@@ -1032,10 +1054,6 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
                           req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
 
-    if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
-        return NVME_INVALID_OPCODE | NVME_DNR;
-    }
-
     if (!nvme_nsid_valid(n, nsid)) {
         return NVME_INVALID_NSID | NVME_DNR;
     }
@@ -1045,6 +1063,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
+    if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
+        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
     switch (req->cmd.opcode) {
     case NVME_CMD_FLUSH:
         return nvme_flush(n, req);
@@ -1054,8 +1077,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_READ:
         return nvme_rw(n, req);
     default:
-        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
-        return NVME_INVALID_OPCODE | NVME_DNR;
+        assert(false);
     }
 }
 
@@ -1291,6 +1313,39 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
+                                 uint64_t off, NvmeRequest *req)
+{
+    NvmeEffectsLog log = {};
+    const uint32_t *src_iocs = NULL;
+    uint32_t trans_len;
+
+    trace_pci_nvme_cmd_supp_and_effects_log_read();
+
+    if (off >= sizeof(log)) {
+        trace_pci_nvme_err_invalid_effects_log_offset(off);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    switch (NVME_CC_CSS(n->bar.cc)) {
+    case NVME_CC_CSS_NVM:
+        src_iocs = nvme_cse_iocs_nvm;
+    case NVME_CC_CSS_ADMIN_ONLY:
+        break;
+    }
+
+    memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));
+
+    if (src_iocs) {
+        memcpy(log.iocs, src_iocs, sizeof(log.iocs));
+    }
+
+    trans_len = MIN(sizeof(log) - off, buf_len);
+
+    return nvme_dma(n, ((uint8_t *)&log) + off, trans_len,
+                    DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCmd *cmd = &req->cmd;
@@ -1334,6 +1389,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
         return nvme_smart_info(n, rae, len, off, req);
     case NVME_LOG_FW_SLOT_INFO:
         return nvme_fw_log_info(n, len, off, req);
+    case NVME_LOG_CMD_EFFECTS:
+        return nvme_cmd_effects(n, len, off, req);
     default:
         trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1920,6 +1977,11 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
     trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
                              nvme_adm_opc_str(req->cmd.opcode));
 
+    if (!(nvme_cse_acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
+        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
     switch (req->cmd.opcode) {
     case NVME_ADM_CMD_DELETE_SQ:
         return nvme_del_sq(n, req);
@@ -1942,8 +2004,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_ADM_CMD_ASYNC_EV_REQ:
         return nvme_aer(n, req);
     default:
-        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
-        return NVME_INVALID_OPCODE | NVME_DNR;
+        assert(false);
     }
 }
 
@@ -2031,6 +2092,23 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
     n->bar.cc = 0;
 }
 
+static void nvme_select_ns_iocs(NvmeCtrl *n)
+{
+    NvmeNamespace *ns;
+    int i;
+
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+        ns->iocs = nvme_cse_iocs_none;
+        if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
+            ns->iocs = nvme_cse_iocs_nvm;
+        }
+    }
+}
+
 static int nvme_start_ctrl(NvmeCtrl *n)
 {
     uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
@@ -2129,6 +2207,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
     QTAILQ_INIT(&n->aer_queue);
 
+    nvme_select_ns_iocs(n);
+
     return 0;
 }
 
@@ -2737,7 +2817,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->acl = 3;
     id->aerl = n->params.aerl;
     id->frmw = (NVME_NUM_FW_SLOTS << 1) | NVME_FRMW_SLOT1_RO;
-    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_EXTENDED;
+    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_CSE | NVME_LPA_EXTENDED;
 
     /* recommended default value (~70 C) */
     id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index fac5995d94..0ae9cb0d35 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -85,6 +85,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
 pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
+pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
@@ -104,6 +105,7 @@ pci_nvme_err_invalid_prp(void) "invalid PRP"
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 6de2d5aa75..4779495b7d 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -744,10 +744,27 @@ enum NvmeSmartWarn {
     NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
 };
 
+typedef struct NvmeEffectsLog {
+    uint32_t    acs[256];
+    uint32_t    iocs[256];
+    uint8_t     resv[2048];
+} NvmeEffectsLog;
+
+enum {
+    NVME_CMD_EFF_CSUPP      = 1 << 0,
+    NVME_CMD_EFF_LBCC       = 1 << 1,
+    NVME_CMD_EFF_NCC        = 1 << 2,
+    NVME_CMD_EFF_NIC        = 1 << 3,
+    NVME_CMD_EFF_CCC        = 1 << 4,
+    NVME_CMD_EFF_CSE_MASK   = 3 << 16,
+    NVME_CMD_EFF_UUID_SEL   = 1 << 19,
+};
+
 enum NvmeLogIdentifier {
     NVME_LOG_ERROR_INFO     = 0x01,
     NVME_LOG_SMART_INFO     = 0x02,
     NVME_LOG_FW_SLOT_INFO   = 0x03,
+    NVME_LOG_CMD_EFFECTS    = 0x05,
 };
 
 typedef struct QEMU_PACKED NvmePSD {
@@ -860,6 +877,7 @@ enum NvmeIdCtrlFrmw {
 
 enum NvmeIdCtrlLpa {
     NVME_LPA_NS_SMART = 1 << 0,
+    NVME_LPA_CSE      = 1 << 1,
     NVME_LPA_EXTENDED = 1 << 2,
 };
 
@@ -1059,6 +1077,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 02/11] hw/block/nvme: Generate namespace UUIDs
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
  2020-10-19  2:17 ` [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19 19:24   ` Keith Busch
  2020-10-19 19:30   ` Klaus Jensen
  2020-10-19  2:17 ` [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

In NVMe 1.4, a namespace must report an ID descriptor of UUID type
if it doesn't support EUI64 or NGUID. Add a new namespace property,
"uuid", that provides the user the option to either specify the UUID
explicitly or have a UUID generated automatically every time a
namespace is initialized.

Suggested-by: Klaus Jansen <its@irrelevant.dk>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jansen <its@irrelevant.dk>
---
 hw/block/nvme-ns.c | 1 +
 hw/block/nvme-ns.h | 1 +
 hw/block/nvme.c    | 9 +++++----
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index b69cdaf27e..de735eb9f3 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -129,6 +129,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 static Property nvme_ns_props[] = {
     DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
     DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
+    DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index ea8c2f785d..a38071884a 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -21,6 +21,7 @@
 
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
+    QemuUUID uuid;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5a9493d89f..29139d8a17 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1574,6 +1574,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
 
 static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
 {
+    NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
     uint32_t nsid = le32_to_cpu(c->nsid);
     uint8_t list[NVME_IDENTIFY_DATA_SIZE];
@@ -1593,7 +1594,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
-    if (unlikely(!nvme_ns(n, nsid))) {
+    ns = nvme_ns(n, nsid);
+    if (unlikely(!ns)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1602,12 +1604,11 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
     /*
      * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
      * structure, a Namespace UUID (nidt = 0x3) must be reported in the
-     * Namespace Identification Descriptor. Add a very basic Namespace UUID
-     * here.
+     * Namespace Identification Descriptor. Add the namespace UUID here.
      */
     ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
     ns_descrs->uuid.hdr.nidl = NVME_NIDT_UUID_LEN;
-    stl_be_p(&ns_descrs->uuid.v, nsid);
+    memcpy(&ns_descrs->uuid.v, ns->params.uuid.data, NVME_NIDT_UUID_LEN);
 
     return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
                     DMA_DIRECTION_FROM_DEVICE, req);
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
  2020-10-19  2:17 ` [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
  2020-10-19  2:17 ` [PATCH v7 02/11] hw/block/nvme: Generate namespace UUIDs Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19 19:51   ` Keith Busch
  2020-10-19 20:53   ` Klaus Jensen
  2020-10-19  2:17 ` [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants Dmitry Fomichev
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

Define the structures and constants required to implement
Namespace Types support.

Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for the command set and enable
detection for the NVM Command Set.

The new workflows for identify commands rely heavily on zero-filled
identify structs. E.g., certain CNS commands are defined to return
a zero-filled identify struct when an inactive namespace NSID
is supplied.

Add a helper function in order to avoid code duplication when
reporting zero-filled identify structures.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.c    |   2 +
 hw/block/nvme-ns.h    |   1 +
 hw/block/nvme.c       | 169 +++++++++++++++++++++++++++++++++++-------
 hw/block/trace-events |   7 ++
 include/block/nvme.h  |  65 ++++++++++++----
 5 files changed, 202 insertions(+), 42 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index de735eb9f3..c0362426cc 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -41,6 +41,8 @@ static void nvme_ns_init(NvmeNamespace *ns)
 
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
 
+    ns->csi = NVME_CSI_NVM;
+
     /* no thin provisioning */
     id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index a38071884a..d795e44bab 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,6 +31,7 @@ typedef struct NvmeNamespace {
     int64_t      size;
     NvmeIdNs     id_ns;
     const uint32_t *iocs;
+    uint8_t      csi;
 
     NvmeNamespaceParams params;
 } NvmeNamespace;
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 29139d8a17..ca0d0abf5c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1503,6 +1503,13 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
+{
+    uint8_t id[NVME_IDENTIFY_DATA_SIZE] = {};
+
+    return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_pci_nvme_identify_ctrl();
@@ -1511,11 +1518,23 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+
+    trace_pci_nvme_identify_ctrl_csi(c->csi);
+
+    if (c->csi == NVME_CSI_NVM) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
+    return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
-    NvmeIdNs *id_ns, inactive = { 0 };
     uint32_t nsid = le32_to_cpu(c->nsid);
 
     trace_pci_nvme_identify_ns(nsid);
@@ -1526,23 +1545,46 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
 
     ns = nvme_ns(n, nsid);
     if (unlikely(!ns)) {
-        id_ns = &inactive;
-    } else {
-        id_ns = &ns->id_ns;
+        return nvme_rpt_empty_id_struct(n, req);
     }
 
-    return nvme_dma(n, (uint8_t *)id_ns, sizeof(NvmeIdNs),
+    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeNamespace *ns;
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    uint32_t nsid = le32_to_cpu(c->nsid);
+
+    trace_pci_nvme_identify_ns_csi(nsid, c->csi);
+
+    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
+        return NVME_INVALID_NSID | NVME_DNR;
+    }
+
+    ns = nvme_ns(n, nsid);
+    if (unlikely(!ns)) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
+    if (c->csi == NVME_CSI_NVM) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
+    return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
 {
+    NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
-    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
     uint32_t min_nsid = le32_to_cpu(c->nsid);
-    uint32_t *list;
-    uint16_t ret;
-    int j = 0;
+    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
+    static const int data_len = sizeof(list);
+    uint32_t *list_ptr = (uint32_t *)list;
+    int i, j = 0;
 
     trace_pci_nvme_identify_nslist(min_nsid);
 
@@ -1556,20 +1598,54 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
-    list = g_malloc0(data_len);
-    for (int i = 1; i <= n->num_namespaces; i++) {
-        if (i <= min_nsid || !nvme_ns(n, i)) {
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
             continue;
         }
-        list[j++] = cpu_to_le32(i);
+        if (ns->params.nsid < min_nsid) {
+            continue;
+        }
+        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
         if (j == data_len / sizeof(uint32_t)) {
             break;
         }
     }
-    ret = nvme_dma(n, (uint8_t *)list, data_len, DMA_DIRECTION_FROM_DEVICE,
-                   req);
-    g_free(list);
-    return ret;
+
+    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
+}
+
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeNamespace *ns;
+    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    uint32_t min_nsid = le32_to_cpu(c->nsid);
+    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
+    static const int data_len = sizeof(list);
+    uint32_t *list_ptr = (uint32_t *)list;
+    int i, j = 0;
+
+    trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
+
+    if (c->csi != NVME_CSI_NVM) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+        if (ns->params.nsid < min_nsid) {
+            continue;
+        }
+        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
+        if (j == data_len / sizeof(uint32_t)) {
+            break;
+        }
+    }
+
+    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
@@ -1577,13 +1653,17 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
     uint32_t nsid = le32_to_cpu(c->nsid);
-    uint8_t list[NVME_IDENTIFY_DATA_SIZE];
+    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
 
     struct data {
         struct {
             NvmeIdNsDescr hdr;
-            uint8_t v[16];
+            uint8_t v[NVME_NIDL_UUID];
         } uuid;
+        struct {
+            NvmeIdNsDescr hdr;
+            uint8_t v;
+        } csi;
     };
 
     struct data *ns_descrs = (struct data *)list;
@@ -1599,19 +1679,31 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    memset(list, 0x0, sizeof(list));
-
     /*
      * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
      * structure, a Namespace UUID (nidt = 0x3) must be reported in the
      * Namespace Identification Descriptor. Add the namespace UUID here.
      */
     ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
-    ns_descrs->uuid.hdr.nidl = NVME_NIDT_UUID_LEN;
-    memcpy(&ns_descrs->uuid.v, ns->params.uuid.data, NVME_NIDT_UUID_LEN);
+    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
+    memcpy(&ns_descrs->uuid.v, ns->params.uuid.data, NVME_NIDL_UUID);
 
-    return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
-                    DMA_DIRECTION_FROM_DEVICE, req);
+    ns_descrs->csi.hdr.nidt = NVME_NIDT_CSI;
+    ns_descrs->csi.hdr.nidl = NVME_NIDL_CSI;
+    ns_descrs->csi.v = ns->csi;
+
+    return nvme_dma(n, list, sizeof(list), DMA_DIRECTION_FROM_DEVICE, req);
+}
+
+static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
+{
+    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
+    static const int data_len = sizeof(list);
+
+    trace_pci_nvme_identify_cmd_set();
+
+    NVME_SET_CSI(*list, NVME_CSI_NVM);
+    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
@@ -1621,12 +1713,20 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
     switch (le32_to_cpu(c->cns)) {
     case NVME_ID_CNS_NS:
         return nvme_identify_ns(n, req);
+    case NVME_ID_CNS_CS_NS:
+        return nvme_identify_ns_csi(n, req);
     case NVME_ID_CNS_CTRL:
         return nvme_identify_ctrl(n, req);
+    case NVME_ID_CNS_CS_CTRL:
+        return nvme_identify_ctrl_csi(n, req);
     case NVME_ID_CNS_NS_ACTIVE_LIST:
         return nvme_identify_nslist(n, req);
+    case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
+        return nvme_identify_nslist_csi(n, req);
     case NVME_ID_CNS_NS_DESCR_LIST:
         return nvme_identify_ns_descr_list(n, req);
+    case NVME_ID_CNS_IO_COMMAND_SET:
+        return nvme_identify_cmd_set(n, req);
     default:
         trace_pci_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1807,7 +1907,9 @@ defaults:
         if (iv == n->admin_cq.vector) {
             result |= NVME_INTVC_NOCOALESCING;
         }
-
+        break;
+    case NVME_COMMAND_SET_PROFILE:
+        result = 0;
         break;
     default:
         result = nvme_feature_default[fid];
@@ -1948,6 +2050,12 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, req);
+    case NVME_COMMAND_SET_PROFILE:
+        if (dw11 & 0x1ff) {
+            trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
+            return NVME_CMD_SET_CMB_REJECTED | NVME_DNR;
+        }
+        break;
     default:
         return NVME_FEAT_NOT_CHANGEABLE | NVME_DNR;
     }
@@ -2104,8 +2212,12 @@ static void nvme_select_ns_iocs(NvmeCtrl *n)
             continue;
         }
         ns->iocs = nvme_cse_iocs_none;
-        if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
-            ns->iocs = nvme_cse_iocs_nvm;
+        switch (ns->csi) {
+        case NVME_CSI_NVM:
+            if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
+                ns->iocs = nvme_cse_iocs_nvm;
+            }
+            break;
         }
     }
 }
@@ -2847,6 +2959,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_CQR(n->bar.cap, 1);
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
     NVME_CAP_SET_CSS(n->bar.cap, NVME_CAP_CSS_NVM);
+    NVME_CAP_SET_CSS(n->bar.cap, NVME_CAP_CSS_CSI_SUPP);
     NVME_CAP_SET_CSS(n->bar.cap, NVME_CAP_CSS_ADMIN_ONLY);
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 0ae9cb0d35..65b964c894 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -48,8 +48,12 @@ pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size,
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
 pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, cqid=%"PRIu16""
 pci_nvme_identify_ctrl(void) "identify controller"
+pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", csi=0x%"PRIx8""
+pci_nvme_identify_cmd_set(void) "identify i/o command set"
 pci_nvme_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
 pci_nvme_getfeat(uint16_t cid, uint32_t nsid, uint8_t fid, uint8_t sel, uint32_t cdw11) "cid %"PRIu16" nsid 0x%"PRIx32" fid 0x%"PRIx8" sel 0x%"PRIx8" cdw11 0x%"PRIx32""
@@ -106,6 +110,8 @@ pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
+pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
+pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
@@ -162,6 +168,7 @@ pci_nvme_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for
 pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
+pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
 
 # xen-block.c
 xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 4779495b7d..f5ac9143c4 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -84,6 +84,7 @@ enum NvmeCapMask {
 
 enum NvmeCapCss {
     NVME_CAP_CSS_NVM        = 1 << 0,
+    NVME_CAP_CSS_CSI_SUPP   = 1 << 6,
     NVME_CAP_CSS_ADMIN_ONLY = 1 << 7,
 };
 
@@ -117,9 +118,25 @@ enum NvmeCcMask {
 
 enum NvmeCcCss {
     NVME_CC_CSS_NVM        = 0x0,
+    NVME_CC_CSS_CSI        = 0x6,
     NVME_CC_CSS_ADMIN_ONLY = 0x7,
 };
 
+#define NVME_SET_CC_EN(cc, val)     \
+    (cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
+#define NVME_SET_CC_CSS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
+#define NVME_SET_CC_MPS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
+#define NVME_SET_CC_AMS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
+#define NVME_SET_CC_SHN(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
+#define NVME_SET_CC_IOSQES(cc, val) \
+    (cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
+#define NVME_SET_CC_IOCQES(cc, val) \
+    (cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
+
 enum NvmeCstsShift {
     CSTS_RDY_SHIFT      = 0,
     CSTS_CFS_SHIFT      = 1,
@@ -534,8 +551,13 @@ typedef struct QEMU_PACKED NvmeIdentify {
     uint64_t    rsvd2[2];
     uint64_t    prp1;
     uint64_t    prp2;
-    uint32_t    cns;
-    uint32_t    rsvd11[5];
+    uint8_t     cns;
+    uint8_t     rsvd10;
+    uint16_t    ctrlid;
+    uint16_t    nvmsetid;
+    uint8_t     rsvd11;
+    uint8_t     csi;
+    uint32_t    rsvd12[4];
 } NvmeIdentify;
 
 typedef struct QEMU_PACKED NvmeRwCmd {
@@ -655,6 +677,7 @@ enum NvmeStatusCodes {
     NVME_MD_SGL_LEN_INVALID     = 0x0010,
     NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
+    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
@@ -781,11 +804,15 @@ typedef struct QEMU_PACKED NvmePSD {
 
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
-enum {
-    NVME_ID_CNS_NS             = 0x0,
-    NVME_ID_CNS_CTRL           = 0x1,
-    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
-    NVME_ID_CNS_NS_DESCR_LIST  = 0x3,
+enum NvmeIdCns {
+    NVME_ID_CNS_NS                = 0x00,
+    NVME_ID_CNS_CTRL              = 0x01,
+    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x02,
+    NVME_ID_CNS_NS_DESCR_LIST     = 0x03,
+    NVME_ID_CNS_CS_NS             = 0x05,
+    NVME_ID_CNS_CS_CTRL           = 0x06,
+    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
+    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
@@ -933,6 +960,7 @@ enum NvmeFeatureIds {
     NVME_WRITE_ATOMICITY            = 0xa,
     NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
     NVME_TIMESTAMP                  = 0xe,
+    NVME_COMMAND_SET_PROFILE        = 0x19,
     NVME_SOFTWARE_PROGRESS_MARKER   = 0x80,
     NVME_FID_MAX                    = 0x100,
 };
@@ -1017,18 +1045,26 @@ typedef struct QEMU_PACKED NvmeIdNsDescr {
     uint8_t rsvd2[2];
 } NvmeIdNsDescr;
 
-enum {
-    NVME_NIDT_EUI64_LEN =  8,
-    NVME_NIDT_NGUID_LEN = 16,
-    NVME_NIDT_UUID_LEN  = 16,
+enum NvmeNsIdentifierLength {
+    NVME_NIDL_EUI64             = 8,
+    NVME_NIDL_NGUID             = 16,
+    NVME_NIDL_UUID              = 16,
+    NVME_NIDL_CSI               = 1,
 };
 
 enum NvmeNsIdentifierType {
-    NVME_NIDT_EUI64 = 0x1,
-    NVME_NIDT_NGUID = 0x2,
-    NVME_NIDT_UUID  = 0x3,
+    NVME_NIDT_EUI64             = 0x01,
+    NVME_NIDT_NGUID             = 0x02,
+    NVME_NIDT_UUID              = 0x03,
+    NVME_NIDT_CSI               = 0x04,
 };
 
+enum NvmeCsi {
+    NVME_CSI_NVM                = 0x00,
+};
+
+#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
+
 /*Deallocate Logical Block Features*/
 #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
 #define NVME_ID_NS_DLFEAT_WRITE_ZEROES(dlfeat)    ((dlfeat) & 0x08)
@@ -1079,6 +1115,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (2 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19 20:07   ` Keith Busch
  2020-10-20  8:21   ` Klaus Jensen
  2020-10-19  2:17 ` [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

Many CNS commands have "allocated" command variants. These include
a namespace as long as it is allocated, that is a namespace is
included regardless if it is active (attached) or not.

While these commands are optional (they are mandatory for controllers
supporting the namespace attachment command), our QEMU implementation
is more complete by actually providing support for these CNS values.

However, since our QEMU model currently does not support the namespace
attachment command, these new allocated CNS commands will return the
same result as the active CNS command variants.

In NVMe, a namespace is active if it exists and is attached to the
controller.

CAP.CSS (together with the I/O Command Set data structure) defines
what command sets are supported by the controller.

CC.CSS (together with Set Profile) can be set to enable a subset of
the available command sets.

Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
will still be attached (and thus marked as active).
Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
will still be attached (and thus marked as active).

However, any operation from a disabled command set will result in a
Invalid Command Opcode.

Add a new Boolean namespace property, "attached", to provide the most
basic namespace attachment support. The default value for this new
property is true. Also, implement the logic in the new CNS values to
include/exclude namespaces based on this new property. The only thing
missing is hooking up the actual Namespace Attachment command opcode,
which will allow a user to toggle the "attached" flag per namespace.

The reason for not hooking up this command completely is because the
NVMe specification requires the namespace management command to be
supported if the namespace attachment command is supported.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.c   |  1 +
 hw/block/nvme-ns.h   |  1 +
 hw/block/nvme.c      | 68 ++++++++++++++++++++++++++++++++++++--------
 include/block/nvme.h | 20 +++++++------
 4 files changed, 70 insertions(+), 20 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index c0362426cc..974aea33f7 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -132,6 +132,7 @@ static Property nvme_ns_props[] = {
     DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
     DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
     DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),
+    DEFINE_PROP_BOOL("attached", NvmeNamespace, params.attached, true),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index d795e44bab..d6b2808b97 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -21,6 +21,7 @@
 
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
+    bool     attached;
     QemuUUID uuid;
 } NvmeNamespaceParams;
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ca0d0abf5c..93728e51b3 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1062,6 +1062,9 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     if (unlikely(!req->ns)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
+    if (!req->ns->params.attached) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
 
     if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
         trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
@@ -1222,6 +1225,7 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
     uint32_t trans_len;
     NvmeNamespace *ns;
     time_t current_ms;
+    int i;
 
     if (off >= sizeof(smart)) {
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1232,15 +1236,18 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
         if (!ns) {
             return NVME_INVALID_NSID | NVME_DNR;
         }
-        nvme_set_blk_stats(ns, &stats);
+        if (ns->params.attached) {
+            nvme_set_blk_stats(ns, &stats);
+        }
     } else {
-        int i;
-
         for (i = 1; i <= n->num_namespaces; i++) {
             ns = nvme_ns(n, i);
             if (!ns) {
                 continue;
             }
+            if (!ns->params.attached) {
+                continue;
+            }
             nvme_set_blk_stats(ns, &stats);
         }
     }
@@ -1531,7 +1538,8 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
     return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
+                                 bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1548,11 +1556,16 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
         return nvme_rpt_empty_id_struct(n, req);
     }
 
+    if (only_active && !ns->params.attached) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
     return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
+                                     bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1569,6 +1582,10 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
         return nvme_rpt_empty_id_struct(n, req);
     }
 
+    if (only_active && !ns->params.attached) {
+        return nvme_rpt_empty_id_struct(n, req);
+    }
+
     if (c->csi == NVME_CSI_NVM) {
         return nvme_rpt_empty_id_struct(n, req);
     }
@@ -1576,7 +1593,8 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
     return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req,
+                                     bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1606,6 +1624,9 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
         if (ns->params.nsid < min_nsid) {
             continue;
         }
+        if (only_active && !ns->params.attached) {
+            continue;
+        }
         list_ptr[j++] = cpu_to_le32(ns->params.nsid);
         if (j == data_len / sizeof(uint32_t)) {
             break;
@@ -1615,7 +1636,8 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
     return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
+                                         bool only_active)
 {
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -1639,6 +1661,9 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
         if (ns->params.nsid < min_nsid) {
             continue;
         }
+        if (only_active && !ns->params.attached) {
+            continue;
+        }
         list_ptr[j++] = cpu_to_le32(ns->params.nsid);
         if (j == data_len / sizeof(uint32_t)) {
             break;
@@ -1712,17 +1737,25 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
 
     switch (le32_to_cpu(c->cns)) {
     case NVME_ID_CNS_NS:
-        return nvme_identify_ns(n, req);
+        return nvme_identify_ns(n, req, true);
     case NVME_ID_CNS_CS_NS:
-        return nvme_identify_ns_csi(n, req);
+        return nvme_identify_ns_csi(n, req, true);
+    case NVME_ID_CNS_NS_PRESENT:
+        return nvme_identify_ns(n, req, false);
+    case NVME_ID_CNS_CS_NS_PRESENT:
+        return nvme_identify_ns_csi(n, req, false);
     case NVME_ID_CNS_CTRL:
         return nvme_identify_ctrl(n, req);
     case NVME_ID_CNS_CS_CTRL:
         return nvme_identify_ctrl_csi(n, req);
     case NVME_ID_CNS_NS_ACTIVE_LIST:
-        return nvme_identify_nslist(n, req);
+        return nvme_identify_nslist(n, req, true);
     case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
-        return nvme_identify_nslist_csi(n, req);
+        return nvme_identify_nslist_csi(n, req, true);
+    case NVME_ID_CNS_NS_PRESENT_LIST:
+        return nvme_identify_nslist(n, req, false);
+    case NVME_ID_CNS_CS_NS_PRESENT_LIST:
+        return nvme_identify_nslist_csi(n, req, false);
     case NVME_ID_CNS_NS_DESCR_LIST:
         return nvme_identify_ns_descr_list(n, req);
     case NVME_ID_CNS_IO_COMMAND_SET:
@@ -1795,6 +1828,7 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
 
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
 {
+    NvmeNamespace *ns;
     NvmeCmd *cmd = &req->cmd;
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
     uint32_t dw11 = le32_to_cpu(cmd->cdw11);
@@ -1826,7 +1860,11 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
             return NVME_INVALID_NSID | NVME_DNR;
         }
 
-        if (!nvme_ns(n, nsid)) {
+        ns = nvme_ns(n, nsid);
+        if (!ns) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+        if (!ns->params.attached) {
             return NVME_INVALID_FIELD | NVME_DNR;
         }
     }
@@ -1968,6 +2006,9 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
             if (unlikely(!ns)) {
                 return NVME_INVALID_FIELD | NVME_DNR;
             }
+            if (!ns->params.attached) {
+                return NVME_INVALID_FIELD | NVME_DNR;
+            }
         }
     } else if (nsid && nsid != NVME_NSID_BROADCAST) {
         if (!nvme_nsid_valid(n, nsid)) {
@@ -2015,6 +2056,9 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
             if (!ns) {
                 continue;
             }
+            if (!ns->params.attached) {
+                continue;
+            }
 
             if (!(dw11 & 0x1) && blk_enable_write_cache(ns->blkconf.blk)) {
                 blk_flush(ns->blkconf.blk);
diff --git a/include/block/nvme.h b/include/block/nvme.h
index f5ac9143c4..27125c9d28 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -805,14 +805,18 @@ typedef struct QEMU_PACKED NvmePSD {
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
 enum NvmeIdCns {
-    NVME_ID_CNS_NS                = 0x00,
-    NVME_ID_CNS_CTRL              = 0x01,
-    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x02,
-    NVME_ID_CNS_NS_DESCR_LIST     = 0x03,
-    NVME_ID_CNS_CS_NS             = 0x05,
-    NVME_ID_CNS_CS_CTRL           = 0x06,
-    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
-    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
+    NVME_ID_CNS_NS                    = 0x00,
+    NVME_ID_CNS_CTRL                  = 0x01,
+    NVME_ID_CNS_NS_ACTIVE_LIST        = 0x02,
+    NVME_ID_CNS_NS_DESCR_LIST         = 0x03,
+    NVME_ID_CNS_CS_NS                 = 0x05,
+    NVME_ID_CNS_CS_CTRL               = 0x06,
+    NVME_ID_CNS_CS_NS_ACTIVE_LIST     = 0x07,
+    NVME_ID_CNS_NS_PRESENT_LIST       = 0x10,
+    NVME_ID_CNS_NS_PRESENT            = 0x11,
+    NVME_ID_CNS_CS_NS_PRESENT_LIST    = 0x1a,
+    NVME_ID_CNS_CS_NS_PRESENT         = 0x1b,
+    NVME_ID_CNS_IO_COMMAND_SET        = 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (3 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19  9:50   ` Klaus Jensen
                     ` (3 more replies)
  2020-10-19  2:17 ` [PATCH v7 06/11] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
                   ` (6 subsequent siblings)
  11 siblings, 4 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

The emulation code has been changed to advertise NVM Command Set when
"zoned" device property is not set (default) and Zoned Namespace
Command Set otherwise.

Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
Define trace events where needed in newly introduced code.

In order to improve scalability, all open, closed and full zones
are organized in separate linked lists. Consequently, almost all
zone operations don't require scanning of the entire zone array
(which potentially can be quite large) - it is only necessary to
enumerate one or more zone lists.

Handlers for three new NVMe commands introduced in Zoned Namespace
Command Set specification are added, namely for Zone Management
Receive, Zone Management Send and Zone Append.

Device initialization code has been extended to create a proper
configuration for zoned operation using device properties.

Read/Write command handler is modified to only allow writes at the
write pointer if the namespace is zoned. For Zone Append command,
writes implicitly happen at the write pointer and the starting write
pointer value is returned as the result of the command. Write Zeroes
handler is modified to add zoned checks that are identical to those
done as a part of Write flow.

Subsequent commits in this series add ZDE support and checks for
active and open zone limits.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 block/nvme.c          |   2 +-
 hw/block/nvme-ns.c    | 193 +++++++++
 hw/block/nvme-ns.h    |  54 +++
 hw/block/nvme.c       | 975 ++++++++++++++++++++++++++++++++++++++++--
 hw/block/nvme.h       |   9 +
 hw/block/trace-events |  21 +
 include/block/nvme.h  | 113 ++++-
 7 files changed, 1339 insertions(+), 28 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 05485fdd11..7a513c9a17 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -333,7 +333,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
 {
     uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
     if (status) {
-        trace_nvme_error(le32_to_cpu(c->result),
+        trace_nvme_error(le32_to_cpu(c->result32),
                          le16_to_cpu(c->sq_head),
                          le16_to_cpu(c->sq_id),
                          le16_to_cpu(c->cid),
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 974aea33f7..fedfad595c 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -25,6 +25,7 @@
 #include "hw/qdev-properties.h"
 #include "hw/qdev-core.h"
 
+#include "trace.h"
 #include "nvme.h"
 #include "nvme-ns.h"
 
@@ -76,6 +77,171 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     return 0;
 }
 
+static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
+{
+    uint64_t zone_size, zone_cap;
+    uint32_t nz, lbasz = ns->blkconf.logical_block_size;
+
+    if (ns->params.zone_size_bs) {
+        zone_size = ns->params.zone_size_bs;
+    } else {
+        zone_size = NVME_DEFAULT_ZONE_SIZE;
+    }
+    if (ns->params.zone_cap_bs) {
+        zone_cap = ns->params.zone_cap_bs;
+    } else {
+        zone_cap = zone_size;
+    }
+    if (zone_cap > zone_size) {
+        error_setg(errp, "zone capacity %luB exceeds zone size %luB",
+                   zone_cap, zone_size);
+        return -1;
+    }
+    if (zone_size < lbasz) {
+        error_setg(errp, "zone size %luB too small, must be at least %uB",
+                   zone_size, lbasz);
+        return -1;
+    }
+    if (zone_cap < lbasz) {
+        error_setg(errp, "zone capacity %luB too small, must be at least %uB",
+                   zone_cap, lbasz);
+        return -1;
+    }
+    ns->zone_size = zone_size / lbasz;
+    ns->zone_capacity = zone_cap / lbasz;
+
+    nz = DIV_ROUND_UP(ns->size / lbasz, ns->zone_size);
+    ns->num_zones = nz;
+    ns->zone_array_size = sizeof(NvmeZone) * nz;
+    ns->zone_size_log2 = 0;
+    if (is_power_of_2(ns->zone_size)) {
+        ns->zone_size_log2 = 63 - clz64(ns->zone_size);
+    }
+
+    return 0;
+}
+
+static void nvme_init_zone_state(NvmeNamespace *ns)
+{
+    uint64_t start = 0, zone_size = ns->zone_size;
+    uint64_t capacity = ns->num_zones * zone_size;
+    NvmeZone *zone;
+    int i;
+
+    ns->zone_array = g_malloc0(ns->zone_array_size);
+
+    QTAILQ_INIT(&ns->exp_open_zones);
+    QTAILQ_INIT(&ns->imp_open_zones);
+    QTAILQ_INIT(&ns->closed_zones);
+    QTAILQ_INIT(&ns->full_zones);
+
+    zone = ns->zone_array;
+    for (i = 0; i < ns->num_zones; i++, zone++) {
+        if (start + zone_size > capacity) {
+            zone_size = capacity - start;
+        }
+        zone->d.zt = NVME_ZONE_TYPE_SEQ_WRITE;
+        nvme_set_zone_state(zone, NVME_ZONE_STATE_EMPTY);
+        zone->d.za = 0;
+        zone->d.zcap = ns->zone_capacity;
+        zone->d.zslba = start;
+        zone->d.wp = start;
+        zone->w_ptr = start;
+        start += zone_size;
+    }
+}
+
+static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
+                              Error **errp)
+{
+    NvmeIdNsZoned *id_ns_z;
+
+    if (n->params.fill_pattern == 0xff) {
+        ns->id_ns.dlfeat |= 0x02;
+    }
+    if (n->params.fill_pattern != 0x00) {
+        ns->id_ns.dlfeat &= ~0x01;
+    }
+
+    if (nvme_calc_zone_geometry(ns, errp) != 0) {
+        return -1;
+    }
+
+    nvme_init_zone_state(ns);
+
+    id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
+
+    /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
+    id_ns_z->mar = 0xffffffff;
+    id_ns_z->mor = 0xffffffff;
+    id_ns_z->zoc = 0;
+    id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
+
+    id_ns_z->lbafe[lba_index].zsze = cpu_to_le64(ns->zone_size);
+    id_ns_z->lbafe[lba_index].zdes = 0;
+
+    ns->csi = NVME_CSI_ZONED;
+    ns->id_ns.nsze = cpu_to_le64(ns->zone_size * ns->num_zones);
+    ns->id_ns.ncap = cpu_to_le64(ns->zone_capacity * ns->num_zones);
+    ns->id_ns.nuse = ns->id_ns.ncap;
+
+    ns->id_ns_zoned = id_ns_z;
+
+    return 0;
+}
+
+/*
+ * Close or finish all the zones that are currently open.
+ */
+static void nvme_zoned_clear_ns(NvmeNamespace *ns)
+{
+    NvmeZone *zone;
+    uint32_t set_state;
+    int i;
+
+    zone = ns->zone_array;
+    for (i = 0; i < ns->num_zones; i++, zone++) {
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
+            break;
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            QTAILQ_REMOVE(&ns->exp_open_zones, zone, entry);
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            /* fall through */
+        default:
+            continue;
+        }
+
+        if (zone->d.wp == zone->d.zslba) {
+            set_state = NVME_ZONE_STATE_EMPTY;
+        } else {
+            set_state = NVME_ZONE_STATE_CLOSED;
+        }
+
+        switch (set_state) {
+        case NVME_ZONE_STATE_CLOSED:
+            trace_pci_nvme_clear_ns_close(nvme_get_zone_state(zone),
+                                          zone->d.zslba);
+            QTAILQ_INSERT_TAIL(&ns->closed_zones, zone, entry);
+            break;
+        case NVME_ZONE_STATE_EMPTY:
+            trace_pci_nvme_clear_ns_reset(nvme_get_zone_state(zone),
+                                          zone->d.zslba);
+            break;
+        case NVME_ZONE_STATE_FULL:
+            trace_pci_nvme_clear_ns_full(nvme_get_zone_state(zone),
+                                         zone->d.zslba);
+            zone->d.wp = nvme_zone_wr_boundary(zone);
+            QTAILQ_INSERT_TAIL(&ns->full_zones, zone, entry);
+        }
+
+        zone->w_ptr = zone->d.wp;
+        nvme_set_zone_state(zone, set_state);
+    }
+}
+
 static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
 {
     if (!ns->blkconf.blk) {
@@ -97,6 +263,12 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     }
 
     nvme_ns_init(ns);
+    if (ns->params.zoned) {
+        if (nvme_zoned_init_ns(n, ns, 0, errp) != 0) {
+            return -1;
+        }
+    }
+
     if (nvme_register_namespace(n, ns, errp)) {
         return -1;
     }
@@ -114,6 +286,21 @@ void nvme_ns_flush(NvmeNamespace *ns)
     blk_flush(ns->blkconf.blk);
 }
 
+void nvme_ns_clear(NvmeNamespace *ns)
+{
+    if (ns->params.zoned) {
+        nvme_zoned_clear_ns(ns);
+    }
+}
+
+void nvme_ns_cleanup(NvmeNamespace *ns)
+{
+    if (ns->params.zoned) {
+        g_free(ns->id_ns_zoned);
+        g_free(ns->zone_array);
+    }
+}
+
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
     NvmeNamespace *ns = NVME_NS(dev);
@@ -133,6 +320,12 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
     DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),
     DEFINE_PROP_BOOL("attached", NvmeNamespace, params.attached, true),
+    DEFINE_PROP_BOOL("zoned", NvmeNamespace, params.zoned, false),
+    DEFINE_PROP_SIZE("zone_size", NvmeNamespace, params.zone_size_bs,
+                     NVME_DEFAULT_ZONE_SIZE),
+    DEFINE_PROP_SIZE("zone_capacity", NvmeNamespace, params.zone_cap_bs, 0),
+    DEFINE_PROP_BOOL("cross_zone_read", NvmeNamespace,
+                     params.cross_zone_read, false),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index d6b2808b97..170cbb8cdc 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -19,10 +19,21 @@
 #define NVME_NS(obj) \
     OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
 
+typedef struct NvmeZone {
+    NvmeZoneDescr   d;
+    uint64_t        w_ptr;
+    QTAILQ_ENTRY(NvmeZone) entry;
+} NvmeZone;
+
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
     bool     attached;
     QemuUUID uuid;
+
+    bool     zoned;
+    bool     cross_zone_read;
+    uint64_t zone_size_bs;
+    uint64_t zone_cap_bs;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -34,6 +45,18 @@ typedef struct NvmeNamespace {
     const uint32_t *iocs;
     uint8_t      csi;
 
+    NvmeIdNsZoned   *id_ns_zoned;
+    NvmeZone        *zone_array;
+    QTAILQ_HEAD(, NvmeZone) exp_open_zones;
+    QTAILQ_HEAD(, NvmeZone) imp_open_zones;
+    QTAILQ_HEAD(, NvmeZone) closed_zones;
+    QTAILQ_HEAD(, NvmeZone) full_zones;
+    uint32_t        num_zones;
+    uint64_t        zone_size;
+    uint64_t        zone_capacity;
+    uint64_t        zone_array_size;
+    uint32_t        zone_size_log2;
+
     NvmeNamespaceParams params;
 } NvmeNamespace;
 
@@ -71,8 +94,39 @@ static inline size_t nvme_l2b(NvmeNamespace *ns, uint64_t lba)
 
 typedef struct NvmeCtrl NvmeCtrl;
 
+static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
+{
+    return zone->d.zs >> 4;
+}
+
+static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState state)
+{
+    zone->d.zs = state << 4;
+}
+
+static inline uint64_t nvme_zone_rd_boundary(NvmeNamespace *ns, NvmeZone *zone)
+{
+    return zone->d.zslba + ns->zone_size;
+}
+
+static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
+{
+    return zone->d.zslba + zone->d.zcap;
+}
+
+static inline bool nvme_wp_is_valid(NvmeZone *zone)
+{
+    uint8_t st = nvme_get_zone_state(zone);
+
+    return st != NVME_ZONE_STATE_FULL &&
+           st != NVME_ZONE_STATE_READ_ONLY &&
+           st != NVME_ZONE_STATE_OFFLINE;
+}
+
 int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_flush(NvmeNamespace *ns);
+void nvme_ns_clear(NvmeNamespace *ns);
+void nvme_ns_cleanup(NvmeNamespace *ns);
 
 #endif /* NVME_NS_H */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 93728e51b3..34d0d0250d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -133,6 +133,16 @@ static const uint32_t nvme_cse_iocs_nvm[256] = {
     [NVME_CMD_READ]                 = NVME_CMD_EFF_CSUPP,
 };
 
+static const uint32_t nvme_cse_iocs_zoned[256] = {
+    [NVME_CMD_FLUSH]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+    [NVME_CMD_WRITE_ZEROES]         = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+    [NVME_CMD_WRITE]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+    [NVME_CMD_READ]                 = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_ZONE_APPEND]          = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+    [NVME_CMD_ZONE_MGMT_SEND]       = NVME_CMD_EFF_CSUPP,
+    [NVME_CMD_ZONE_MGMT_RECV]       = NVME_CMD_EFF_CSUPP,
+};
+
 static void nvme_process_sq(void *opaque);
 
 static uint16_t nvme_cid(NvmeRequest *req)
@@ -149,6 +159,46 @@ static uint16_t nvme_sqid(NvmeRequest *req)
     return le16_to_cpu(req->sq->sqid);
 }
 
+static void nvme_assign_zone_state(NvmeNamespace *ns, NvmeZone *zone,
+                                   uint8_t state)
+{
+    if (QTAILQ_IN_USE(zone, entry)) {
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            QTAILQ_REMOVE(&ns->exp_open_zones, zone, entry);
+            break;
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            QTAILQ_REMOVE(&ns->closed_zones, zone, entry);
+            break;
+        case NVME_ZONE_STATE_FULL:
+            QTAILQ_REMOVE(&ns->full_zones, zone, entry);
+        }
+    }
+
+    nvme_set_zone_state(zone, state);
+
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        QTAILQ_INSERT_TAIL(&ns->exp_open_zones, zone, entry);
+        break;
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        QTAILQ_INSERT_TAIL(&ns->imp_open_zones, zone, entry);
+        break;
+    case NVME_ZONE_STATE_CLOSED:
+        QTAILQ_INSERT_TAIL(&ns->closed_zones, zone, entry);
+        break;
+    case NVME_ZONE_STATE_FULL:
+        QTAILQ_INSERT_TAIL(&ns->full_zones, zone, entry);
+    case NVME_ZONE_STATE_READ_ONLY:
+        break;
+    default:
+        zone->d.za = 0;
+    }
+}
+
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
     hwaddr low = n->ctrl_mem.addr;
@@ -841,7 +891,7 @@ static void nvme_process_aers(void *opaque)
 
         req = n->aer_reqs[n->outstanding_aers];
 
-        result = (NvmeAerResult *) &req->cqe.result;
+        result = (NvmeAerResult *) &req->cqe.result32;
         result->event_type = event->result.event_type;
         result->event_info = event->result.event_info;
         result->log_page = event->result.log_page;
@@ -910,6 +960,326 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
     return NVME_SUCCESS;
 }
 
+static void nvme_fill_read_data(NvmeRequest *req, uint64_t offset,
+                                uint32_t max_len, uint8_t pattern)
+{
+    QEMUSGList *qsg = &req->qsg;
+    QEMUIOVector *iov = &req->iov;
+    ScatterGatherEntry *entry;
+    uint32_t len, ent_len;
+
+    if (qsg->nsg > 0) {
+        entry = qsg->sg;
+        len = qsg->size;
+        if (max_len) {
+            len = MIN(len, max_len);
+        }
+        for (; len > 0; len -= ent_len) {
+            ent_len = MIN(len, entry->len);
+            if (offset > ent_len) {
+                offset -= ent_len;
+            } else if (offset != 0) {
+                dma_memory_set(qsg->as, entry->base + offset,
+                               pattern, ent_len - offset);
+                offset = 0;
+            } else {
+                dma_memory_set(qsg->as, entry->base, pattern, ent_len);
+            }
+            entry++;
+        }
+    } else if (iov->iov) {
+        len = iov_size(iov->iov, iov->niov);
+        if (max_len) {
+            len = MIN(len, max_len);
+        }
+        qemu_iovec_memset(iov, offset, pattern, len - offset);
+    }
+}
+
+static inline uint32_t nvme_zone_idx(NvmeNamespace *ns, uint64_t slba)
+{
+    return ns->zone_size_log2 > 0 ? slba >> ns->zone_size_log2 :
+                                    slba / ns->zone_size;
+}
+
+static inline NvmeZone *nvme_get_zone_by_slba(NvmeNamespace *ns, uint64_t slba)
+{
+    uint32_t zone_idx = nvme_zone_idx(ns, slba);
+
+    assert(zone_idx < ns->num_zones);
+    return &ns->zone_array[zone_idx];
+}
+
+static uint16_t nvme_zone_state_ok_to_write(NvmeZone *zone)
+{
+    uint16_t status;
+
+    switch (nvme_get_zone_state(zone)) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+        status = NVME_SUCCESS;
+        break;
+    case NVME_ZONE_STATE_FULL:
+        status = NVME_ZONE_FULL;
+        break;
+    case NVME_ZONE_STATE_OFFLINE:
+        status = NVME_ZONE_OFFLINE;
+        break;
+    case NVME_ZONE_STATE_READ_ONLY:
+        status = NVME_ZONE_READ_ONLY;
+        break;
+    default:
+        assert(false);
+    }
+
+    return status;
+}
+
+static uint16_t nvme_check_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
+                                      NvmeZone *zone, uint64_t slba,
+                                      uint32_t nlb, bool append)
+{
+    uint16_t status;
+
+    if (unlikely((slba + nlb) > nvme_zone_wr_boundary(zone))) {
+        status = NVME_ZONE_BOUNDARY_ERROR;
+    } else {
+        status = nvme_zone_state_ok_to_write(zone);
+    }
+
+    if (status != NVME_SUCCESS) {
+        trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
+    } else {
+        assert(nvme_wp_is_valid(zone));
+        if (append) {
+            if (unlikely(slba != zone->d.zslba)) {
+                trace_pci_nvme_err_append_not_at_start(slba, zone->d.zslba);
+                status = NVME_ZONE_INVALID_WRITE;
+            }
+            if (nvme_l2b(ns, nlb) > (n->page_size << n->zasl)) {
+                trace_pci_nvme_err_append_too_large(slba, nlb, n->zasl);
+                status = NVME_INVALID_FIELD;
+            }
+        } else if (unlikely(slba != zone->w_ptr)) {
+            trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
+                                               zone->w_ptr);
+            status = NVME_ZONE_INVALID_WRITE;
+        }
+    }
+
+    return status;
+}
+
+static uint16_t nvme_zone_state_ok_to_read(NvmeZone *zone)
+{
+    uint16_t status;
+
+    switch (nvme_get_zone_state(zone)) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_FULL:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_READ_ONLY:
+        status = NVME_SUCCESS;
+        break;
+    case NVME_ZONE_STATE_OFFLINE:
+        status = NVME_ZONE_OFFLINE | NVME_DNR;
+        break;
+    default:
+        assert(false);
+    }
+
+    return status;
+}
+
+typedef struct NvmeReadFillCtx {
+    uint64_t  pre_rd_fill_slba;
+    uint64_t  read_slba;
+    uint64_t  post_rd_fill_slba;
+
+    uint32_t  pre_rd_fill_nlb;
+    uint32_t  read_nlb;
+    uint32_t  post_rd_fill_nlb;
+} NvmeReadFillCtx;
+
+static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
+                                     uint64_t slba, uint32_t nlb,
+                                     NvmeReadFillCtx *rfc)
+{
+    NvmeZone *next_zone;
+    uint64_t bndry = nvme_zone_rd_boundary(ns, zone);
+    uint64_t end = slba + nlb, wp1, wp2;
+    uint16_t status;
+
+    rfc->pre_rd_fill_slba = ~0ULL;
+    rfc->pre_rd_fill_nlb = 0;
+    rfc->read_slba = slba;
+    rfc->read_nlb = nlb;
+    rfc->post_rd_fill_slba = ~0ULL;
+    rfc->post_rd_fill_nlb = 0;
+
+    status = nvme_zone_state_ok_to_read(zone);
+    if (status != NVME_SUCCESS) {
+        ;
+    } else if (likely(end <= bndry)) {
+        if (end > zone->w_ptr) {
+            wp1 = zone->w_ptr;
+            if (slba >= wp1) {
+                /* No i/o necessary, just fill */
+                rfc->pre_rd_fill_slba = slba;
+                rfc->pre_rd_fill_nlb = nlb;
+                rfc->read_nlb = 0;
+            } else {
+                rfc->read_nlb = wp1 - slba;
+                rfc->post_rd_fill_slba = wp1;
+                rfc->post_rd_fill_nlb = nlb - rfc->read_nlb;
+           }
+        }
+    } else if (!ns->params.cross_zone_read) {
+        status = NVME_ZONE_BOUNDARY_ERROR;
+    } else {
+        /*
+         * Read across zone boundary, look at the next zone.
+         * Earlier bounds checks ensure that the current zone
+         * is not the last one.
+         */
+        next_zone = zone + 1;
+        status = nvme_zone_state_ok_to_read(next_zone);
+        if (status != NVME_SUCCESS) {
+            ;
+        } else if (end > nvme_zone_rd_boundary(ns, next_zone)) {
+            /*
+             * As zone size is much larger than a typical maximum
+             * i/o size in real hardware, allow the i/o range
+             * to span no more than one pair of zones.
+             */
+            status = NVME_ZONE_BOUNDARY_ERROR;
+        } else {
+            wp1 = zone->w_ptr;
+            wp2 = next_zone->w_ptr;
+            if (wp2 == bndry) {
+                if (slba >= wp1) {
+                    /* Again, no i/o necessary, just fill */
+                    rfc->pre_rd_fill_slba = slba;
+                    rfc->pre_rd_fill_nlb = nlb;
+                    rfc->read_nlb = 0;
+                } else {
+                    rfc->read_nlb = wp1 - slba;
+                    rfc->post_rd_fill_slba = wp1;
+                    rfc->post_rd_fill_nlb = nlb - rfc->read_nlb;
+                }
+            } else if (slba < wp1) {
+                if (end > wp2) {
+                    if (wp1 == bndry) {
+                        rfc->post_rd_fill_slba = wp2;
+                        rfc->post_rd_fill_nlb = end - wp2;
+                        rfc->read_nlb = wp2 - slba;
+                    } else {
+                        rfc->pre_rd_fill_slba = wp2;
+                        rfc->pre_rd_fill_nlb = end - wp2;
+                        rfc->read_nlb = wp2 - slba;
+                        rfc->post_rd_fill_slba = wp1;
+                        rfc->post_rd_fill_nlb = bndry - wp1;
+                    }
+                } else {
+                    rfc->post_rd_fill_slba = wp1;
+                    rfc->post_rd_fill_nlb = bndry - wp1;
+                }
+            } else {
+                if (end > wp2) {
+                    rfc->pre_rd_fill_slba = slba;
+                    rfc->pre_rd_fill_nlb = end - slba;
+                    rfc->read_slba = bndry;
+                    rfc->read_nlb = wp2 - bndry;
+                } else {
+                    rfc->read_slba = bndry;
+                    rfc->read_nlb = end - bndry;
+                    rfc->post_rd_fill_slba = slba;
+                    rfc->post_rd_fill_nlb = bndry - slba;
+                }
+            }
+        }
+    }
+
+    return status;
+}
+
+static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
+                                      bool failed)
+{
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    NvmeZone *zone;
+    uint64_t slba, start_wp = req->cqe.result64;
+    uint32_t nlb;
+
+    if (rw->opcode != NVME_CMD_WRITE &&
+        rw->opcode != NVME_CMD_ZONE_APPEND &&
+        rw->opcode != NVME_CMD_WRITE_ZEROES) {
+        return false;
+    }
+
+    slba = le64_to_cpu(rw->slba);
+    nlb = le16_to_cpu(rw->nlb) + 1;
+    zone = nvme_get_zone_by_slba(ns, slba);
+
+    if (!failed && zone->w_ptr < start_wp + nlb) {
+        /*
+         * A preceding queued write to the zone has failed,
+         * now this write is not at the WP, fail it too.
+         */
+        failed = true;
+    }
+
+    if (failed) {
+        if (zone->w_ptr > start_wp) {
+            zone->w_ptr = start_wp;
+            zone->d.wp = start_wp;
+        }
+        req->cqe.result64 = 0;
+    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        case NVME_ZONE_STATE_CLOSED:
+        case NVME_ZONE_STATE_EMPTY:
+            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
+            /* fall through */
+        case NVME_ZONE_STATE_FULL:
+            break;
+        default:
+            assert(false);
+        }
+        zone->d.wp = zone->w_ptr;
+    } else {
+        zone->d.wp += nlb;
+    }
+
+    return failed;
+}
+
+static uint64_t nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
+                                     uint32_t nlb)
+{
+    uint64_t result = zone->w_ptr;
+    uint8_t zs;
+
+    zone->w_ptr += nlb;
+
+    if (zone->w_ptr < nvme_zone_wr_boundary(zone)) {
+        zs = nvme_get_zone_state(zone);
+        switch (zs) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_IMPLICITLY_OPEN);
+        }
+    }
+
+    return result;
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
@@ -924,10 +1294,27 @@ static void nvme_rw_cb(void *opaque, int ret)
     trace_pci_nvme_rw_cb(nvme_cid(req), blk_name(blk));
 
     if (!ret) {
-        block_acct_done(stats, acct);
+        if (ns->params.zoned) {
+            if (nvme_finalize_zoned_write(ns, req, false)) {
+                ret = EIO;
+                block_acct_failed(stats, acct);
+                req->status = NVME_ZONE_INVALID_WRITE;
+            } else if (req->fill_len) {
+                nvme_fill_read_data(req, req->fill_ofs, req->fill_len,
+                                    nvme_ctrl(req)->params.fill_pattern);
+                req->fill_len = 0;
+            }
+        }
+        if (!ret) {
+            block_acct_done(stats, acct);
+        }
     } else {
         uint16_t status;
 
+        if (ns->params.zoned) {
+            nvme_finalize_zoned_write(ns, req, true);
+        }
+
         block_acct_failed(stats, acct);
 
         switch (req->cmd.opcode) {
@@ -969,8 +1356,10 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
     NvmeNamespace *ns = req->ns;
     uint64_t slba = le64_to_cpu(rw->slba);
     uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
+    NvmeZone *zone;
     uint64_t offset = nvme_l2b(ns, slba);
     uint32_t count = nvme_l2b(ns, nlb);
+    BlockBackend *blk = ns->blkconf.blk;
     uint16_t status;
 
     trace_pci_nvme_write_zeroes(nvme_cid(req), nvme_nsid(ns), slba, nlb);
@@ -981,24 +1370,41 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
         return status;
     }
 
-    block_acct_start(blk_get_stats(req->ns->blkconf.blk), &req->acct, 0,
-                     BLOCK_ACCT_WRITE);
-    req->aiocb = blk_aio_pwrite_zeroes(req->ns->blkconf.blk, offset, count,
+    if (ns->params.zoned) {
+        zone = nvme_get_zone_by_slba(ns, slba);
+
+        status = nvme_check_zone_write(n, ns, zone, slba, nlb, false);
+        if (status != NVME_SUCCESS) {
+            goto invalid;
+        }
+
+        req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
+    }
+
+    block_acct_start(blk_get_stats(blk), &req->acct, 0, BLOCK_ACCT_WRITE);
+    req->aiocb = blk_aio_pwrite_zeroes(blk, offset, count,
                                        BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
     return NVME_NO_COMPLETE;
+
+invalid:
+    block_acct_invalid(blk_get_stats(blk), BLOCK_ACCT_WRITE);
+    return status | NVME_DNR;
 }
 
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
     uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
     uint64_t slba = le64_to_cpu(rw->slba);
-
     uint64_t data_size = nvme_l2b(ns, nlb);
-    uint64_t data_offset = nvme_l2b(ns, slba);
-    enum BlockAcctType acct = req->cmd.opcode == NVME_CMD_WRITE ?
-        BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
+    uint64_t data_offset, fill_ofs;
+
+    NvmeZone *zone;
+    uint32_t fill_len;
+    NvmeReadFillCtx rfc;
+    bool is_write = rw->opcode == NVME_CMD_WRITE || append;
+    enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
     BlockBackend *blk = ns->blkconf.blk;
     uint16_t status;
 
@@ -1017,14 +1423,71 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
         goto invalid;
     }
 
+    if (ns->params.zoned) {
+        zone = nvme_get_zone_by_slba(ns, slba);
+
+        if (is_write) {
+            status = nvme_check_zone_write(n, ns, zone, slba, nlb, append);
+            if (status != NVME_SUCCESS) {
+                goto invalid;
+            }
+
+            if (append) {
+                slba = zone->w_ptr;
+            }
+
+            req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
+        } else {
+            status = nvme_check_zone_read(ns, zone, slba, nlb, &rfc);
+            if (status != NVME_SUCCESS) {
+                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
+                goto invalid;
+            }
+        }
+    } else if (append) {
+        trace_pci_nvme_err_invalid_opc(rw->opcode);
+        status = NVME_INVALID_OPCODE;
+        goto invalid;
+    }
+
+    data_offset = nvme_l2b(ns, slba);
+
     status = nvme_map_dptr(n, data_size, req);
     if (status) {
         goto invalid;
     }
 
+    if (ns->params.zoned) {
+        if (is_write) {
+            req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
+        } else {
+            if (rfc.pre_rd_fill_nlb) {
+                fill_ofs = nvme_l2b(ns, rfc.pre_rd_fill_slba - slba);
+                fill_len = nvme_l2b(ns, rfc.pre_rd_fill_nlb);
+                nvme_fill_read_data(req, fill_ofs, fill_len,
+                                    n->params.fill_pattern);
+            }
+            if (!rfc.read_nlb) {
+                /* No backend I/O necessary, only needed to fill the buffer */
+                req->status = NVME_SUCCESS;
+                return NVME_SUCCESS;
+            }
+            if (rfc.post_rd_fill_nlb) {
+                req->fill_ofs = nvme_l2b(ns, rfc.post_rd_fill_slba - slba);
+                req->fill_len = nvme_l2b(ns, rfc.post_rd_fill_nlb);
+            } else {
+                req->fill_len = 0;
+            }
+            slba = rfc.read_slba;
+            data_size = nvme_l2b(ns, rfc.read_nlb);
+        }
+    }
+
+    data_offset = nvme_l2b(ns, slba);
+
     block_acct_start(blk_get_stats(blk), &req->acct, data_size, acct);
     if (req->qsg.sg) {
-        if (acct == BLOCK_ACCT_WRITE) {
+        if (is_write) {
             req->aiocb = dma_blk_write(blk, &req->qsg, data_offset,
                                        BDRV_SECTOR_SIZE, nvme_rw_cb, req);
         } else {
@@ -1032,7 +1495,7 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
                                       BDRV_SECTOR_SIZE, nvme_rw_cb, req);
         }
     } else {
-        if (acct == BLOCK_ACCT_WRITE) {
+        if (is_write) {
             req->aiocb = blk_aio_pwritev(blk, data_offset, &req->iov, 0,
                                          nvme_rw_cb, req);
         } else {
@@ -1043,10 +1506,383 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
     return NVME_NO_COMPLETE;
 
 invalid:
-    block_acct_invalid(blk_get_stats(ns->blkconf.blk), acct);
+    block_acct_invalid(blk_get_stats(blk), acct);
+    return status | NVME_DNR;
+}
+
+static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeNamespace *ns, NvmeCmd *c,
+                                            uint64_t *slba, uint32_t *zone_idx)
+{
+    uint32_t dw10 = le32_to_cpu(c->cdw10);
+    uint32_t dw11 = le32_to_cpu(c->cdw11);
+
+    if (!ns->params.zoned) {
+        trace_pci_nvme_err_invalid_opc(c->opcode);
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
+    *slba = ((uint64_t)dw11) << 32 | dw10;
+    if (unlikely(*slba >= ns->id_ns.nsze)) {
+        trace_pci_nvme_err_invalid_lba_range(*slba, 0, ns->id_ns.nsze);
+        *slba = 0;
+        return NVME_LBA_RANGE | NVME_DNR;
+    }
+
+    *zone_idx = nvme_zone_idx(ns, *slba);
+    assert(*zone_idx < ns->num_zones);
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_open_zone(NvmeNamespace *ns, NvmeZone *zone,
+                               uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
+        /* fall through */
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_open_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_CLOSED;
+}
+
+static uint16_t nvme_close_zone(NvmeNamespace *ns, NvmeZone *zone,
+                                uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
+        /* fall through */
+    case NVME_ZONE_STATE_CLOSED:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_close_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN;
+}
+
+static uint16_t nvme_finish_zone(NvmeNamespace *ns, NvmeZone *zone,
+                                 uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_EMPTY:
+        zone->w_ptr = nvme_zone_wr_boundary(zone);
+        zone->d.wp = zone->w_ptr;
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
+        /* fall through */
+    case NVME_ZONE_STATE_FULL:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_finish_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_CLOSED;
+}
+
+static uint16_t nvme_reset_zone(NvmeNamespace *ns, NvmeZone *zone,
+                                uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_FULL:
+        zone->w_ptr = zone->d.zslba;
+        zone->d.wp = zone->w_ptr;
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_EMPTY);
+        /* fall through */
+    case NVME_ZONE_STATE_EMPTY:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_reset_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_CLOSED ||
+           state == NVME_ZONE_STATE_FULL;
+}
+
+static uint16_t nvme_offline_zone(NvmeNamespace *ns, NvmeZone *zone,
+                                  uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_READ_ONLY:
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_OFFLINE);
+        /* fall through */
+    case NVME_ZONE_STATE_OFFLINE:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_offline_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_READ_ONLY;
+}
+
+typedef uint16_t (*op_handler_t)(NvmeNamespace *, NvmeZone *,
+                                 uint8_t);
+typedef bool (*need_to_proc_zone_t)(uint8_t);
+
+static uint16_t name_do_zone_op(NvmeNamespace *ns, NvmeZone *zone,
+                                uint8_t state, bool all,
+                                op_handler_t op_hndlr,
+                                need_to_proc_zone_t proc_zone)
+{
+    int i;
+    uint16_t status = 0;
+
+    if (!all) {
+        status = op_hndlr(ns, zone, state);
+    } else {
+        for (i = 0; i < ns->num_zones; i++, zone++) {
+            state = nvme_get_zone_state(zone);
+            if (proc_zone(state)) {
+                status = op_hndlr(ns, zone, state);
+                if (status != NVME_SUCCESS) {
+                    break;
+                }
+            }
+        }
+    }
+
     return status;
 }
 
+static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
+    NvmeNamespace *ns = req->ns;
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint64_t slba = 0;
+    uint32_t zone_idx = 0;
+    uint16_t status;
+    uint8_t action, state;
+    bool all;
+    NvmeZone *zone;
+
+    action = dw13 & 0xff;
+    all = dw13 & 0x100;
+
+    req->status = NVME_SUCCESS;
+
+    if (!all) {
+        status = nvme_get_mgmt_zone_slba_idx(ns, cmd, &slba, &zone_idx);
+        if (status) {
+            return status;
+        }
+    }
+
+    zone = &ns->zone_array[zone_idx];
+    if (slba != zone->d.zslba) {
+        trace_pci_nvme_err_unaligned_zone_cmd(action, slba, zone->d.zslba);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+    state = nvme_get_zone_state(zone);
+
+    switch (action) {
+
+    case NVME_ZONE_ACTION_OPEN:
+        trace_pci_nvme_open_zone(slba, zone_idx, all);
+        status = name_do_zone_op(ns, zone, state, all,
+                                 nvme_open_zone, nvme_cond_open_all);
+        break;
+
+    case NVME_ZONE_ACTION_CLOSE:
+        trace_pci_nvme_close_zone(slba, zone_idx, all);
+        status = name_do_zone_op(ns, zone, state, all,
+                                 nvme_close_zone, nvme_cond_close_all);
+        break;
+
+    case NVME_ZONE_ACTION_FINISH:
+        trace_pci_nvme_finish_zone(slba, zone_idx, all);
+        status = name_do_zone_op(ns, zone, state, all,
+                                 nvme_finish_zone, nvme_cond_finish_all);
+        break;
+
+    case NVME_ZONE_ACTION_RESET:
+        trace_pci_nvme_reset_zone(slba, zone_idx, all);
+        status = name_do_zone_op(ns, zone, state, all,
+                                 nvme_reset_zone, nvme_cond_reset_all);
+        break;
+
+    case NVME_ZONE_ACTION_OFFLINE:
+        trace_pci_nvme_offline_zone(slba, zone_idx, all);
+        status = name_do_zone_op(ns, zone, state, all,
+                                 nvme_offline_zone, nvme_cond_offline_all);
+        break;
+
+    case NVME_ZONE_ACTION_SET_ZD_EXT:
+        trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
+        return NVME_INVALID_FIELD | NVME_DNR;
+        break;
+
+    default:
+        trace_pci_nvme_err_invalid_mgmt_action(action);
+        status = NVME_INVALID_FIELD;
+    }
+
+    if (status == NVME_ZONE_INVAL_TRANSITION) {
+        trace_pci_nvme_err_invalid_zone_state_transition(state, action, slba,
+                                                         zone->d.za);
+    }
+    if (status) {
+        status |= NVME_DNR;
+    }
+
+    return status;
+}
+
+static bool nvme_zone_matches_filter(uint32_t zafs, NvmeZone *zl)
+{
+    int zs = nvme_get_zone_state(zl);
+
+    switch (zafs) {
+    case NVME_ZONE_REPORT_ALL:
+        return true;
+    case NVME_ZONE_REPORT_EMPTY:
+        return zs == NVME_ZONE_STATE_EMPTY;
+    case NVME_ZONE_REPORT_IMPLICITLY_OPEN:
+        return zs == NVME_ZONE_STATE_IMPLICITLY_OPEN;
+    case NVME_ZONE_REPORT_EXPLICITLY_OPEN:
+        return zs == NVME_ZONE_STATE_EXPLICITLY_OPEN;
+    case NVME_ZONE_REPORT_CLOSED:
+        return zs == NVME_ZONE_STATE_CLOSED;
+    case NVME_ZONE_REPORT_FULL:
+        return zs == NVME_ZONE_STATE_FULL;
+    case NVME_ZONE_REPORT_READ_ONLY:
+        return zs == NVME_ZONE_STATE_READ_ONLY;
+    case NVME_ZONE_REPORT_OFFLINE:
+        return zs == NVME_ZONE_STATE_OFFLINE;
+    default:
+        return false;
+    }
+}
+
+static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeCmd *cmd = (NvmeCmd *)&req->cmd;
+    NvmeNamespace *ns = req->ns;
+    /* cdw12 is zero-based number of dwords to return. Convert to bytes */
+    uint32_t len = (le32_to_cpu(cmd->cdw12) + 1) << 2;
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint32_t zone_idx, zra, zrasf, partial;
+    uint64_t max_zones, nr_zones = 0;
+    uint16_t ret;
+    uint64_t slba;
+    NvmeZoneDescr *z;
+    NvmeZone *zs;
+    NvmeZoneReportHeader *header;
+    void *buf, *buf_p;
+    size_t zone_entry_sz;
+
+    req->status = NVME_SUCCESS;
+
+    ret = nvme_get_mgmt_zone_slba_idx(ns, cmd, &slba, &zone_idx);
+    if (ret) {
+        return ret;
+    }
+
+    if (len < sizeof(NvmeZoneReportHeader)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zra = dw13 & 0xff;
+    if (!(zra == NVME_ZONE_REPORT || zra == NVME_ZONE_REPORT_EXTENDED)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zrasf = (dw13 >> 8) & 0xff;
+    if (zrasf > NVME_ZONE_REPORT_OFFLINE) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    partial = (dw13 >> 16) & 0x01;
+
+    zone_entry_sz = sizeof(NvmeZoneDescr);
+
+    max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
+    buf = g_malloc0(len);
+
+    header = (NvmeZoneReportHeader *)buf;
+    buf_p = buf + sizeof(NvmeZoneReportHeader);
+
+    while (zone_idx < ns->num_zones && nr_zones < max_zones) {
+        zs = &ns->zone_array[zone_idx];
+
+        if (!nvme_zone_matches_filter(zrasf, zs)) {
+            zone_idx++;
+            continue;
+        }
+
+        z = (NvmeZoneDescr *)buf_p;
+        buf_p += sizeof(NvmeZoneDescr);
+        nr_zones++;
+
+        z->zt = zs->d.zt;
+        z->zs = zs->d.zs;
+        z->zcap = cpu_to_le64(zs->d.zcap);
+        z->zslba = cpu_to_le64(zs->d.zslba);
+        z->za = zs->d.za;
+
+        if (nvme_wp_is_valid(zs)) {
+            z->wp = cpu_to_le64(zs->d.wp);
+        } else {
+            z->wp = cpu_to_le64(~0ULL);
+        }
+
+        zone_idx++;
+    }
+
+    if (!partial) {
+        for (; zone_idx < ns->num_zones; zone_idx++) {
+            zs = &ns->zone_array[zone_idx];
+            if (nvme_zone_matches_filter(zrasf, zs)) {
+                nr_zones++;
+            }
+        }
+    }
+    header->nr_zones = cpu_to_le64(nr_zones);
+
+    ret = nvme_dma(n, (uint8_t *)buf, len, DMA_DIRECTION_FROM_DEVICE, req);
+
+    g_free(buf);
+
+    return ret;
+}
+
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
     uint32_t nsid = le32_to_cpu(req->cmd.nsid);
@@ -1076,9 +1912,15 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
         return nvme_flush(n, req);
     case NVME_CMD_WRITE_ZEROES:
         return nvme_write_zeroes(n, req);
+    case NVME_CMD_ZONE_APPEND:
+        return nvme_rw(n, req, true);
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
-        return nvme_rw(n, req);
+        return nvme_rw(n, req, false);
+    case NVME_CMD_ZONE_MGMT_SEND:
+        return nvme_zone_mgmt_send(n, req);
+    case NVME_CMD_ZONE_MGMT_RECV:
+        return nvme_zone_mgmt_recv(n, req);
     default:
         assert(false);
     }
@@ -1320,7 +2162,7 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
+static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t csi, uint32_t buf_len,
                                  uint64_t off, NvmeRequest *req)
 {
     NvmeEffectsLog log = {};
@@ -1339,6 +2181,15 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
         src_iocs = nvme_cse_iocs_nvm;
     case NVME_CC_CSS_ADMIN_ONLY:
         break;
+    case NVME_CC_CSS_CSI:
+        switch (csi) {
+        case NVME_CSI_NVM:
+            src_iocs = nvme_cse_iocs_nvm;
+            break;
+        case NVME_CSI_ZONED:
+            src_iocs = nvme_cse_iocs_zoned;
+            break;
+        }
     }
 
     memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));
@@ -1364,6 +2215,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
     uint8_t  lid = dw10 & 0xff;
     uint8_t  lsp = (dw10 >> 8) & 0xf;
     uint8_t  rae = (dw10 >> 15) & 0x1;
+    uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
     uint32_t numdl, numdu;
     uint64_t off, lpol, lpou;
     size_t   len;
@@ -1397,7 +2249,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
     case NVME_LOG_FW_SLOT_INFO:
         return nvme_fw_log_info(n, len, off, req);
     case NVME_LOG_CMD_EFFECTS:
-        return nvme_cmd_effects(n, len, off, req);
+        return nvme_cmd_effects(n, csi, len, off, req);
     default:
         trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1517,6 +2369,16 @@ static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
     return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static inline bool nvme_csi_has_nvm_support(NvmeNamespace *ns)
+{
+    switch (ns->csi) {
+    case NVME_CSI_NVM:
+    case NVME_CSI_ZONED:
+        return true;
+    }
+    return false;
+}
+
 static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_pci_nvme_identify_ctrl();
@@ -1528,11 +2390,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+    NvmeIdCtrlZoned id = {};
 
     trace_pci_nvme_identify_ctrl_csi(c->csi);
 
     if (c->csi == NVME_CSI_NVM) {
         return nvme_rpt_empty_id_struct(n, req);
+    } else if (c->csi == NVME_CSI_ZONED) {
+        id.zasl = n->zasl;
+        return nvme_dma(n, (uint8_t *)&id, sizeof(id),
+                        DMA_DIRECTION_FROM_DEVICE, req);
     }
 
     return NVME_INVALID_FIELD | NVME_DNR;
@@ -1560,8 +2427,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
         return nvme_rpt_empty_id_struct(n, req);
     }
 
-    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
-                    DMA_DIRECTION_FROM_DEVICE, req);
+    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
+        return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
+                        DMA_DIRECTION_FROM_DEVICE, req);
+    }
+
+    return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
@@ -1586,8 +2457,11 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
         return nvme_rpt_empty_id_struct(n, req);
     }
 
-    if (c->csi == NVME_CSI_NVM) {
+    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
         return nvme_rpt_empty_id_struct(n, req);
+    } else if (c->csi == NVME_CSI_ZONED && ns->csi == NVME_CSI_ZONED) {
+        return nvme_dma(n, (uint8_t *)ns->id_ns_zoned, sizeof(NvmeIdNsZoned),
+                        DMA_DIRECTION_FROM_DEVICE, req);
     }
 
     return NVME_INVALID_FIELD | NVME_DNR;
@@ -1649,7 +2523,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
 
     trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
 
-    if (c->csi != NVME_CSI_NVM) {
+    if (c->csi != NVME_CSI_NVM && c->csi != NVME_CSI_ZONED) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1658,7 +2532,7 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
         if (!ns) {
             continue;
         }
-        if (ns->params.nsid < min_nsid) {
+        if (ns->params.nsid < min_nsid || c->csi != ns->csi) {
             continue;
         }
         if (only_active && !ns->params.attached) {
@@ -1728,6 +2602,8 @@ static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
     trace_pci_nvme_identify_cmd_set();
 
     NVME_SET_CSI(*list, NVME_CSI_NVM);
+    NVME_SET_CSI(*list, NVME_CSI_ZONED);
+
     return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
@@ -1770,7 +2646,7 @@ static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
 {
     uint16_t sqid = le32_to_cpu(req->cmd.cdw10) & 0xffff;
 
-    req->cqe.result = 1;
+    req->cqe.result32 = 1;
     if (nvme_check_sqid(n, sqid)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
@@ -1955,7 +2831,7 @@ defaults:
     }
 
 out:
-    req->cqe.result = cpu_to_le32(result);
+    req->cqe.result32 = cpu_to_le32(result);
     return NVME_SUCCESS;
 }
 
@@ -2086,8 +2962,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
                                     ((dw11 >> 16) & 0xFFFF) + 1,
                                     n->params.max_ioqpairs,
                                     n->params.max_ioqpairs);
-        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-                                      ((n->params.max_ioqpairs - 1) << 16));
+        req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
+                                        ((n->params.max_ioqpairs - 1) << 16));
         break;
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
@@ -2242,6 +3118,15 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
         nvme_ns_flush(ns);
     }
 
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+
+        nvme_ns_clear(ns);
+    }
+
     n->bar.cc = 0;
 }
 
@@ -2262,6 +3147,13 @@ static void nvme_select_ns_iocs(NvmeCtrl *n)
                 ns->iocs = nvme_cse_iocs_nvm;
             }
             break;
+        case NVME_CSI_ZONED:
+            if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) {
+                ns->iocs = nvme_cse_iocs_zoned;
+            } else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) {
+                ns->iocs = nvme_cse_iocs_nvm;
+            }
+            break;
         }
     }
 }
@@ -2360,6 +3252,17 @@ static int nvme_start_ctrl(NvmeCtrl *n)
     nvme_init_sq(&n->admin_sq, n, n->bar.asq, 0, 0,
                  NVME_AQA_ASQS(n->bar.aqa) + 1);
 
+    if (!n->params.zasl_bs) {
+        n->zasl = n->params.mdts;
+    } else {
+        if (n->params.zasl_bs < n->page_size) {
+            trace_pci_nvme_err_startfail_zasl_too_small(n->params.zasl_bs,
+                                                        n->page_size);
+            return -1;
+        }
+        n->zasl = 31 - clz32(n->params.zasl_bs / n->page_size);
+    }
+
     nvme_set_timestamp(n, 0ULL);
 
     QTAILQ_INIT(&n->aer_queue);
@@ -2784,6 +3687,13 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 
         host_memory_backend_set_mapped(n->pmrdev, true);
     }
+
+    if (n->params.zasl_bs) {
+        if (!is_power_of_2(n->params.zasl_bs)) {
+            error_setg(errp, "zone append size limit has to be a power of 2");
+            return;
+        }
+    }
 }
 
 static void nvme_init_state(NvmeCtrl *n)
@@ -3049,9 +3959,21 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 static void nvme_exit(PCIDevice *pci_dev)
 {
     NvmeCtrl *n = NVME(pci_dev);
+    NvmeNamespace *ns;
+    int i;
 
     nvme_clear_ctrl(n);
+
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+
+        nvme_ns_cleanup(ns);
+    }
     g_free(n->namespaces);
+
     g_free(n->cq);
     g_free(n->sq);
     g_free(n->aer_reqs);
@@ -3079,6 +4001,9 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64),
     DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
     DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
+    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
+    DEFINE_PROP_SIZE32("zone_append_size_limit", NvmeCtrl, params.zasl_bs,
+                       NVME_DEFAULT_MAX_ZA_SIZE),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a..c406cb1c65 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -6,6 +6,9 @@
 
 #define NVME_MAX_NAMESPACES 256
 
+#define NVME_DEFAULT_ZONE_SIZE   (128 * MiB)
+#define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
+
 typedef struct NvmeParams {
     char     *serial;
     uint32_t num_queues; /* deprecated since 5.1 */
@@ -16,6 +19,8 @@ typedef struct NvmeParams {
     uint32_t aer_max_queued;
     uint8_t  mdts;
     bool     use_intel_id;
+    uint8_t  fill_pattern;
+    uint32_t zasl_bs;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -28,6 +33,8 @@ typedef struct NvmeRequest {
     struct NvmeNamespace    *ns;
     BlockAIOCB              *aiocb;
     uint16_t                status;
+    uint64_t                fill_ofs;
+    uint32_t                fill_len;
     NvmeCqe                 cqe;
     NvmeCmd                 cmd;
     BlockAcctCookie         acct;
@@ -147,6 +154,8 @@ typedef struct NvmeCtrl {
     QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
     int         aer_queued;
 
+    uint8_t     zasl;
+
     NvmeNamespace   namespace;
     NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES];
     NvmeSQueue      **sq;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 65b964c894..af53e31fcb 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -90,6 +90,15 @@ pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
+pci_nvme_open_zone(uint64_t slba, uint32_t zone_idx, int all) "open zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_close_zone(uint64_t slba, uint32_t zone_idx, int all) "close zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_finish_zone(uint64_t slba, uint32_t zone_idx, int all) "finish zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_reset_zone(uint64_t slba, uint32_t zone_idx, int all) "reset zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_offline_zone(uint64_t slba, uint32_t zone_idx, int all) "offline zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_set_descriptor_extension(uint64_t slba, uint32_t zone_idx) "set zone descriptor extension, slba=%"PRIu64", idx=%"PRIu32""
+pci_nvme_clear_ns_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
+pci_nvme_clear_ns_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
+pci_nvme_clear_ns_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Full state"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
@@ -109,8 +118,18 @@ pci_nvme_err_invalid_prp(void) "invalid PRP"
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_unaligned_zone_cmd(uint8_t action, uint64_t slba, uint64_t zslba) "unaligned zone op 0x%"PRIx32", got slba=%"PRIu64", zslba=%"PRIu64""
+pci_nvme_err_invalid_zone_state_transition(uint8_t state, uint8_t action, uint64_t slba, uint8_t attrs) "0x%"PRIx32"->0x%"PRIx32", slba=%"PRIu64", attrs=0x%"PRIx32""
+pci_nvme_err_write_not_at_wp(uint64_t slba, uint64_t zone, uint64_t wp) "writing at slba=%"PRIu64", zone=%"PRIu64", but wp=%"PRIu64""
+pci_nvme_err_append_not_at_start(uint64_t slba, uint64_t zone) "appending at slba=%"PRIu64", but zone=%"PRIu64""
+pci_nvme_err_zone_write_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx16""
+pci_nvme_err_zone_read_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx16""
+pci_nvme_err_append_too_large(uint64_t slba, uint32_t nlb, uint8_t zasl) "slba=%"PRIu64", nlb=%"PRIu32", zasl=%"PRIu8""
+pci_nvme_err_insuff_active_res(uint32_t max_active) "max_active=%"PRIu32" zone limit exceeded"
+pci_nvme_err_insuff_open_res(uint32_t max_open) "max_open=%"PRIu32" zone limit exceeded"
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
+pci_nvme_err_only_zoned_cmd_set_avail(void) "setting 001b CC.CSS, but only ZONED+NVM command set is enabled"
 pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
@@ -144,7 +163,9 @@ pci_nvme_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_
 pci_nvme_err_startfail_css(uint8_t css) "nvme_start_ctrl failed because invalid command set selected:%u"
 pci_nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
 pci_nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
+pci_nvme_err_startfail_zasl_too_small(uint32_t zasl, uint32_t pagesz) "nvme_start_ctrl failed because zone append size limit %"PRIu32" is too small, needs to be >= %"PRIu32""
 pci_nvme_err_startfail(void) "setting controller enable bit failed"
+pci_nvme_err_invalid_mgmt_action(int action) "action=0x%"PRIx8""
 
 # Traces for undefined behavior
 pci_nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 27125c9d28..54bc93b6ab 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -489,6 +489,9 @@ enum NvmeIoCommands {
     NVME_CMD_COMPARE            = 0x05,
     NVME_CMD_WRITE_ZEROES       = 0x08,
     NVME_CMD_DSM                = 0x09,
+    NVME_CMD_ZONE_MGMT_SEND     = 0x79,
+    NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
+    NVME_CMD_ZONE_APPEND        = 0x7d,
 };
 
 typedef struct QEMU_PACKED NvmeDeleteQ {
@@ -649,8 +652,10 @@ typedef struct QEMU_PACKED NvmeAerResult {
 } NvmeAerResult;
 
 typedef struct QEMU_PACKED NvmeCqe {
-    uint32_t    result;
-    uint32_t    rsvd;
+    union {
+        uint64_t     result64;
+        uint32_t     result32;
+    };
     uint16_t    sq_head;
     uint16_t    sq_id;
     uint16_t    cid;
@@ -678,6 +683,7 @@ enum NvmeStatusCodes {
     NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
     NVME_CMD_SET_CMB_REJECTED   = 0x002b,
+    NVME_INVALID_CMD_SET        = 0x002c,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
@@ -702,6 +708,14 @@ enum NvmeStatusCodes {
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
+    NVME_ZONE_BOUNDARY_ERROR    = 0x01b8,
+    NVME_ZONE_FULL              = 0x01b9,
+    NVME_ZONE_READ_ONLY         = 0x01ba,
+    NVME_ZONE_OFFLINE           = 0x01bb,
+    NVME_ZONE_INVALID_WRITE     = 0x01bc,
+    NVME_ZONE_TOO_MANY_ACTIVE   = 0x01bd,
+    NVME_ZONE_TOO_MANY_OPEN     = 0x01be,
+    NVME_ZONE_INVAL_TRANSITION  = 0x01bf,
     NVME_WRITE_FAULT            = 0x0280,
     NVME_UNRECOVERED_READ       = 0x0281,
     NVME_E2E_GUARD_ERROR        = 0x0282,
@@ -886,6 +900,11 @@ typedef struct QEMU_PACKED NvmeIdCtrl {
     uint8_t     vs[1024];
 } NvmeIdCtrl;
 
+typedef struct NvmeIdCtrlZoned {
+    uint8_t     zasl;
+    uint8_t     rsvd1[4095];
+} NvmeIdCtrlZoned;
+
 enum NvmeIdCtrlOacs {
     NVME_OACS_SECURITY  = 1 << 0,
     NVME_OACS_FORMAT    = 1 << 1,
@@ -1011,6 +1030,12 @@ typedef struct QEMU_PACKED NvmeLBAF {
     uint8_t     rp;
 } NvmeLBAF;
 
+typedef struct QEMU_PACKED NvmeLBAFE {
+    uint64_t    zsze;
+    uint8_t     zdes;
+    uint8_t     rsvd9[7];
+} NvmeLBAFE;
+
 #define NVME_NSID_BROADCAST 0xffffffff
 
 typedef struct QEMU_PACKED NvmeIdNs {
@@ -1065,10 +1090,24 @@ enum NvmeNsIdentifierType {
 
 enum NvmeCsi {
     NVME_CSI_NVM                = 0x00,
+    NVME_CSI_ZONED              = 0x02,
 };
 
 #define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
 
+typedef struct QEMU_PACKED NvmeIdNsZoned {
+    uint16_t    zoc;
+    uint16_t    ozcs;
+    uint32_t    mar;
+    uint32_t    mor;
+    uint32_t    rrl;
+    uint32_t    frl;
+    uint8_t     rsvd20[2796];
+    NvmeLBAFE   lbafe[16];
+    uint8_t     rsvd3072[768];
+    uint8_t     vs[256];
+} NvmeIdNsZoned;
+
 /*Deallocate Logical Block Features*/
 #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
 #define NVME_ID_NS_DLFEAT_WRITE_ZEROES(dlfeat)    ((dlfeat) & 0x08)
@@ -1100,6 +1139,71 @@ enum NvmeIdNsDps {
     DPS_FIRST_EIGHT = 8,
 };
 
+enum NvmeZoneAttr {
+    NVME_ZA_FINISHED_BY_CTLR         = 1 << 0,
+    NVME_ZA_FINISH_RECOMMENDED       = 1 << 1,
+    NVME_ZA_RESET_RECOMMENDED        = 1 << 2,
+    NVME_ZA_ZD_EXT_VALID             = 1 << 7,
+};
+
+typedef struct QEMU_PACKED NvmeZoneReportHeader {
+    uint64_t    nr_zones;
+    uint8_t     rsvd[56];
+} NvmeZoneReportHeader;
+
+enum NvmeZoneReceiveAction {
+    NVME_ZONE_REPORT                 = 0,
+    NVME_ZONE_REPORT_EXTENDED        = 1,
+};
+
+enum NvmeZoneReportType {
+    NVME_ZONE_REPORT_ALL             = 0,
+    NVME_ZONE_REPORT_EMPTY           = 1,
+    NVME_ZONE_REPORT_IMPLICITLY_OPEN = 2,
+    NVME_ZONE_REPORT_EXPLICITLY_OPEN = 3,
+    NVME_ZONE_REPORT_CLOSED          = 4,
+    NVME_ZONE_REPORT_FULL            = 5,
+    NVME_ZONE_REPORT_READ_ONLY       = 6,
+    NVME_ZONE_REPORT_OFFLINE         = 7,
+};
+
+enum NvmeZoneType {
+    NVME_ZONE_TYPE_RESERVED          = 0x00,
+    NVME_ZONE_TYPE_SEQ_WRITE         = 0x02,
+};
+
+enum NvmeZoneSendAction {
+    NVME_ZONE_ACTION_RSD             = 0x00,
+    NVME_ZONE_ACTION_CLOSE           = 0x01,
+    NVME_ZONE_ACTION_FINISH          = 0x02,
+    NVME_ZONE_ACTION_OPEN            = 0x03,
+    NVME_ZONE_ACTION_RESET           = 0x04,
+    NVME_ZONE_ACTION_OFFLINE         = 0x05,
+    NVME_ZONE_ACTION_SET_ZD_EXT      = 0x10,
+};
+
+typedef struct QEMU_PACKED NvmeZoneDescr {
+    uint8_t     zt;
+    uint8_t     zs;
+    uint8_t     za;
+    uint8_t     rsvd3[5];
+    uint64_t    zcap;
+    uint64_t    zslba;
+    uint64_t    wp;
+    uint8_t     rsvd32[32];
+} NvmeZoneDescr;
+
+enum NvmeZoneState {
+    NVME_ZONE_STATE_RESERVED         = 0x00,
+    NVME_ZONE_STATE_EMPTY            = 0x01,
+    NVME_ZONE_STATE_IMPLICITLY_OPEN  = 0x02,
+    NVME_ZONE_STATE_EXPLICITLY_OPEN  = 0x03,
+    NVME_ZONE_STATE_CLOSED           = 0x04,
+    NVME_ZONE_STATE_READ_ONLY        = 0x0D,
+    NVME_ZONE_STATE_FULL             = 0x0E,
+    NVME_ZONE_STATE_OFFLINE          = 0x0F,
+};
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
@@ -1119,9 +1223,14 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrlZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAF) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAFE) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
 }
 #endif
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 06/11] hw/block/nvme: Introduce max active and open zone limits
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (4 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19  2:17 ` [PATCH v7 07/11] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Add two module properties, "max_active" and "max_open" to control
the maximum number of zones that can be active or open. Once these
variables are set to non-default values, these limits are checked
during I/O and Too Many Active or Too Many Open command status is
returned if they are exceeded.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.c | 28 ++++++++++++-
 hw/block/nvme-ns.h | 41 +++++++++++++++++++
 hw/block/nvme.c    | 99 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 166 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index fedfad595c..8d9e11eef2 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -118,6 +118,20 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
         ns->zone_size_log2 = 63 - clz64(ns->zone_size);
     }
 
+    /* Make sure that the values of all ZNS properties are sane */
+    if (ns->params.max_open_zones > nz) {
+        error_setg(errp,
+                   "max_open_zones value %u exceeds the number of zones %u",
+                   ns->params.max_open_zones, nz);
+        return -1;
+    }
+    if (ns->params.max_active_zones > nz) {
+        error_setg(errp,
+                   "max_active_zones value %u exceeds the number of zones %u",
+                   ns->params.max_active_zones, nz);
+        return -1;
+    }
+
     return 0;
 }
 
@@ -172,8 +186,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     id_ns_z = g_malloc0(sizeof(NvmeIdNsZoned));
 
     /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
-    id_ns_z->mar = 0xffffffff;
-    id_ns_z->mor = 0xffffffff;
+    id_ns_z->mar = cpu_to_le32(ns->params.max_active_zones - 1);
+    id_ns_z->mor = cpu_to_le32(ns->params.max_open_zones - 1);
     id_ns_z->zoc = 0;
     id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
 
@@ -199,6 +213,9 @@ static void nvme_zoned_clear_ns(NvmeNamespace *ns)
     uint32_t set_state;
     int i;
 
+    ns->nr_active_zones = 0;
+    ns->nr_open_zones = 0;
+
     zone = ns->zone_array;
     for (i = 0; i < ns->num_zones; i++, zone++) {
         switch (nvme_get_zone_state(zone)) {
@@ -209,6 +226,7 @@ static void nvme_zoned_clear_ns(NvmeNamespace *ns)
             QTAILQ_REMOVE(&ns->exp_open_zones, zone, entry);
             break;
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_inc_active(ns);
             /* fall through */
         default:
             continue;
@@ -216,6 +234,9 @@ static void nvme_zoned_clear_ns(NvmeNamespace *ns)
 
         if (zone->d.wp == zone->d.zslba) {
             set_state = NVME_ZONE_STATE_EMPTY;
+        } else if (ns->params.max_active_zones == 0 ||
+                   ns->nr_active_zones < ns->params.max_active_zones) {
+            set_state = NVME_ZONE_STATE_CLOSED;
         } else {
             set_state = NVME_ZONE_STATE_CLOSED;
         }
@@ -224,6 +245,7 @@ static void nvme_zoned_clear_ns(NvmeNamespace *ns)
         case NVME_ZONE_STATE_CLOSED:
             trace_pci_nvme_clear_ns_close(nvme_get_zone_state(zone),
                                           zone->d.zslba);
+            nvme_aor_inc_active(ns);
             QTAILQ_INSERT_TAIL(&ns->closed_zones, zone, entry);
             break;
         case NVME_ZONE_STATE_EMPTY:
@@ -326,6 +348,8 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_SIZE("zone_capacity", NvmeNamespace, params.zone_cap_bs, 0),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeNamespace,
                      params.cross_zone_read, false),
+    DEFINE_PROP_UINT32("max_active", NvmeNamespace, params.max_active_zones, 0),
+    DEFINE_PROP_UINT32("max_open", NvmeNamespace, params.max_open_zones, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 170cbb8cdc..b0633d0def 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -34,6 +34,8 @@ typedef struct NvmeNamespaceParams {
     bool     cross_zone_read;
     uint64_t zone_size_bs;
     uint64_t zone_cap_bs;
+    uint32_t max_active_zones;
+    uint32_t max_open_zones;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -56,6 +58,8 @@ typedef struct NvmeNamespace {
     uint64_t        zone_capacity;
     uint64_t        zone_array_size;
     uint32_t        zone_size_log2;
+    int32_t         nr_open_zones;
+    int32_t         nr_active_zones;
 
     NvmeNamespaceParams params;
 } NvmeNamespace;
@@ -123,6 +127,43 @@ static inline bool nvme_wp_is_valid(NvmeZone *zone)
            st != NVME_ZONE_STATE_OFFLINE;
 }
 
+static inline void nvme_aor_inc_open(NvmeNamespace *ns)
+{
+    assert(ns->nr_open_zones >= 0);
+    if (ns->params.max_open_zones) {
+        ns->nr_open_zones++;
+        assert(ns->nr_open_zones <= ns->params.max_open_zones);
+    }
+}
+
+static inline void nvme_aor_dec_open(NvmeNamespace *ns)
+{
+    if (ns->params.max_open_zones) {
+        assert(ns->nr_open_zones > 0);
+        ns->nr_open_zones--;
+    }
+    assert(ns->nr_open_zones >= 0);
+}
+
+static inline void nvme_aor_inc_active(NvmeNamespace *ns)
+{
+    assert(ns->nr_active_zones >= 0);
+    if (ns->params.max_active_zones) {
+        ns->nr_active_zones++;
+        assert(ns->nr_active_zones <= ns->params.max_active_zones);
+    }
+}
+
+static inline void nvme_aor_dec_active(NvmeNamespace *ns)
+{
+    if (ns->params.max_active_zones) {
+        assert(ns->nr_active_zones > 0);
+        ns->nr_active_zones--;
+        assert(ns->nr_active_zones >= ns->nr_open_zones);
+    }
+    assert(ns->nr_active_zones >= 0);
+}
+
 int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_flush(NvmeNamespace *ns);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 34d0d0250d..b3cdfccdfb 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -199,6 +199,26 @@ static void nvme_assign_zone_state(NvmeNamespace *ns, NvmeZone *zone,
     }
 }
 
+/*
+ * Check if we can open a zone without exceeding open/active limits.
+ * AOR stands for "Active and Open Resources" (see TP 4053 section 2.5).
+ */
+static int nvme_aor_check(NvmeNamespace *ns, uint32_t act, uint32_t opn)
+{
+    if (ns->params.max_active_zones != 0 &&
+        ns->nr_active_zones + act > ns->params.max_active_zones) {
+        trace_pci_nvme_err_insuff_active_res(ns->params.max_active_zones);
+        return NVME_ZONE_TOO_MANY_ACTIVE | NVME_DNR;
+    }
+    if (ns->params.max_open_zones != 0 &&
+        ns->nr_open_zones + opn > ns->params.max_open_zones) {
+        trace_pci_nvme_err_insuff_open_res(ns->params.max_open_zones);
+        return NVME_ZONE_TOO_MANY_OPEN | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
     hwaddr low = n->ctrl_mem.addr;
@@ -1207,6 +1227,41 @@ static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
     return status;
 }
 
+static void nvme_auto_transition_zone(NvmeNamespace *ns, bool implicit,
+                                      bool adding_active)
+{
+    NvmeZone *zone;
+
+    if (implicit && ns->params.max_open_zones &&
+        ns->nr_open_zones == ns->params.max_open_zones) {
+        zone = QTAILQ_FIRST(&ns->imp_open_zones);
+        if (zone) {
+            /*
+             * Automatically close this implicitly open zone.
+             */
+            QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
+            nvme_aor_dec_open(ns);
+            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
+        }
+    }
+}
+
+static uint16_t nvme_auto_open_zone(NvmeNamespace *ns, NvmeZone *zone)
+{
+    uint16_t status = NVME_SUCCESS;
+    uint8_t zs = nvme_get_zone_state(zone);
+
+    if (zs == NVME_ZONE_STATE_EMPTY) {
+        nvme_auto_transition_zone(ns, true, true);
+        status = nvme_aor_check(ns, 1, 1);
+    } else if (zs == NVME_ZONE_STATE_CLOSED) {
+        nvme_auto_transition_zone(ns, true, false);
+        status = nvme_aor_check(ns, 0, 1);
+    }
+
+    return status;
+}
+
 static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
                                       bool failed)
 {
@@ -1243,7 +1298,11 @@ static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
         switch (nvme_get_zone_state(zone)) {
         case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_aor_dec_open(ns);
+            /* fall through */
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_dec_active(ns);
+            /* fall through */
         case NVME_ZONE_STATE_EMPTY:
             nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
             /* fall through */
@@ -1272,7 +1331,10 @@ static uint64_t nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
         zs = nvme_get_zone_state(zone);
         switch (zs) {
         case NVME_ZONE_STATE_EMPTY:
+            nvme_aor_inc_active(ns);
+            /* fall through */
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_inc_open(ns);
             nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_IMPLICITLY_OPEN);
         }
     }
@@ -1378,6 +1440,11 @@ static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
             goto invalid;
         }
 
+        status = nvme_auto_open_zone(ns, zone);
+        if (status != NVME_SUCCESS) {
+            goto invalid;
+        }
+
         req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
     }
 
@@ -1436,6 +1503,11 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
                 slba = zone->w_ptr;
             }
 
+            status = nvme_auto_open_zone(ns, zone);
+            if (status != NVME_SUCCESS) {
+                goto invalid;
+            }
+
             req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
         } else {
             status = nvme_check_zone_read(ns, zone, slba, nlb, &rfc);
@@ -1537,9 +1609,27 @@ static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeNamespace *ns, NvmeCmd *c,
 static uint16_t nvme_open_zone(NvmeNamespace *ns, NvmeZone *zone,
                                uint8_t state)
 {
+    uint16_t status;
+
     switch (state) {
     case NVME_ZONE_STATE_EMPTY:
+        nvme_auto_transition_zone(ns, false, true);
+        status = nvme_aor_check(ns, 1, 0);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+        nvme_aor_inc_active(ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        status = nvme_aor_check(ns, 0, 1);
+        if (status != NVME_SUCCESS) {
+            if (state == NVME_ZONE_STATE_EMPTY) {
+                nvme_aor_dec_active(ns);
+            }
+            return status;
+        }
+        nvme_aor_inc_open(ns);
+        /* fall through */
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
         /* fall through */
@@ -1561,6 +1651,7 @@ static uint16_t nvme_close_zone(NvmeNamespace *ns, NvmeZone *zone,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(ns);
         nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
         /* fall through */
     case NVME_ZONE_STATE_CLOSED:
@@ -1582,7 +1673,11 @@ static uint16_t nvme_finish_zone(NvmeNamespace *ns, NvmeZone *zone,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        nvme_aor_dec_active(ns);
+        /* fall through */
     case NVME_ZONE_STATE_EMPTY:
         zone->w_ptr = nvme_zone_wr_boundary(zone);
         zone->d.wp = zone->w_ptr;
@@ -1608,7 +1703,11 @@ static uint16_t nvme_reset_zone(NvmeNamespace *ns, NvmeZone *zone,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        nvme_aor_dec_active(ns);
+        /* fall through */
     case NVME_ZONE_STATE_FULL:
         zone->w_ptr = zone->d.zslba;
         zone->d.wp = zone->w_ptr;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 07/11] hw/block/nvme: Support Zone Descriptor Extensions
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (5 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 06/11] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19  2:17 ` [PATCH v7 08/11] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Zone Descriptor Extension is a label that can be assigned to a zone.
It can be set to an Empty zone and it stays assigned until the zone
is reset.

This commit adds a new optional module property, "zone_descr_ext_size".
Its value must be a multiple of 64 bytes. If this value is non-zero,
it becomes possible to assign extensions of that size to any Empty
zones. The default value for this property is 0, therefore setting
extensions is disabled by default.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme-ns.c    | 14 ++++++++++--
 hw/block/nvme-ns.h    |  8 +++++++
 hw/block/nvme.c       | 51 +++++++++++++++++++++++++++++++++++++++++--
 hw/block/trace-events |  2 ++
 4 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 8d9e11eef2..255ded2b43 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -143,6 +143,10 @@ static void nvme_init_zone_state(NvmeNamespace *ns)
     int i;
 
     ns->zone_array = g_malloc0(ns->zone_array_size);
+    if (ns->params.zd_extension_size) {
+        ns->zd_extensions = g_malloc0(ns->params.zd_extension_size *
+                                      ns->num_zones);
+    }
 
     QTAILQ_INIT(&ns->exp_open_zones);
     QTAILQ_INIT(&ns->imp_open_zones);
@@ -192,7 +196,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     id_ns_z->ozcs = ns->params.cross_zone_read ? 0x01 : 0x00;
 
     id_ns_z->lbafe[lba_index].zsze = cpu_to_le64(ns->zone_size);
-    id_ns_z->lbafe[lba_index].zdes = 0;
+    id_ns_z->lbafe[lba_index].zdes =
+        ns->params.zd_extension_size >> 6; /* Units of 64B */
 
     ns->csi = NVME_CSI_ZONED;
     ns->id_ns.nsze = cpu_to_le64(ns->zone_size * ns->num_zones);
@@ -232,7 +237,9 @@ static void nvme_zoned_clear_ns(NvmeNamespace *ns)
             continue;
         }
 
-        if (zone->d.wp == zone->d.zslba) {
+        if (zone->d.za & NVME_ZA_ZD_EXT_VALID) {
+            set_state = NVME_ZONE_STATE_CLOSED;
+        } else if (zone->d.wp == zone->d.zslba) {
             set_state = NVME_ZONE_STATE_EMPTY;
         } else if (ns->params.max_active_zones == 0 ||
                    ns->nr_active_zones < ns->params.max_active_zones) {
@@ -320,6 +327,7 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
     if (ns->params.zoned) {
         g_free(ns->id_ns_zoned);
         g_free(ns->zone_array);
+        g_free(ns->zd_extensions);
     }
 }
 
@@ -350,6 +358,8 @@ static Property nvme_ns_props[] = {
                      params.cross_zone_read, false),
     DEFINE_PROP_UINT32("max_active", NvmeNamespace, params.max_active_zones, 0),
     DEFINE_PROP_UINT32("max_open", NvmeNamespace, params.max_open_zones, 0),
+    DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeNamespace,
+                       params.zd_extension_size, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index b0633d0def..2d70a13701 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -36,6 +36,7 @@ typedef struct NvmeNamespaceParams {
     uint64_t zone_cap_bs;
     uint32_t max_active_zones;
     uint32_t max_open_zones;
+    uint32_t zd_extension_size;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -58,6 +59,7 @@ typedef struct NvmeNamespace {
     uint64_t        zone_capacity;
     uint64_t        zone_array_size;
     uint32_t        zone_size_log2;
+    uint8_t         *zd_extensions;
     int32_t         nr_open_zones;
     int32_t         nr_active_zones;
 
@@ -127,6 +129,12 @@ static inline bool nvme_wp_is_valid(NvmeZone *zone)
            st != NVME_ZONE_STATE_OFFLINE;
 }
 
+static inline uint8_t *nvme_get_zd_extension(NvmeNamespace *ns,
+                                             uint32_t zone_idx)
+{
+    return &ns->zd_extensions[zone_idx * ns->params.zd_extension_size];
+}
+
 static inline void nvme_aor_inc_open(NvmeNamespace *ns)
 {
     assert(ns->nr_open_zones >= 0);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b3cdfccdfb..fbf27a5098 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1747,6 +1747,26 @@ static bool nvme_cond_offline_all(uint8_t state)
     return state == NVME_ZONE_STATE_READ_ONLY;
 }
 
+static uint16_t nvme_set_zd_ext(NvmeNamespace *ns, NvmeZone *zone,
+                                uint8_t state)
+{
+    uint16_t status;
+
+    if (state == NVME_ZONE_STATE_EMPTY) {
+        nvme_auto_transition_zone(ns, false, true);
+        status = nvme_aor_check(ns, 1, 0);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+        nvme_aor_inc_active(ns);
+        zone->d.za |= NVME_ZA_ZD_EXT_VALID;
+        nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
 typedef uint16_t (*op_handler_t)(NvmeNamespace *, NvmeZone *,
                                  uint8_t);
 typedef bool (*need_to_proc_zone_t)(uint8_t);
@@ -1787,6 +1807,7 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
     uint8_t action, state;
     bool all;
     NvmeZone *zone;
+    uint8_t *zd_ext;
 
     action = dw13 & 0xff;
     all = dw13 & 0x100;
@@ -1841,7 +1862,22 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
 
     case NVME_ZONE_ACTION_SET_ZD_EXT:
         trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
-        return NVME_INVALID_FIELD | NVME_DNR;
+        if (all || !ns->params.zd_extension_size) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+        zd_ext = nvme_get_zd_extension(ns, zone_idx);
+        status = nvme_dma(n, zd_ext, ns->params.zd_extension_size,
+                          DMA_DIRECTION_TO_DEVICE, req);
+        if (status) {
+            trace_pci_nvme_err_zd_extension_map_error(zone_idx);
+            return status;
+        }
+
+        status = nvme_set_zd_ext(ns, zone, state);
+        if (status == NVME_SUCCESS) {
+            trace_pci_nvme_zd_extension_set(zone_idx);
+            return status;
+        }
         break;
 
     default:
@@ -1919,7 +1955,7 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+    if (zra == NVME_ZONE_REPORT_EXTENDED && !ns->params.zd_extension_size) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1931,6 +1967,9 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
     partial = (dw13 >> 16) & 0x01;
 
     zone_entry_sz = sizeof(NvmeZoneDescr);
+    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+        zone_entry_sz += ns->params.zd_extension_size;
+    }
 
     max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
     buf = g_malloc0(len);
@@ -1962,6 +2001,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
             z->wp = cpu_to_le64(~0ULL);
         }
 
+        if (zra == NVME_ZONE_REPORT_EXTENDED) {
+            if (zs->d.za & NVME_ZA_ZD_EXT_VALID) {
+                memcpy(buf_p, nvme_get_zd_extension(ns, zone_idx),
+                       ns->params.zd_extension_size);
+            }
+            buf_p += ns->params.zd_extension_size;
+        }
+
         zone_idx++;
     }
 
diff --git a/hw/block/trace-events b/hw/block/trace-events
index af53e31fcb..962084e40c 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -96,6 +96,7 @@ pci_nvme_finish_zone(uint64_t slba, uint32_t zone_idx, int all) "finish zone, sl
 pci_nvme_reset_zone(uint64_t slba, uint32_t zone_idx, int all) "reset zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
 pci_nvme_offline_zone(uint64_t slba, uint32_t zone_idx, int all) "offline zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
 pci_nvme_set_descriptor_extension(uint64_t slba, uint32_t zone_idx) "set zone descriptor extension, slba=%"PRIu64", idx=%"PRIu32""
+pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_idx=%"PRIu32""
 pci_nvme_clear_ns_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
 pci_nvme_clear_ns_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
 pci_nvme_clear_ns_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Full state"
@@ -127,6 +128,7 @@ pci_nvme_err_zone_read_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slb
 pci_nvme_err_append_too_large(uint64_t slba, uint32_t nlb, uint8_t zasl) "slba=%"PRIu64", nlb=%"PRIu32", zasl=%"PRIu8""
 pci_nvme_err_insuff_active_res(uint32_t max_active) "max_active=%"PRIu32" zone limit exceeded"
 pci_nvme_err_insuff_open_res(uint32_t max_open) "max_open=%"PRIu32" zone limit exceeded"
+pci_nvme_err_zd_extension_map_error(uint32_t zone_idx) "can't map descriptor extension for zone_idx=%"PRIu32""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
 pci_nvme_err_only_zoned_cmd_set_avail(void) "setting 001b CC.CSS, but only ZONED+NVM command set is enabled"
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 08/11] hw/block/nvme: Add injection of Offline/Read-Only zones
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (6 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 07/11] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19 11:42   ` Klaus Jensen
  2020-10-19  2:17 ` [PATCH v7 09/11] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

ZNS specification defines two zone conditions for the zones that no
longer can function properly, possibly because of flash wear or other
internal fault. It is useful to be able to "inject" a small number of
such zones for testing purposes.

This commit defines two optional device properties, "offline_zones"
and "rdonly_zones". Users can assign non-zero values to these variables
to specify the number of zones to be initialized as Offline or
Read-Only. The actual number of injected zones may be smaller than the
requested amount - Read-Only and Offline counts are expected to be much
smaller than the total number of zones on a drive.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme-ns.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme-ns.h |  2 ++
 2 files changed, 66 insertions(+)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 255ded2b43..d050f97909 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -21,6 +21,7 @@
 #include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
 #include "qapi/error.h"
+#include "crypto/random.h"
 
 #include "hw/qdev-properties.h"
 #include "hw/qdev-core.h"
@@ -132,6 +133,32 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
+    if (ns->params.zd_extension_size) {
+        if (ns->params.zd_extension_size & 0x3f) {
+            error_setg(errp,
+                "zone descriptor extension size must be a multiple of 64B");
+            return -1;
+        }
+        if ((ns->params.zd_extension_size >> 6) > 0xff) {
+            error_setg(errp, "zone descriptor extension size is too large");
+            return -1;
+        }
+    }
+
+    if (ns->params.max_open_zones < nz) {
+        if (ns->params.nr_offline_zones > nz - ns->params.max_open_zones) {
+            error_setg(errp, "offline_zones value %u is too large",
+                ns->params.nr_offline_zones);
+            return -1;
+        }
+        if (ns->params.nr_rdonly_zones >
+            nz - ns->params.max_open_zones - ns->params.nr_offline_zones) {
+            error_setg(errp, "rdonly_zones value %u is too large",
+                ns->params.nr_rdonly_zones);
+            return -1;
+        }
+    }
+
     return 0;
 }
 
@@ -140,7 +167,9 @@ static void nvme_init_zone_state(NvmeNamespace *ns)
     uint64_t start = 0, zone_size = ns->zone_size;
     uint64_t capacity = ns->num_zones * zone_size;
     NvmeZone *zone;
+    uint32_t rnd;
     int i;
+    uint16_t zs;
 
     ns->zone_array = g_malloc0(ns->zone_array_size);
     if (ns->params.zd_extension_size) {
@@ -167,6 +196,37 @@ static void nvme_init_zone_state(NvmeNamespace *ns)
         zone->w_ptr = start;
         start += zone_size;
     }
+
+    /* If required, make some zones Offline or Read Only */
+
+    for (i = 0; i < ns->params.nr_offline_zones; i++) {
+        do {
+            qcrypto_random_bytes(&rnd, sizeof(rnd), NULL);
+            rnd %= ns->num_zones;
+        } while (rnd < ns->params.max_open_zones);
+        zone = &ns->zone_array[rnd];
+        zs = nvme_get_zone_state(zone);
+        if (zs != NVME_ZONE_STATE_OFFLINE) {
+            nvme_set_zone_state(zone, NVME_ZONE_STATE_OFFLINE);
+        } else {
+            i--;
+        }
+    }
+
+    for (i = 0; i < ns->params.nr_rdonly_zones; i++) {
+        do {
+            qcrypto_random_bytes(&rnd, sizeof(rnd), NULL);
+            rnd %= ns->num_zones;
+        } while (rnd < ns->params.max_open_zones);
+        zone = &ns->zone_array[rnd];
+        zs = nvme_get_zone_state(zone);
+        if (zs != NVME_ZONE_STATE_OFFLINE &&
+            zs != NVME_ZONE_STATE_READ_ONLY) {
+            nvme_set_zone_state(zone, NVME_ZONE_STATE_READ_ONLY);
+        } else {
+            i--;
+        }
+    }
 }
 
 static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
@@ -360,6 +420,10 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT32("max_open", NvmeNamespace, params.max_open_zones, 0),
     DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeNamespace,
                        params.zd_extension_size, 0),
+    DEFINE_PROP_UINT32("offline_zones", NvmeNamespace,
+                       params.nr_offline_zones, 0),
+    DEFINE_PROP_UINT32("rdonly_zones", NvmeNamespace,
+                       params.nr_rdonly_zones, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 2d70a13701..d65d8b0930 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -37,6 +37,8 @@ typedef struct NvmeNamespaceParams {
     uint32_t max_active_zones;
     uint32_t max_open_zones;
     uint32_t zd_extension_size;
+    uint32_t nr_offline_zones;
+    uint32_t nr_rdonly_zones;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 09/11] hw/block/nvme: Document zoned parameters in usage text
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (7 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 08/11] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-19  2:17 ` [PATCH v7 10/11] hw/block/nvme: Separate read and write handlers Dmitry Fomichev
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

Added brief descriptions of the new device properties that are
now available to users to configure features of Zoned Namespace
Command Set in the emulator.

This patch is for documentation only, no functionality change.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index fbf27a5098..3b9ea326d7 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.4, 1.3, 1.2, 1.1, 1.0e
  *
  *  https://nvmexpress.org/developers/nvme-specification/
  */
@@ -23,7 +23,8 @@
  *              max_ioqpairs=<N[optional]>, \
  *              aerl=<N[optional]>, aer_max_queued=<N[optional]>, \
  *              mdts=<N[optional]>
- *      -device nvme-ns,drive=<drive_id>,bus=bus_name,nsid=<nsid>
+ *      -device nvme-ns,drive=<drive_id>,bus=<bus_name>,nsid=<nsid>, \
+ *              zoned=<true|false[optional]>
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -49,6 +50,42 @@
  *   completion when there are no oustanding AERs. When the maximum number of
  *   enqueued events are reached, subsequent events will be dropped.
  *
+ * Setting `zoned` to true selects Zoned Command Set at the namespace.
+ * In this case, the following options are available to configure zoned
+ * operation:
+ *     zone_size=<zone size in bytes, default: 128MiB>
+ *         The number may be followed by K, M, G as in kilo-, mega- or giga.
+ *
+ *     zone_capacity=<zone capacity in bytes, default: zone_size>
+ *         The value 0 (default) forces zone capacity to be the same as zone
+ *         size. The value of this property may not exceed zone size.
+ *
+ *     zone_descr_ext_size=<zone descriptor extension size, default 0>
+ *         This value needs to be specified in 64B units. If it is zero,
+ *         namespace(s) will not support zone descriptor extensions.
+ *
+ *     max_active=<Maximum Active Resources (zones), default: 0 - no limit>
+ *
+ *     max_open=<Maximum Open Resources (zones), default: 0 - no limit>
+ *
+ *     zone_append_size_limit=<zone append size limit in bytes, default: 128KiB>
+ *         The maximum I/O size that can be supported by Zone Append
+ *         command. Since internally this this value is maintained as
+ *         ZASL = log2(<maximum append size> / <page size>), some
+ *         values assigned to this property may be rounded down and
+ *         result in a lower maximum ZA data size being in effect.
+ *         By setting this property to 0, user can make ZASL to be
+ *         equial to MDTS.
+ *
+ *     offline_zones=<the number of offline zones to inject, default: 0>
+ *
+ *     rdonly_zones=<the number of read-only zones to inject, default: 0>
+ *
+ *     cross_zone_read=<enables Read Across Zone Boundaries, default: true>
+ *
+ *     fill_pattern=<data fill pattern, default: 0x00>
+ *         The byte pattern to return for any portions of unwritten data
+ *         during read.
  */
 
 #include "qemu/osdep.h"
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 10/11] hw/block/nvme: Separate read and write handlers
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (8 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 09/11] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-20  8:28   ` Klaus Jensen
  2020-10-19  2:17 ` [PATCH v7 11/11] hw/block/nvme: Merge nvme_write_zeroes() with nvme_write() Dmitry Fomichev
  2020-10-19  7:32 ` [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Niklas Cassel
  11 siblings, 1 reply; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

With ZNS support in place, the majority of code in nvme_rw() has
become read- or write-specific. Move these parts to two separate
handlers, nvme_read() and nvme_write() to make the code more
readable and to remove multiple is_write checks that so far existed
in the i/o path.

This is a refactoring patch, no change in functionality.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c       | 191 +++++++++++++++++++++++++-----------------
 hw/block/trace-events |   3 +-
 2 files changed, 114 insertions(+), 80 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3b9ea326d7..5ec4ce5e28 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1162,10 +1162,10 @@ typedef struct NvmeReadFillCtx {
     uint32_t  post_rd_fill_nlb;
 } NvmeReadFillCtx;
 
-static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
-                                     uint64_t slba, uint32_t nlb,
-                                     NvmeReadFillCtx *rfc)
+static uint16_t nvme_check_zone_read(NvmeNamespace *ns, uint64_t slba,
+                                     uint32_t nlb, NvmeReadFillCtx *rfc)
 {
+    NvmeZone *zone = nvme_get_zone_by_slba(ns, slba);
     NvmeZone *next_zone;
     uint64_t bndry = nvme_zone_rd_boundary(ns, zone);
     uint64_t end = slba + nlb, wp1, wp2;
@@ -1449,6 +1449,86 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
     return NVME_NO_COMPLETE;
 }
 
+static uint16_t nvme_read(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    NvmeNamespace *ns = req->ns;
+    uint64_t slba = le64_to_cpu(rw->slba);
+    uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
+    uint32_t fill_len;
+    uint64_t data_size = nvme_l2b(ns, nlb);
+    uint64_t data_offset, fill_ofs;
+    NvmeReadFillCtx rfc;
+    BlockBackend *blk = ns->blkconf.blk;
+    uint16_t status;
+
+    trace_pci_nvme_read(nvme_cid(req), nvme_nsid(ns), nlb, data_size, slba);
+
+    status = nvme_check_mdts(n, data_size);
+    if (status) {
+        trace_pci_nvme_err_mdts(nvme_cid(req), data_size);
+        goto invalid;
+    }
+
+    status = nvme_check_bounds(n, ns, slba, nlb);
+    if (status) {
+        trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
+        goto invalid;
+    }
+
+    if (ns->params.zoned) {
+        status = nvme_check_zone_read(ns, slba, nlb, &rfc);
+        if (status != NVME_SUCCESS) {
+            trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
+            goto invalid;
+        }
+    }
+
+    status = nvme_map_dptr(n, data_size, req);
+    if (status) {
+        goto invalid;
+    }
+
+    if (ns->params.zoned) {
+        if (rfc.pre_rd_fill_nlb) {
+            fill_ofs = nvme_l2b(ns, rfc.pre_rd_fill_slba - slba);
+            fill_len = nvme_l2b(ns, rfc.pre_rd_fill_nlb);
+            nvme_fill_read_data(req, fill_ofs, fill_len,
+                                n->params.fill_pattern);
+        }
+        if (!rfc.read_nlb) {
+            /* No backend I/O necessary, only needed to fill the buffer */
+            req->status = NVME_SUCCESS;
+            return NVME_SUCCESS;
+        }
+        if (rfc.post_rd_fill_nlb) {
+            req->fill_ofs = nvme_l2b(ns, rfc.post_rd_fill_slba - slba);
+            req->fill_len = nvme_l2b(ns, rfc.post_rd_fill_nlb);
+        } else {
+            req->fill_len = 0;
+        }
+        slba = rfc.read_slba;
+        data_size = nvme_l2b(ns, rfc.read_nlb);
+    }
+
+    data_offset = nvme_l2b(ns, slba);
+
+    block_acct_start(blk_get_stats(blk), &req->acct, data_size,
+                     BLOCK_ACCT_READ);
+    if (req->qsg.sg) {
+        req->aiocb = dma_blk_read(blk, &req->qsg, data_offset,
+                                  BDRV_SECTOR_SIZE, nvme_rw_cb, req);
+    } else {
+        req->aiocb = blk_aio_preadv(blk, data_offset, &req->iov, 0,
+                                    nvme_rw_cb, req);
+    }
+    return NVME_NO_COMPLETE;
+
+invalid:
+    block_acct_invalid(blk_get_stats(blk), BLOCK_ACCT_READ);
+    return status | NVME_DNR;
+}
+
 static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
@@ -1495,25 +1575,20 @@ invalid:
     return status | NVME_DNR;
 }
 
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
+static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req, bool append)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
-    uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
     uint64_t slba = le64_to_cpu(rw->slba);
+    uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
     uint64_t data_size = nvme_l2b(ns, nlb);
-    uint64_t data_offset, fill_ofs;
-
+    uint64_t data_offset;
     NvmeZone *zone;
-    uint32_t fill_len;
-    NvmeReadFillCtx rfc;
-    bool is_write = rw->opcode == NVME_CMD_WRITE || append;
-    enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
     BlockBackend *blk = ns->blkconf.blk;
     uint16_t status;
 
-    trace_pci_nvme_rw(nvme_cid(req), nvme_io_opc_str(rw->opcode),
-                      nvme_nsid(ns), nlb, data_size, slba);
+    trace_pci_nvme_write(nvme_cid(req), nvme_io_opc_str(rw->opcode),
+                         nvme_nsid(ns), nlb, data_size, slba);
 
     status = nvme_check_mdts(n, data_size);
     if (status) {
@@ -1530,29 +1605,21 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
     if (ns->params.zoned) {
         zone = nvme_get_zone_by_slba(ns, slba);
 
-        if (is_write) {
-            status = nvme_check_zone_write(n, ns, zone, slba, nlb, append);
-            if (status != NVME_SUCCESS) {
-                goto invalid;
-            }
-
-            if (append) {
-                slba = zone->w_ptr;
-            }
-
-            status = nvme_auto_open_zone(ns, zone);
-            if (status != NVME_SUCCESS) {
-                goto invalid;
-            }
-
-            req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
-        } else {
-            status = nvme_check_zone_read(ns, zone, slba, nlb, &rfc);
-            if (status != NVME_SUCCESS) {
-                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
-                goto invalid;
-            }
+        status = nvme_check_zone_write(n, ns, zone, slba, nlb, append);
+        if (status != NVME_SUCCESS) {
+            goto invalid;
         }
+
+        status = nvme_auto_open_zone(ns, zone);
+        if (status != NVME_SUCCESS) {
+            goto invalid;
+        }
+
+        if (append) {
+            slba = zone->w_ptr;
+        }
+
+        req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
     } else if (append) {
         trace_pci_nvme_err_invalid_opc(rw->opcode);
         status = NVME_INVALID_OPCODE;
@@ -1566,56 +1633,21 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req, bool append)
         goto invalid;
     }
 
-    if (ns->params.zoned) {
-        if (is_write) {
-            req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
-        } else {
-            if (rfc.pre_rd_fill_nlb) {
-                fill_ofs = nvme_l2b(ns, rfc.pre_rd_fill_slba - slba);
-                fill_len = nvme_l2b(ns, rfc.pre_rd_fill_nlb);
-                nvme_fill_read_data(req, fill_ofs, fill_len,
-                                    n->params.fill_pattern);
-            }
-            if (!rfc.read_nlb) {
-                /* No backend I/O necessary, only needed to fill the buffer */
-                req->status = NVME_SUCCESS;
-                return NVME_SUCCESS;
-            }
-            if (rfc.post_rd_fill_nlb) {
-                req->fill_ofs = nvme_l2b(ns, rfc.post_rd_fill_slba - slba);
-                req->fill_len = nvme_l2b(ns, rfc.post_rd_fill_nlb);
-            } else {
-                req->fill_len = 0;
-            }
-            slba = rfc.read_slba;
-            data_size = nvme_l2b(ns, rfc.read_nlb);
-        }
-    }
-
     data_offset = nvme_l2b(ns, slba);
 
-    block_acct_start(blk_get_stats(blk), &req->acct, data_size, acct);
+    block_acct_start(blk_get_stats(blk), &req->acct, data_size,
+                     BLOCK_ACCT_WRITE);
     if (req->qsg.sg) {
-        if (is_write) {
-            req->aiocb = dma_blk_write(blk, &req->qsg, data_offset,
-                                       BDRV_SECTOR_SIZE, nvme_rw_cb, req);
-        } else {
-            req->aiocb = dma_blk_read(blk, &req->qsg, data_offset,
-                                      BDRV_SECTOR_SIZE, nvme_rw_cb, req);
-        }
+        req->aiocb = dma_blk_write(blk, &req->qsg, data_offset,
+                                   BDRV_SECTOR_SIZE, nvme_rw_cb, req);
     } else {
-        if (is_write) {
-            req->aiocb = blk_aio_pwritev(blk, data_offset, &req->iov, 0,
-                                         nvme_rw_cb, req);
-        } else {
-            req->aiocb = blk_aio_preadv(blk, data_offset, &req->iov, 0,
-                                        nvme_rw_cb, req);
-        }
+        req->aiocb = blk_aio_pwritev(blk, data_offset, &req->iov, 0,
+                                     nvme_rw_cb, req);
     }
     return NVME_NO_COMPLETE;
 
 invalid:
-    block_acct_invalid(blk_get_stats(blk), acct);
+    block_acct_invalid(blk_get_stats(blk), BLOCK_ACCT_WRITE);
     return status | NVME_DNR;
 }
 
@@ -2096,10 +2128,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_WRITE_ZEROES:
         return nvme_write_zeroes(n, req);
     case NVME_CMD_ZONE_APPEND:
-        return nvme_rw(n, req, true);
+        return nvme_write(n, req, true);
     case NVME_CMD_WRITE:
+        return nvme_write(n, req, false);
     case NVME_CMD_READ:
-        return nvme_rw(n, req, false);
+        return nvme_read(n, req);
     case NVME_CMD_ZONE_MGMT_SEND:
         return nvme_zone_mgmt_send(n, req);
     case NVME_CMD_ZONE_MGMT_RECV:
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 962084e40c..7ee90a50c3 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -40,7 +40,8 @@ pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2,
 pci_nvme_map_sgl(uint16_t cid, uint8_t typ, uint64_t len) "cid %"PRIu16" type 0x%"PRIx8" len %"PRIu64""
 pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
 pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
-pci_nvme_rw(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
+pci_nvme_read(uint16_t cid, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
+pci_nvme_write(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
 pci_nvme_write_zeroes(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
 pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 11/11] hw/block/nvme: Merge nvme_write_zeroes() with nvme_write()
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (9 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 10/11] hw/block/nvme: Separate read and write handlers Dmitry Fomichev
@ 2020-10-19  2:17 ` Dmitry Fomichev
  2020-10-20  8:29   ` Klaus Jensen
  2020-10-19  7:32 ` [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Niklas Cassel
  11 siblings, 1 reply; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-19  2:17 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen, Kevin Wolf,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Fam Zheng
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Alistair Francis, Matias Bjorling

nvme_write() now handles WRITE, WRITE ZEROES and ZONE_APPEND.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c       | 95 +++++++++++++------------------------------
 hw/block/trace-events |  1 -
 2 files changed, 28 insertions(+), 68 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5ec4ce5e28..aa929d1edf 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1529,53 +1529,7 @@ invalid:
     return status | NVME_DNR;
 }
 
-static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
-{
-    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
-    NvmeNamespace *ns = req->ns;
-    uint64_t slba = le64_to_cpu(rw->slba);
-    uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
-    NvmeZone *zone;
-    uint64_t offset = nvme_l2b(ns, slba);
-    uint32_t count = nvme_l2b(ns, nlb);
-    BlockBackend *blk = ns->blkconf.blk;
-    uint16_t status;
-
-    trace_pci_nvme_write_zeroes(nvme_cid(req), nvme_nsid(ns), slba, nlb);
-
-    status = nvme_check_bounds(n, ns, slba, nlb);
-    if (status) {
-        trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
-        return status;
-    }
-
-    if (ns->params.zoned) {
-        zone = nvme_get_zone_by_slba(ns, slba);
-
-        status = nvme_check_zone_write(n, ns, zone, slba, nlb, false);
-        if (status != NVME_SUCCESS) {
-            goto invalid;
-        }
-
-        status = nvme_auto_open_zone(ns, zone);
-        if (status != NVME_SUCCESS) {
-            goto invalid;
-        }
-
-        req->cqe.result64 = nvme_advance_zone_wp(ns, zone, nlb);
-    }
-
-    block_acct_start(blk_get_stats(blk), &req->acct, 0, BLOCK_ACCT_WRITE);
-    req->aiocb = blk_aio_pwrite_zeroes(blk, offset, count,
-                                       BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
-    return NVME_NO_COMPLETE;
-
-invalid:
-    block_acct_invalid(blk_get_stats(blk), BLOCK_ACCT_WRITE);
-    return status | NVME_DNR;
-}
-
-static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req, bool append)
+static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req, bool append, bool wrz)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
@@ -1590,10 +1544,12 @@ static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req, bool append)
     trace_pci_nvme_write(nvme_cid(req), nvme_io_opc_str(rw->opcode),
                          nvme_nsid(ns), nlb, data_size, slba);
 
-    status = nvme_check_mdts(n, data_size);
-    if (status) {
-        trace_pci_nvme_err_mdts(nvme_cid(req), data_size);
-        goto invalid;
+    if (!wrz) {
+        status = nvme_check_mdts(n, data_size);
+        if (status) {
+            trace_pci_nvme_err_mdts(nvme_cid(req), data_size);
+            goto invalid;
+        }
     }
 
     status = nvme_check_bounds(n, ns, slba, nlb);
@@ -1628,21 +1584,26 @@ static uint16_t nvme_write(NvmeCtrl *n, NvmeRequest *req, bool append)
 
     data_offset = nvme_l2b(ns, slba);
 
-    status = nvme_map_dptr(n, data_size, req);
-    if (status) {
-        goto invalid;
-    }
+    if (!wrz) {
+        status = nvme_map_dptr(n, data_size, req);
+        if (status) {
+            goto invalid;
+        }
 
-    data_offset = nvme_l2b(ns, slba);
-
-    block_acct_start(blk_get_stats(blk), &req->acct, data_size,
-                     BLOCK_ACCT_WRITE);
-    if (req->qsg.sg) {
-        req->aiocb = dma_blk_write(blk, &req->qsg, data_offset,
-                                   BDRV_SECTOR_SIZE, nvme_rw_cb, req);
+        block_acct_start(blk_get_stats(blk), &req->acct, data_size,
+                         BLOCK_ACCT_WRITE);
+        if (req->qsg.sg) {
+            req->aiocb = dma_blk_write(blk, &req->qsg, data_offset,
+                                       BDRV_SECTOR_SIZE, nvme_rw_cb, req);
+        } else {
+            req->aiocb = blk_aio_pwritev(blk, data_offset, &req->iov, 0,
+                                         nvme_rw_cb, req);
+        }
     } else {
-        req->aiocb = blk_aio_pwritev(blk, data_offset, &req->iov, 0,
-                                     nvme_rw_cb, req);
+        block_acct_start(blk_get_stats(blk), &req->acct, 0, BLOCK_ACCT_WRITE);
+        req->aiocb = blk_aio_pwrite_zeroes(blk, data_offset, data_size,
+                                           BDRV_REQ_MAY_UNMAP, nvme_rw_cb,
+                                           req);
     }
     return NVME_NO_COMPLETE;
 
@@ -2126,11 +2087,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_FLUSH:
         return nvme_flush(n, req);
     case NVME_CMD_WRITE_ZEROES:
-        return nvme_write_zeroes(n, req);
+        return nvme_write(n, req, false, true);
     case NVME_CMD_ZONE_APPEND:
-        return nvme_write(n, req, true);
+        return nvme_write(n, req, true, false);
     case NVME_CMD_WRITE:
-        return nvme_write(n, req, false);
+        return nvme_write(n, req, false, false);
     case NVME_CMD_READ:
         return nvme_read(n, req);
     case NVME_CMD_ZONE_MGMT_SEND:
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 7ee90a50c3..5a3cd4c5dc 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -43,7 +43,6 @@ pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opna
 pci_nvme_read(uint16_t cid, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_write(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
-pci_nvme_write_zeroes(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
 pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
 pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (10 preceding siblings ...)
  2020-10-19  2:17 ` [PATCH v7 11/11] hw/block/nvme: Merge nvme_write_zeroes() with nvme_write() Dmitry Fomichev
@ 2020-10-19  7:32 ` Niklas Cassel
  11 siblings, 0 replies; 36+ messages in thread
From: Niklas Cassel @ 2020-10-19  7:32 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Klaus Jensen,
	qemu-devel, Maxim Levitsky, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Oct 19, 2020 at 11:17:15AM +0900, Dmitry Fomichev wrote:

(snip)

> 
> Dmitry Fomichev (9):
>   hw/block/nvme: Add Commands Supported and Effects log
>   hw/block/nvme: Generate namespace UUIDs
>   hw/block/nvme: Support Zoned Namespace Command Set
>   hw/block/nvme: Introduce max active and open zone limits
>   hw/block/nvme: Support Zone Descriptor Extensions
>   hw/block/nvme: Add injection of Offline/Read-Only zones
>   hw/block/nvme: Document zoned parameters in usage text
>   hw/block/nvme: Separate read and write handlers
>   hw/block/nvme: Merge nvme_write_zeroes() with nvme_write()
> 
> Niklas Cassel (2):
>   hw/block/nvme: Add support for Namespace Types
>   hw/block/nvme: Support allocated CNS command variants
> 
>  block/nvme.c          |    2 +-
>  hw/block/nvme-ns.c    |  295 ++++++++
>  hw/block/nvme-ns.h    |  109 +++
>  hw/block/nvme.c       | 1550 ++++++++++++++++++++++++++++++++++++++---
>  hw/block/nvme.h       |    9 +
>  hw/block/trace-events |   36 +-
>  include/block/nvme.h  |  201 +++++-
>  7 files changed, 2078 insertions(+), 124 deletions(-)
> 
> -- 
> 2.21.0
> 

Thank you Dmitry, this version was easier to review.

Except for a missing
/* fall through */ comment in nvme_cmd_effects().
(in the "hw/block/nvme: Add Commands Supported and Effects log" patch.)

For the whole series:
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-19  2:17 ` [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
@ 2020-10-19  9:50   ` Klaus Jensen
  2020-10-19 15:55     ` Klaus Jensen
  2020-10-19 12:33   ` Klaus Jensen
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Klaus Jensen @ 2020-10-19  9:50 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 11263 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Define values and structures that are needed to support Zoned
> Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
> Define trace events where needed in newly introduced code.
> 
> In order to improve scalability, all open, closed and full zones
> are organized in separate linked lists. Consequently, almost all
> zone operations don't require scanning of the entire zone array
> (which potentially can be quite large) - it is only necessary to
> enumerate one or more zone lists.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> Subsequent commits in this series add ZDE support and checks for
> active and open zone limits.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c          |   2 +-
>  hw/block/nvme-ns.c    | 193 +++++++++
>  hw/block/nvme-ns.h    |  54 +++
>  hw/block/nvme.c       | 975 ++++++++++++++++++++++++++++++++++++++++--
>  hw/block/nvme.h       |   9 +
>  hw/block/trace-events |  21 +
>  include/block/nvme.h  | 113 ++++-
>  7 files changed, 1339 insertions(+), 28 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> +static void nvme_fill_read_data(NvmeRequest *req, uint64_t offset,
> +                                uint32_t max_len, uint8_t pattern)
> +{
> +    QEMUSGList *qsg = &req->qsg;
> +    QEMUIOVector *iov = &req->iov;
> +    ScatterGatherEntry *entry;
> +    uint32_t len, ent_len;
> +
> +    if (qsg->nsg > 0) {
> +        entry = qsg->sg;
> +        len = qsg->size;
> +        if (max_len) {
> +            len = MIN(len, max_len);
> +        }
> +        for (; len > 0; len -= ent_len) {
> +            ent_len = MIN(len, entry->len);
> +            if (offset > ent_len) {
> +                offset -= ent_len;
> +            } else if (offset != 0) {
> +                dma_memory_set(qsg->as, entry->base + offset,
> +                               pattern, ent_len - offset);
> +                offset = 0;
> +            } else {
> +                dma_memory_set(qsg->as, entry->base, pattern, ent_len);
> +            }
> +            entry++;

dma_memory_set can fail. In that case, I think just fail the command
with NVME_DATA_TRAS_ERROR.

But I think this should be removed in any case, see below.

> +static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
> +                                     uint64_t slba, uint32_t nlb,
> +                                     NvmeReadFillCtx *rfc)
> +{
> +    NvmeZone *next_zone;
> +    uint64_t bndry = nvme_zone_rd_boundary(ns, zone);
> +    uint64_t end = slba + nlb, wp1, wp2;
> +    uint16_t status;
> +
> +    rfc->pre_rd_fill_slba = ~0ULL;
> +    rfc->pre_rd_fill_nlb = 0;
> +    rfc->read_slba = slba;
> +    rfc->read_nlb = nlb;
> +    rfc->post_rd_fill_slba = ~0ULL;
> +    rfc->post_rd_fill_nlb = 0;
> +
> +    status = nvme_zone_state_ok_to_read(zone);
> +    if (status != NVME_SUCCESS) {
> +        ;
> +    } else if (likely(end <= bndry)) {
> +        if (end > zone->w_ptr) {
> +            wp1 = zone->w_ptr;
> +            if (slba >= wp1) {
> +                /* No i/o necessary, just fill */
> +                rfc->pre_rd_fill_slba = slba;
> +                rfc->pre_rd_fill_nlb = nlb;
> +                rfc->read_nlb = 0;
> +            } else {
> +                rfc->read_nlb = wp1 - slba;
> +                rfc->post_rd_fill_slba = wp1;
> +                rfc->post_rd_fill_nlb = nlb - rfc->read_nlb;
> +           }
> +        }
> +    } else if (!ns->params.cross_zone_read) {
> +        status = NVME_ZONE_BOUNDARY_ERROR;
> +    } else {
> +        /*
> +         * Read across zone boundary, look at the next zone.
> +         * Earlier bounds checks ensure that the current zone
> +         * is not the last one.
> +         */
> +        next_zone = zone + 1;
> +        status = nvme_zone_state_ok_to_read(next_zone);
> +        if (status != NVME_SUCCESS) {
> +            ;
> +        } else if (end > nvme_zone_rd_boundary(ns, next_zone)) {
> +            /*
> +             * As zone size is much larger than a typical maximum
> +             * i/o size in real hardware, allow the i/o range
> +             * to span no more than one pair of zones.
> +             */
> +            status = NVME_ZONE_BOUNDARY_ERROR;

While assumptious, it seems totally fair. But irrelevant if removed as I
propose below.

> +        } else {
> +            wp1 = zone->w_ptr;
> +            wp2 = next_zone->w_ptr;
> +            if (wp2 == bndry) {
> +                if (slba >= wp1) {
> +                    /* Again, no i/o necessary, just fill */
> +                    rfc->pre_rd_fill_slba = slba;
> +                    rfc->pre_rd_fill_nlb = nlb;
> +                    rfc->read_nlb = 0;
> +                } else {
> +                    rfc->read_nlb = wp1 - slba;
> +                    rfc->post_rd_fill_slba = wp1;
> +                    rfc->post_rd_fill_nlb = nlb - rfc->read_nlb;
> +                }
> +            } else if (slba < wp1) {
> +                if (end > wp2) {
> +                    if (wp1 == bndry) {
> +                        rfc->post_rd_fill_slba = wp2;
> +                        rfc->post_rd_fill_nlb = end - wp2;
> +                        rfc->read_nlb = wp2 - slba;
> +                    } else {
> +                        rfc->pre_rd_fill_slba = wp2;
> +                        rfc->pre_rd_fill_nlb = end - wp2;
> +                        rfc->read_nlb = wp2 - slba;
> +                        rfc->post_rd_fill_slba = wp1;
> +                        rfc->post_rd_fill_nlb = bndry - wp1;
> +                    }
> +                } else {
> +                    rfc->post_rd_fill_slba = wp1;
> +                    rfc->post_rd_fill_nlb = bndry - wp1;
> +                }
> +            } else {
> +                if (end > wp2) {
> +                    rfc->pre_rd_fill_slba = slba;
> +                    rfc->pre_rd_fill_nlb = end - slba;
> +                    rfc->read_slba = bndry;
> +                    rfc->read_nlb = wp2 - bndry;
> +                } else {
> +                    rfc->read_slba = bndry;
> +                    rfc->read_nlb = end - bndry;
> +                    rfc->post_rd_fill_slba = slba;
> +                    rfc->post_rd_fill_nlb = bndry - slba;
> +                }
> +            }
> +        }

This seems to use the read boundary (zone size), the gap between ZCAP
and ZSZE should also be filled with the fill pattern. This brings in one
more gap which causes this code to have even more edge cases.

This feels like an ad-hoc solution and this pre/post filling strategy is
going to fall short if (when) we implement support for DSM since that
may introduce deallocated blocks all over the place. Technically, this
is not your problem, (but then it is probably going to be my headache
soon since I'd like to introduce DSM support), so I would prefer that we
find a better solution here. I think my work on DULBE and relying on the
ability of the block layer to guarantee zeroes in most cases is useful
here and a better solution than faking it like this.

I fear that adding fill_pattern support in a way that only works for
zoned namespaces is going to cause us a lot of headaches if we want to
support it (and DSM) for NVM namespaces as well.

I propose that you drop the fill_pattern feature and just rely on the
block layer for this (by possibly integrating with the DULBE support
that I posted). This would allow RAZB to still be trivially supported
(and even across more than one boundary). For reference, my ZNS proposal
supports this (but not RAZB) along with the required discards on zone
resets, though I think it is missing a check on the zone size being a
multiple of the cluster size of the underlying blockdev for zeroes to be
guaranteed for non-zero cluster sizes.

> +    }
> +
> +    return status;
> +}
> +
> +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> +                                      bool failed)
> +{
> +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +    NvmeZone *zone;
> +    uint64_t slba, start_wp = req->cqe.result64;
> +    uint32_t nlb;
> +
> +    if (rw->opcode != NVME_CMD_WRITE &&
> +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> +        return false;
> +    }
> +
> +    slba = le64_to_cpu(rw->slba);
> +    nlb = le16_to_cpu(rw->nlb) + 1;
> +    zone = nvme_get_zone_by_slba(ns, slba);
> +
> +    if (!failed && zone->w_ptr < start_wp + nlb) {
> +        /*
> +         * A preceding queued write to the zone has failed,
> +         * now this write is not at the WP, fail it too.
> +         */
> +        failed = true;
> +    }
> +
> +    if (failed) {
> +        if (zone->w_ptr > start_wp) {
> +            zone->w_ptr = start_wp;
> +            zone->d.wp = start_wp;
> +        }

This doesn't fix the data corruption. The example from my last review
still applies.

> +        req->cqe.result64 = 0;
> +    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
> +        switch (nvme_get_zone_state(zone)) {
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_CLOSED:
> +        case NVME_ZONE_STATE_EMPTY:
> +            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
> +            /* fall through */
> +        case NVME_ZONE_STATE_FULL:
> +            break;
> +        default:
> +            assert(false);
> +        }
> +        zone->d.wp = zone->w_ptr;
> +    } else {
> +        zone->d.wp += nlb;
> +    }
> +
> +    return failed;
> +}

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 08/11] hw/block/nvme: Add injection of Offline/Read-Only zones
  2020-10-19  2:17 ` [PATCH v7 08/11] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
@ 2020-10-19 11:42   ` Klaus Jensen
  2020-10-20 23:01     ` Dmitry Fomichev
  0 siblings, 1 reply; 36+ messages in thread
From: Klaus Jensen @ 2020-10-19 11:42 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 1946 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> ZNS specification defines two zone conditions for the zones that no
> longer can function properly, possibly because of flash wear or other
> internal fault. It is useful to be able to "inject" a small number of
> such zones for testing purposes.
> 
> This commit defines two optional device properties, "offline_zones"
> and "rdonly_zones". Users can assign non-zero values to these variables
> to specify the number of zones to be initialized as Offline or
> Read-Only. The actual number of injected zones may be smaller than the
> requested amount - Read-Only and Offline counts are expected to be much
> smaller than the total number of zones on a drive.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++
>  hw/block/nvme-ns.h |  2 ++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 255ded2b43..d050f97909 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -21,6 +21,7 @@
>  #include "sysemu/sysemu.h"
>  #include "sysemu/block-backend.h"
>  #include "qapi/error.h"
> +#include "crypto/random.h"
>  
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-core.h"
> @@ -132,6 +133,32 @@ static int nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
>          return -1;
>      }
>  
> +    if (ns->params.zd_extension_size) {
> +        if (ns->params.zd_extension_size & 0x3f) {
> +            error_setg(errp,
> +                "zone descriptor extension size must be a multiple of 64B");
> +            return -1;
> +        }
> +        if ((ns->params.zd_extension_size >> 6) > 0xff) {
> +            error_setg(errp, "zone descriptor extension size is too large");
> +            return -1;
> +        }
> +    }

Looks like this should have been added in the previous patch.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-19  2:17 ` [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
  2020-10-19  9:50   ` Klaus Jensen
@ 2020-10-19 12:33   ` Klaus Jensen
  2020-10-20 11:08   ` Klaus Jensen
  2020-10-21 10:26   ` Klaus Jensen
  3 siblings, 0 replies; 36+ messages in thread
From: Klaus Jensen @ 2020-10-19 12:33 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 981 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index d6b2808b97..170cbb8cdc 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -34,6 +45,18 @@ typedef struct NvmeNamespace {
>      const uint32_t *iocs;
>      uint8_t      csi;
>  
> +    NvmeIdNsZoned   *id_ns_zoned;
> +    NvmeZone        *zone_array;
> +    QTAILQ_HEAD(, NvmeZone) exp_open_zones;
> +    QTAILQ_HEAD(, NvmeZone) imp_open_zones;
> +    QTAILQ_HEAD(, NvmeZone) closed_zones;
> +    QTAILQ_HEAD(, NvmeZone) full_zones;

Apart from the imp_open_zones list that is being used in a later patch
to support Implicitly Opened to Closed transitions, these lists seem
rather pointless. As far as I can tell the only use they have is being
inserted into, removed from and checking if a zone is in one of those
four states?

The Zone Management Receive (and Send with Select All) is just iterating
on all zones and matching on state.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-19  9:50   ` Klaus Jensen
@ 2020-10-19 15:55     ` Klaus Jensen
  0 siblings, 0 replies; 36+ messages in thread
From: Klaus Jensen @ 2020-10-19 15:55 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Alistair Francis, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 1344 bytes --]

On Oct 19 11:50, Klaus Jensen wrote:
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> > +                                      bool failed)
> > +{
> > +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> > +    NvmeZone *zone;
> > +    uint64_t slba, start_wp = req->cqe.result64;
> > +    uint32_t nlb;
> > +
> > +    if (rw->opcode != NVME_CMD_WRITE &&
> > +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> > +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> > +        return false;
> > +    }
> > +
> > +    slba = le64_to_cpu(rw->slba);
> > +    nlb = le16_to_cpu(rw->nlb) + 1;
> > +    zone = nvme_get_zone_by_slba(ns, slba);
> > +
> > +    if (!failed && zone->w_ptr < start_wp + nlb) {
> > +        /*
> > +         * A preceding queued write to the zone has failed,
> > +         * now this write is not at the WP, fail it too.
> > +         */
> > +        failed = true;
> > +    }
> > +
> > +    if (failed) {
> > +        if (zone->w_ptr > start_wp) {
> > +            zone->w_ptr = start_wp;
> > +            zone->d.wp = start_wp;
> > +        }
> 
> This doesn't fix the data corruption. The example from my last review
> still applies.
> 

An easy fix is to just unconditionally advance the write pointer in all
cases.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log
  2020-10-19  2:17 ` [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
@ 2020-10-19 19:22   ` Keith Busch
  2020-10-19 20:16   ` Klaus Jensen
  1 sibling, 0 replies; 36+ messages in thread
From: Keith Busch @ 2020-10-19 19:22 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Oct 19, 2020 at 11:17:16AM +0900, Dmitry Fomichev wrote:
> This log page becomes necessary to implement to allow checking for
> Zone Append command support in Zoned Namespace Command Set.
> 
> This commit adds the code to report this log page for NVM Command
> Set only. The parts that are specific to zoned operation will be
> added later in the series.
> 
> All incoming admin and i/o commands are now only processed if their
> corresponding support bits are set in this log. This provides an
> easy way to control what commands to support and what not to
> depending on set CC.CSS.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Looks good to me.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 02/11] hw/block/nvme: Generate namespace UUIDs
  2020-10-19  2:17 ` [PATCH v7 02/11] hw/block/nvme: Generate namespace UUIDs Dmitry Fomichev
@ 2020-10-19 19:24   ` Keith Busch
  2020-10-19 19:30   ` Klaus Jensen
  1 sibling, 0 replies; 36+ messages in thread
From: Keith Busch @ 2020-10-19 19:24 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Oct 19, 2020 at 11:17:17AM +0900, Dmitry Fomichev wrote:
> In NVMe 1.4, a namespace must report an ID descriptor of UUID type
> if it doesn't support EUI64 or NGUID. Add a new namespace property,
> "uuid", that provides the user the option to either specify the UUID
> explicitly or have a UUID generated automatically every time a
> namespace is initialized.
> 
> Suggested-by: Klaus Jansen <its@irrelevant.dk>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> Reviewed-by: Klaus Jansen <its@irrelevant.dk>

Looks good to me.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 02/11] hw/block/nvme: Generate namespace UUIDs
  2020-10-19  2:17 ` [PATCH v7 02/11] hw/block/nvme: Generate namespace UUIDs Dmitry Fomichev
  2020-10-19 19:24   ` Keith Busch
@ 2020-10-19 19:30   ` Klaus Jensen
  1 sibling, 0 replies; 36+ messages in thread
From: Klaus Jensen @ 2020-10-19 19:30 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 428 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> In NVMe 1.4, a namespace must report an ID descriptor of UUID type
> if it doesn't support EUI64 or NGUID. Add a new namespace property,
> "uuid", that provides the user the option to either specify the UUID
> explicitly or have a UUID generated automatically every time a
> namespace is initialized.
> 
> Suggested-by: Klaus Jansen <its@irrelevant.dk>

s/Jansen/Jensen

:)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types
  2020-10-19  2:17 ` [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
@ 2020-10-19 19:51   ` Keith Busch
  2020-10-19 20:53   ` Klaus Jensen
  1 sibling, 0 replies; 36+ messages in thread
From: Keith Busch @ 2020-10-19 19:51 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Oct 19, 2020 at 11:17:18AM +0900, Dmitry Fomichev wrote:
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
...
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);

You've got duplicate sizeof checks for the NvmeIdNsDescr.

Otherwise, the patch looks fine.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants
  2020-10-19  2:17 ` [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants Dmitry Fomichev
@ 2020-10-19 20:07   ` Keith Busch
  2020-10-20  8:21   ` Klaus Jensen
  1 sibling, 0 replies; 36+ messages in thread
From: Keith Busch @ 2020-10-19 20:07 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Mon, Oct 19, 2020 at 11:17:19AM +0900, Dmitry Fomichev wrote:
> Add a new Boolean namespace property, "attached", to provide the most
> basic namespace attachment support. The default value for this new
> property is true. Also, implement the logic in the new CNS values to
> include/exclude namespaces based on this new property. The only thing
> missing is hooking up the actual Namespace Attachment command opcode,
> which will allow a user to toggle the "attached" flag per namespace.
> 
> The reason for not hooking up this command completely is because the
> NVMe specification requires the namespace management command to be
> supported if the namespace attachment command is supported.

Huh, the spec does require that, and that seems like an odd requirement
since it prevents dynamic namespace attach states in a static namespace
setup. I'm not sure why the spec assumes those two things go together,
but it sure enough does!

The implementation looks fine.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log
  2020-10-19  2:17 ` [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
  2020-10-19 19:22   ` Keith Busch
@ 2020-10-19 20:16   ` Klaus Jensen
  2020-10-20 23:04     ` Dmitry Fomichev
  1 sibling, 1 reply; 36+ messages in thread
From: Klaus Jensen @ 2020-10-19 20:16 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 11884 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> This log page becomes necessary to implement to allow checking for
> Zone Append command support in Zoned Namespace Command Set.
> 
> This commit adds the code to report this log page for NVM Command
> Set only. The parts that are specific to zoned operation will be
> added later in the series.
> 
> All incoming admin and i/o commands are now only processed if their
> corresponding support bits are set in this log. This provides an
> easy way to control what commands to support and what not to
> depending on set CC.CSS.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.h    |  1 +
>  hw/block/nvme.c       | 98 +++++++++++++++++++++++++++++++++++++++----
>  hw/block/trace-events |  2 +
>  include/block/nvme.h  | 19 +++++++++
>  4 files changed, 111 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 83734f4606..ea8c2f785d 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -29,6 +29,7 @@ typedef struct NvmeNamespace {
>      int32_t      bootindex;
>      int64_t      size;
>      NvmeIdNs     id_ns;
> +    const uint32_t *iocs;
>  
>      NvmeNamespaceParams params;
>  } NvmeNamespace;
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 9d30ca69dc..5a9493d89f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -111,6 +111,28 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
>      [NVME_TIMESTAMP]                = NVME_FEAT_CAP_CHANGE,
>  };
>  
> +static const uint32_t nvme_cse_acs[256] = {
> +    [NVME_ADM_CMD_DELETE_SQ]        = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_CREATE_SQ]        = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_DELETE_CQ]        = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_CREATE_CQ]        = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_IDENTIFY]         = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_SET_FEATURES]     = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_GET_LOG_PAGE]     = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
> +};

NVME_ADM_CMD_ABORT is missing. And since you added a (redundant) check
in nvme_admin_cmd that cheks this table, Abort is now an invalid
command.

Also, can you reorder it according to opcode instead of
pseudo-lexicographically?

> +
> +static const uint32_t nvme_cse_iocs_none[256] = {
> +};

[-pedantic] no need for the '= {}'

> +
> +static const uint32_t nvme_cse_iocs_nvm[256] = {
> +    [NVME_CMD_FLUSH]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
> +    [NVME_CMD_WRITE_ZEROES]         = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
> +    [NVME_CMD_WRITE]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
> +    [NVME_CMD_READ]                 = NVME_CMD_EFF_CSUPP,
> +};
> +
>  static void nvme_process_sq(void *opaque);
>  
>  static uint16_t nvme_cid(NvmeRequest *req)
> @@ -1032,10 +1054,6 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>      trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
>                            req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
>  
> -    if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
> -        return NVME_INVALID_OPCODE | NVME_DNR;
> -    }
> -

I would assume the device to respond with invalid opcode before
validating the nsid if it is an admin only device.

>      if (!nvme_nsid_valid(n, nsid)) {
>          return NVME_INVALID_NSID | NVME_DNR;
>      }
> @@ -1045,6 +1063,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> +    if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
> +        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
> +        return NVME_INVALID_OPCODE | NVME_DNR;
> +    }
> +
>      switch (req->cmd.opcode) {
>      case NVME_CMD_FLUSH:
>          return nvme_flush(n, req);
> @@ -1054,8 +1077,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>      case NVME_CMD_READ:
>          return nvme_rw(n, req);
>      default:
> -        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
> -        return NVME_INVALID_OPCODE | NVME_DNR;
> +        assert(false);
>      }
>  }
>  
> @@ -1291,6 +1313,39 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
>                      DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
> +                                 uint64_t off, NvmeRequest *req)
> +{
> +    NvmeEffectsLog log = {};

[-pedantic] and empty initializer list is not allowed, should be '{0}'.

> +    const uint32_t *src_iocs = NULL;
> +    uint32_t trans_len;
> +
> +    trace_pci_nvme_cmd_supp_and_effects_log_read();

This has just been traced in nvme_admin_cmd and this doesn't add any
additional info.

> +
> +    if (off >= sizeof(log)) {
> +        trace_pci_nvme_err_invalid_effects_log_offset(off);

Can we do `trace_pci_nvme_err_invalid_log_page_offset(off) instead? Then
we can easily reuse it in the other log pages.

> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    switch (NVME_CC_CSS(n->bar.cc)) {
> +    case NVME_CC_CSS_NVM:
> +        src_iocs = nvme_cse_iocs_nvm;
> +    case NVME_CC_CSS_ADMIN_ONLY:
> +        break;
> +    }
> +
> +    memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));
> +
> +    if (src_iocs) {
> +        memcpy(log.iocs, src_iocs, sizeof(log.iocs));
> +    }
> +
> +    trans_len = MIN(sizeof(log) - off, buf_len);
> +
> +    return nvme_dma(n, ((uint8_t *)&log) + off, trans_len,
> +                    DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
>  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeCmd *cmd = &req->cmd;
> @@ -1334,6 +1389,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>          return nvme_smart_info(n, rae, len, off, req);
>      case NVME_LOG_FW_SLOT_INFO:
>          return nvme_fw_log_info(n, len, off, req);
> +    case NVME_LOG_CMD_EFFECTS:
> +        return nvme_cmd_effects(n, len, off, req);
>      default:
>          trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1920,6 +1977,11 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>      trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
>                               nvme_adm_opc_str(req->cmd.opcode));
>  
> +    if (!(nvme_cse_acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
> +        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
> +        return NVME_INVALID_OPCODE | NVME_DNR;
> +    }
> +

This is the (redundant) check that effectively makes Abort an invalid
command.

>      switch (req->cmd.opcode) {
>      case NVME_ADM_CMD_DELETE_SQ:
>          return nvme_del_sq(n, req);
> @@ -1942,8 +2004,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>      case NVME_ADM_CMD_ASYNC_EV_REQ:
>          return nvme_aer(n, req);
>      default:
> -        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
> -        return NVME_INVALID_OPCODE | NVME_DNR;
> +        assert(false);
>      }
>  }
>  
> @@ -2031,6 +2092,23 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>      n->bar.cc = 0;
>  }
>  
> +static void nvme_select_ns_iocs(NvmeCtrl *n)
> +{
> +    NvmeNamespace *ns;
> +    int i;
> +
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
> +        ns->iocs = nvme_cse_iocs_none;
> +        if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
> +            ns->iocs = nvme_cse_iocs_nvm;
> +        }
> +    }
> +}
> +
>  static int nvme_start_ctrl(NvmeCtrl *n)
>  {
>      uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
> @@ -2129,6 +2207,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>  
>      QTAILQ_INIT(&n->aer_queue);
>  
> +    nvme_select_ns_iocs(n);
> +
>      return 0;
>  }
>  
> @@ -2737,7 +2817,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      id->acl = 3;
>      id->aerl = n->params.aerl;
>      id->frmw = (NVME_NUM_FW_SLOTS << 1) | NVME_FRMW_SLOT1_RO;
> -    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_EXTENDED;
> +    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_CSE | NVME_LPA_EXTENDED;
>  
>      /* recommended default value (~70 C) */
>      id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index fac5995d94..0ae9cb0d35 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -85,6 +85,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
>  pci_nvme_mmio_stopped(void) "cleared controller enable bit"
>  pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
>  pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
> +pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
>  
>  # nvme traces for error conditions
>  pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
> @@ -104,6 +105,7 @@ pci_nvme_err_invalid_prp(void) "invalid PRP"
>  pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> +pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
>  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 6de2d5aa75..4779495b7d 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -744,10 +744,27 @@ enum NvmeSmartWarn {
>      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
>  };
>  
> +typedef struct NvmeEffectsLog {
> +    uint32_t    acs[256];
> +    uint32_t    iocs[256];
> +    uint8_t     resv[2048];
> +} NvmeEffectsLog;
> +
> +enum {
> +    NVME_CMD_EFF_CSUPP      = 1 << 0,
> +    NVME_CMD_EFF_LBCC       = 1 << 1,
> +    NVME_CMD_EFF_NCC        = 1 << 2,
> +    NVME_CMD_EFF_NIC        = 1 << 3,
> +    NVME_CMD_EFF_CCC        = 1 << 4,
> +    NVME_CMD_EFF_CSE_MASK   = 3 << 16,
> +    NVME_CMD_EFF_UUID_SEL   = 1 << 19,
> +};
> +
>  enum NvmeLogIdentifier {
>      NVME_LOG_ERROR_INFO     = 0x01,
>      NVME_LOG_SMART_INFO     = 0x02,
>      NVME_LOG_FW_SLOT_INFO   = 0x03,
> +    NVME_LOG_CMD_EFFECTS    = 0x05,
>  };
>  
>  typedef struct QEMU_PACKED NvmePSD {
> @@ -860,6 +877,7 @@ enum NvmeIdCtrlFrmw {
>  
>  enum NvmeIdCtrlLpa {
>      NVME_LPA_NS_SMART = 1 << 0,
> +    NVME_LPA_CSE      = 1 << 1,
>      NVME_LPA_EXTENDED = 1 << 2,
>  };
>  
> @@ -1059,6 +1077,7 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
> -- 
> 2.21.0
> 
> 

-- 
One of us - No more doubt, silence or taboo about mental illness.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types
  2020-10-19  2:17 ` [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
  2020-10-19 19:51   ` Keith Busch
@ 2020-10-19 20:53   ` Klaus Jensen
  2020-10-21  1:50     ` Dmitry Fomichev
  1 sibling, 1 reply; 36+ messages in thread
From: Klaus Jensen @ 2020-10-19 20:53 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 9879 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Define the structures and constants required to implement
> Namespace Types support.
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set and enable
> detection for the NVM Command Set.
> 
> The new workflows for identify commands rely heavily on zero-filled
> identify structs. E.g., certain CNS commands are defined to return
> a zero-filled identify struct when an inactive namespace NSID
> is supplied.
> 
> Add a helper function in order to avoid code duplication when
> reporting zero-filled identify structures.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.c    |   2 +
>  hw/block/nvme-ns.h    |   1 +
>  hw/block/nvme.c       | 169 +++++++++++++++++++++++++++++++++++-------
>  hw/block/trace-events |   7 ++
>  include/block/nvme.h  |  65 ++++++++++++----
>  5 files changed, 202 insertions(+), 42 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index de735eb9f3..c0362426cc 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -41,6 +41,8 @@ static void nvme_ns_init(NvmeNamespace *ns)
>  
>      id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
>  
> +    ns->csi = NVME_CSI_NVM;
> +
>      /* no thin provisioning */
>      id_ns->ncap = id_ns->nsze;
>      id_ns->nuse = id_ns->ncap;
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index a38071884a..d795e44bab 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -31,6 +31,7 @@ typedef struct NvmeNamespace {
>      int64_t      size;
>      NvmeIdNs     id_ns;
>      const uint32_t *iocs;
> +    uint8_t      csi;
>  
>      NvmeNamespaceParams params;
>  } NvmeNamespace;
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 29139d8a17..ca0d0abf5c 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1503,6 +1503,13 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    uint8_t id[NVME_IDENTIFY_DATA_SIZE] = {};

[-pedantic] empty initializer list

> +
> +    return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
>  static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  {
>      trace_pci_nvme_identify_ctrl();
> @@ -1511,11 +1518,23 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>                      DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +
> +    trace_pci_nvme_identify_ctrl_csi(c->csi);
> +
> +    if (c->csi == NVME_CSI_NVM) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
> +
> +    return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
>  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> -    NvmeIdNs *id_ns, inactive = { 0 };
>      uint32_t nsid = le32_to_cpu(c->nsid);
>  
>      trace_pci_nvme_identify_ns(nsid);
> @@ -1526,23 +1545,46 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
>  
>      ns = nvme_ns(n, nsid);
>      if (unlikely(!ns)) {
> -        id_ns = &inactive;
> -    } else {
> -        id_ns = &ns->id_ns;
> +        return nvme_rpt_empty_id_struct(n, req);
>      }
>  
> -    return nvme_dma(n, (uint8_t *)id_ns, sizeof(NvmeIdNs),
> +    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
>                      DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeNamespace *ns;
> +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    uint32_t nsid = le32_to_cpu(c->nsid);
> +
> +    trace_pci_nvme_identify_ns_csi(nsid, c->csi);
> +
> +    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> +        return NVME_INVALID_NSID | NVME_DNR;
> +    }
> +
> +    ns = nvme_ns(n, nsid);
> +    if (unlikely(!ns)) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
> +
> +    if (c->csi == NVME_CSI_NVM) {
> +        return nvme_rpt_empty_id_struct(n, req);
> +    }
> +
> +    return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
>  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
>  {
> +    NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> -    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
>      uint32_t min_nsid = le32_to_cpu(c->nsid);
> -    uint32_t *list;
> -    uint16_t ret;
> -    int j = 0;
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};

[-pedantic] empty initializer list

> +    static const int data_len = sizeof(list);
> +    uint32_t *list_ptr = (uint32_t *)list;
> +    int i, j = 0;
>  
>      trace_pci_nvme_identify_nslist(min_nsid);
>  
> @@ -1556,20 +1598,54 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
>          return NVME_INVALID_NSID | NVME_DNR;
>      }
>  
> -    list = g_malloc0(data_len);
> -    for (int i = 1; i <= n->num_namespaces; i++) {
> -        if (i <= min_nsid || !nvme_ns(n, i)) {
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
>              continue;
>          }
> -        list[j++] = cpu_to_le32(i);
> +        if (ns->params.nsid < min_nsid) {

Since i == ns->params.nsid, this should be '<=' like the code you
removed. It really shouldn't be called min_nsid, but oh well.

> +            continue;
> +        }
> +        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
>          if (j == data_len / sizeof(uint32_t)) {
>              break;
>          }
>      }
> -    ret = nvme_dma(n, (uint8_t *)list, data_len, DMA_DIRECTION_FROM_DEVICE,
> -                   req);
> -    g_free(list);
> -    return ret;
> +
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
> +static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeNamespace *ns;
> +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> +    uint32_t min_nsid = le32_to_cpu(c->nsid);
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> +    static const int data_len = sizeof(list);
> +    uint32_t *list_ptr = (uint32_t *)list;
> +    int i, j = 0;
> +
> +    trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
> +
> +    if (c->csi != NVME_CSI_NVM) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +

This is missing the check for 0xffffffff and 0xfffffffe like above.

> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
> +        if (ns->params.nsid < min_nsid) {

Should be '<='.

> +            continue;
> +        }
> +        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
> +        if (j == data_len / sizeof(uint32_t)) {
> +            break;
> +        }
> +    }
> +
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
> @@ -1577,13 +1653,17 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>      NvmeNamespace *ns;
>      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
>      uint32_t nsid = le32_to_cpu(c->nsid);
> -    uint8_t list[NVME_IDENTIFY_DATA_SIZE];
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};

[-pedantic] empty initializer list

>  
>      struct data {
>          struct {
>              NvmeIdNsDescr hdr;
> -            uint8_t v[16];
> +            uint8_t v[NVME_NIDL_UUID];
>          } uuid;
> +        struct {
> +            NvmeIdNsDescr hdr;
> +            uint8_t v;
> +        } csi;
>      };
>  
>      struct data *ns_descrs = (struct data *)list;
> @@ -1599,19 +1679,31 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    memset(list, 0x0, sizeof(list));
> -
>      /*
>       * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
>       * structure, a Namespace UUID (nidt = 0x3) must be reported in the
>       * Namespace Identification Descriptor. Add the namespace UUID here.
>       */
>      ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
> -    ns_descrs->uuid.hdr.nidl = NVME_NIDT_UUID_LEN;
> -    memcpy(&ns_descrs->uuid.v, ns->params.uuid.data, NVME_NIDT_UUID_LEN);
> +    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
> +    memcpy(&ns_descrs->uuid.v, ns->params.uuid.data, NVME_NIDL_UUID);
>  
> -    return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
> -                    DMA_DIRECTION_FROM_DEVICE, req);
> +    ns_descrs->csi.hdr.nidt = NVME_NIDT_CSI;
> +    ns_descrs->csi.hdr.nidl = NVME_NIDL_CSI;
> +    ns_descrs->csi.v = ns->csi;
> +
> +    return nvme_dma(n, list, sizeof(list), DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
> +static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};

[-pedantic] empty initializer list

> +    static const int data_len = sizeof(list);
> +
> +    trace_pci_nvme_identify_cmd_set();
> +
> +    NVME_SET_CSI(*list, NVME_CSI_NVM);
> +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  

-- 
One of us - No more doubt, silence or taboo about mental illness.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants
  2020-10-19  2:17 ` [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants Dmitry Fomichev
  2020-10-19 20:07   ` Keith Busch
@ 2020-10-20  8:21   ` Klaus Jensen
  2020-10-20 23:09     ` Dmitry Fomichev
  1 sibling, 1 reply; 36+ messages in thread
From: Klaus Jensen @ 2020-10-20  8:21 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 2301 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:

(snip)

> CAP.CSS (together with the I/O Command Set data structure) defines
> what command sets are supported by the controller.
> 
> CC.CSS (together with Set Profile) can be set to enable a subset of
> the available command sets.
> 
> Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
> will still be attached (and thus marked as active).
> Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
> will still be attached (and thus marked as active).
> 
> However, any operation from a disabled command set will result in a
> Invalid Command Opcode.
> 

This part of the commit message seems irrelevant to the patch.

> Add a new Boolean namespace property, "attached", to provide the most
> basic namespace attachment support. The default value for this new
> property is true. Also, implement the logic in the new CNS values to
> include/exclude namespaces based on this new property. The only thing
> missing is hooking up the actual Namespace Attachment command opcode,
> which will allow a user to toggle the "attached" flag per namespace.
> 

Without Namespace Attachment support, the sole purpose of this parameter
is to allow unusable namespace IDs to be reported. I have no problems
with adding support for the additional CNS values. They will return
identical responses, but I think that is good enough for now.

When it is not really needed, we should be wary of adding a parameter
that is really hard to get rid of again.

> The reason for not hooking up this command completely is because the
> NVMe specification requires the namespace management command to be
> supported if the namespace attachment command is supported.
> 

There are many ways to support Namespace Management, and there are a lot
of quirks with each of them. Do we use a big blockdev and carve out
namespaces? Then, what are the semantics of an image resize operation?

Do we dynamically create blockdev devices - thats sounds pretty nice,
but might have other quirks and the attachment is not really persistent.

I think at least the "attached" parameter should be x-prefixed, but
better, leave it out for now until we know how we want Namespace
Attachment and Management to be implemented.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 10/11] hw/block/nvme: Separate read and write handlers
  2020-10-19  2:17 ` [PATCH v7 10/11] hw/block/nvme: Separate read and write handlers Dmitry Fomichev
@ 2020-10-20  8:28   ` Klaus Jensen
  2020-10-20 12:36     ` Keith Busch
  0 siblings, 1 reply; 36+ messages in thread
From: Klaus Jensen @ 2020-10-20  8:28 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 560 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> With ZNS support in place, the majority of code in nvme_rw() has
> become read- or write-specific. Move these parts to two separate
> handlers, nvme_read() and nvme_write() to make the code more
> readable and to remove multiple is_write checks that so far existed
> in the i/o path.
> 
> This is a refactoring patch, no change in functionality.
> 

This makes a lot of sense, totally Acked, but it might be better to move
it ahead as a preparation patch? It would make the zoned patch easier on
the eye.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 11/11] hw/block/nvme: Merge nvme_write_zeroes() with nvme_write()
  2020-10-19  2:17 ` [PATCH v7 11/11] hw/block/nvme: Merge nvme_write_zeroes() with nvme_write() Dmitry Fomichev
@ 2020-10-20  8:29   ` Klaus Jensen
  0 siblings, 0 replies; 36+ messages in thread
From: Klaus Jensen @ 2020-10-20  8:29 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 191 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> nvme_write() now handles WRITE, WRITE ZEROES and ZONE_APPEND.
> 

Same here, Acked, but maybe move it in front as a preparation patch as
well?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-19  2:17 ` [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
  2020-10-19  9:50   ` Klaus Jensen
  2020-10-19 12:33   ` Klaus Jensen
@ 2020-10-20 11:08   ` Klaus Jensen
  2020-10-21 10:26   ` Klaus Jensen
  3 siblings, 0 replies; 36+ messages in thread
From: Klaus Jensen @ 2020-10-20 11:08 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 2906 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 974aea33f7..fedfad595c 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -133,6 +320,12 @@ static Property nvme_ns_props[] = {
>      DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
>      DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),
>      DEFINE_PROP_BOOL("attached", NvmeNamespace, params.attached, true),
> +    DEFINE_PROP_BOOL("zoned", NvmeNamespace, params.zoned, false),

Instead of using a 'zoned' property here, can we add an 'iocs' or 'csi'
property in the namespace types patch? Then, in the future if we add
additional command sets we won't need another property (like 'kv').

> +    DEFINE_PROP_SIZE("zone_size", NvmeNamespace, params.zone_size_bs,
> +                     NVME_DEFAULT_ZONE_SIZE),
> +    DEFINE_PROP_SIZE("zone_capacity", NvmeNamespace, params.zone_cap_bs, 0),

I would like that the zone_size and zone_capacity were named zoned.zsze
and zoned.zcap and were in terms of logical blocks, like in the spec.
Putting them in a pseudo-namespace makes it clear that the options
affect the zoned command set and reduces the risk of anything clashing
with the addition of other command sets (like 'kv') in the future.

> +    DEFINE_PROP_BOOL("cross_zone_read", NvmeNamespace,
> +                     params.cross_zone_read, false),

Instead of cluttering the parameters with a bunch of these when others
zone operational characteristics are added, can we use a 'zoned.zoc'
parameter that matches the spec?

> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 93728e51b3..34d0d0250d 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -3079,6 +4001,9 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64),
>      DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
>      DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
> +    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
> +    DEFINE_PROP_SIZE32("zone_append_size_limit", NvmeCtrl, params.zasl_bs,
> +                       NVME_DEFAULT_MAX_ZA_SIZE),

Similar to my reasoning above, I would like this to be zoned.zasl and in
terms of logical blocks like the spec. Also, I think '0' is a better
default since zero values typically identify a default value in the spec
as well.

I know this might sound like bikeshedding, but I wanna make sure that we
get the parameters right since we cannot get rid of them once they are
there. Following the definitions of the spec makes it very clear what
their meaning are and should be. 'mdts' is currently the only other
parameter like this, but that is also specified as in the spec, and not
as an absolute value.

My preference also applies to subsequent patches, like using `zoned.mor`
and `zoned.mar` for the resource limits.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 10/11] hw/block/nvme: Separate read and write handlers
  2020-10-20  8:28   ` Klaus Jensen
@ 2020-10-20 12:36     ` Keith Busch
  2020-10-20 23:05       ` Dmitry Fomichev
  0 siblings, 1 reply; 36+ messages in thread
From: Keith Busch @ 2020-10-20 12:36 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Dmitry Fomichev, Klaus Jensen, qemu-devel, Maxim Levitsky,
	Alistair Francis, Philippe Mathieu-Daudé,
	Matias Bjorling

On Tue, Oct 20, 2020 at 10:28:22AM +0200, Klaus Jensen wrote:
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > With ZNS support in place, the majority of code in nvme_rw() has
> > become read- or write-specific. Move these parts to two separate
> > handlers, nvme_read() and nvme_write() to make the code more
> > readable and to remove multiple is_write checks that so far existed
> > in the i/o path.
> > 
> > This is a refactoring patch, no change in functionality.
> > 
> 
> This makes a lot of sense, totally Acked, but it might be better to move
> it ahead as a preparation patch? It would make the zoned patch easier on
> the eye.

I agree with the suggestion.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v7 08/11] hw/block/nvme: Add injection of Offline/Read-Only zones
  2020-10-19 11:42   ` Klaus Jensen
@ 2020-10-20 23:01     ` Dmitry Fomichev
  0 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-20 23:01 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Monday, October 19, 2020 7:43 AM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v7 08/11] hw/block/nvme: Add injection of Offline/Read-
> Only zones
> 
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > ZNS specification defines two zone conditions for the zones that no
> > longer can function properly, possibly because of flash wear or other
> > internal fault. It is useful to be able to "inject" a small number of
> > such zones for testing purposes.
> >
> > This commit defines two optional device properties, "offline_zones"
> > and "rdonly_zones". Users can assign non-zero values to these variables
> > to specify the number of zones to be initialized as Offline or
> > Read-Only. The actual number of injected zones may be smaller than the
> > requested amount - Read-Only and Offline counts are expected to be much
> > smaller than the total number of zones on a drive.
> >
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme-ns.c | 64
> ++++++++++++++++++++++++++++++++++++++++++++++
> >  hw/block/nvme-ns.h |  2 ++
> >  2 files changed, 66 insertions(+)
> >
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 255ded2b43..d050f97909 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -21,6 +21,7 @@
> >  #include "sysemu/sysemu.h"
> >  #include "sysemu/block-backend.h"
> >  #include "qapi/error.h"
> > +#include "crypto/random.h"
> >
> >  #include "hw/qdev-properties.h"
> >  #include "hw/qdev-core.h"
> > @@ -132,6 +133,32 @@ static int
> nvme_calc_zone_geometry(NvmeNamespace *ns, Error **errp)
> >          return -1;
> >      }
> >
> > +    if (ns->params.zd_extension_size) {
> > +        if (ns->params.zd_extension_size & 0x3f) {
> > +            error_setg(errp,
> > +                "zone descriptor extension size must be a multiple of 64B");
> > +            return -1;
> > +        }
> > +        if ((ns->params.zd_extension_size >> 6) > 0xff) {
> > +            error_setg(errp, "zone descriptor extension size is too large");
> > +            return -1;
> > +        }
> > +    }
> 
> Looks like this should have been added in the previous patch.

Right, this belongs to ZDE patch. 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log
  2020-10-19 20:16   ` Klaus Jensen
@ 2020-10-20 23:04     ` Dmitry Fomichev
  0 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-20 23:04 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Monday, October 19, 2020 4:16 PM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v7 01/11] hw/block/nvme: Add Commands Supported
> and Effects log
> 
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > This log page becomes necessary to implement to allow checking for
> > Zone Append command support in Zoned Namespace Command Set.
> >
> > This commit adds the code to report this log page for NVM Command
> > Set only. The parts that are specific to zoned operation will be
> > added later in the series.
> >
> > All incoming admin and i/o commands are now only processed if their
> > corresponding support bits are set in this log. This provides an
> > easy way to control what commands to support and what not to
> > depending on set CC.CSS.
> >
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme-ns.h    |  1 +
> >  hw/block/nvme.c       | 98 +++++++++++++++++++++++++++++++++++++++--
> --
> >  hw/block/trace-events |  2 +
> >  include/block/nvme.h  | 19 +++++++++
> >  4 files changed, 111 insertions(+), 9 deletions(-)
> >
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index 83734f4606..ea8c2f785d 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -29,6 +29,7 @@ typedef struct NvmeNamespace {
> >      int32_t      bootindex;
> >      int64_t      size;
> >      NvmeIdNs     id_ns;
> > +    const uint32_t *iocs;
> >
> >      NvmeNamespaceParams params;
> >  } NvmeNamespace;
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 9d30ca69dc..5a9493d89f 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -111,6 +111,28 @@ static const uint32_t
> nvme_feature_cap[NVME_FID_MAX] = {
> >      [NVME_TIMESTAMP]                = NVME_FEAT_CAP_CHANGE,
> >  };
> >
> > +static const uint32_t nvme_cse_acs[256] = {
> > +    [NVME_ADM_CMD_DELETE_SQ]        = NVME_CMD_EFF_CSUPP,
> > +    [NVME_ADM_CMD_CREATE_SQ]        = NVME_CMD_EFF_CSUPP,
> > +    [NVME_ADM_CMD_DELETE_CQ]        = NVME_CMD_EFF_CSUPP,
> > +    [NVME_ADM_CMD_CREATE_CQ]        = NVME_CMD_EFF_CSUPP,
> > +    [NVME_ADM_CMD_IDENTIFY]         = NVME_CMD_EFF_CSUPP,
> > +    [NVME_ADM_CMD_SET_FEATURES]     = NVME_CMD_EFF_CSUPP,
> > +    [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
> > +    [NVME_ADM_CMD_GET_LOG_PAGE]     = NVME_CMD_EFF_CSUPP,
> > +    [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
> > +};
> 
> NVME_ADM_CMD_ABORT is missing. And since you added a (redundant)
> check
> in nvme_admin_cmd that cheks this table, Abort is now an invalid
> command.

Adding the ABORT, thanks. I think this code was conceived before abort was
merged, this is why it is missing.

> 
> Also, can you reorder it according to opcode instead of
> pseudo-lexicographically?

Ok, will move ...GET_LOG_PAGE  that is now out or order.

> 
> > +
> > +static const uint32_t nvme_cse_iocs_none[256] = {
> > +};
> 
> [-pedantic] no need for the '= {}'

OK.

> 
> > +
> > +static const uint32_t nvme_cse_iocs_nvm[256] = {
> > +    [NVME_CMD_FLUSH]                = NVME_CMD_EFF_CSUPP |
> NVME_CMD_EFF_LBCC,
> > +    [NVME_CMD_WRITE_ZEROES]         = NVME_CMD_EFF_CSUPP |
> NVME_CMD_EFF_LBCC,
> > +    [NVME_CMD_WRITE]                = NVME_CMD_EFF_CSUPP |
> NVME_CMD_EFF_LBCC,
> > +    [NVME_CMD_READ]                 = NVME_CMD_EFF_CSUPP,
> > +};
> > +
> >  static void nvme_process_sq(void *opaque);
> >
> >  static uint16_t nvme_cid(NvmeRequest *req)
> > @@ -1032,10 +1054,6 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n,
> NvmeRequest *req)
> >      trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
> >                            req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
> >
> > -    if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
> > -        return NVME_INVALID_OPCODE | NVME_DNR;
> > -    }
> > -
> 
> I would assume the device to respond with invalid opcode before
> validating the nsid if it is an admin only device.

The host can't make any assumptions about the ordering of validation
checks performed by the controller. In the case of receiving an i/o
command with invalid NSID when CC.CSS == ADMIN_ONLY, both
Invalid Opcode and Invalid NSID error status codes should be acceptable.

> 
> >      if (!nvme_nsid_valid(n, nsid)) {
> >          return NVME_INVALID_NSID | NVME_DNR;
> >      }
> > @@ -1045,6 +1063,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n,
> NvmeRequest *req)
> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> >
> > +    if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
> > +        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
> > +        return NVME_INVALID_OPCODE | NVME_DNR;
> > +    }
> > +
> >      switch (req->cmd.opcode) {
> >      case NVME_CMD_FLUSH:
> >          return nvme_flush(n, req);
> > @@ -1054,8 +1077,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n,
> NvmeRequest *req)
> >      case NVME_CMD_READ:
> >          return nvme_rw(n, req);
> >      default:
> > -        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
> > -        return NVME_INVALID_OPCODE | NVME_DNR;
> > +        assert(false);
> >      }
> >  }
> >
> > @@ -1291,6 +1313,39 @@ static uint16_t nvme_error_info(NvmeCtrl *n,
> uint8_t rae, uint32_t buf_len,
> >                      DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >
> > +static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
> > +                                 uint64_t off, NvmeRequest *req)
> > +{
> > +    NvmeEffectsLog log = {};
> 
> [-pedantic] and empty initializer list is not allowed, should be '{0}'.

Could you please point me to a document where it is not allowed?
I can see around 900 occurrences of this construct in the current QEMU
C code...

> 
> > +    const uint32_t *src_iocs = NULL;
> > +    uint32_t trans_len;
> > +
> > +    trace_pci_nvme_cmd_supp_and_effects_log_read();
> 
> This has just been traced in nvme_admin_cmd and this doesn't add any
> additional info.
> 

Ok, this one is not really needed, will remove.

> > +
> > +    if (off >= sizeof(log)) {
> > +        trace_pci_nvme_err_invalid_effects_log_offset(off);
> 
> Can we do `trace_pci_nvme_err_invalid_log_page_offset(off) instead? Then
> we can easily reuse it in the other log pages.

Will rename.

> 
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> > +    switch (NVME_CC_CSS(n->bar.cc)) {
> > +    case NVME_CC_CSS_NVM:
> > +        src_iocs = nvme_cse_iocs_nvm;
> > +    case NVME_CC_CSS_ADMIN_ONLY:
> > +        break;
> > +    }
> > +
> > +    memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));
> > +
> > +    if (src_iocs) {
> > +        memcpy(log.iocs, src_iocs, sizeof(log.iocs));
> > +    }
> > +
> > +    trans_len = MIN(sizeof(log) - off, buf_len);
> > +
> > +    return nvme_dma(n, ((uint8_t *)&log) + off, trans_len,
> > +                    DMA_DIRECTION_FROM_DEVICE, req);
> > +}
> > +
> >  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      NvmeCmd *cmd = &req->cmd;
> > @@ -1334,6 +1389,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n,
> NvmeRequest *req)
> >          return nvme_smart_info(n, rae, len, off, req);
> >      case NVME_LOG_FW_SLOT_INFO:
> >          return nvme_fw_log_info(n, len, off, req);
> > +    case NVME_LOG_CMD_EFFECTS:
> > +        return nvme_cmd_effects(n, len, off, req);
> >      default:
> >          trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1920,6 +1977,11 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n,
> NvmeRequest *req)
> >      trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req-
> >cmd.opcode,
> >                               nvme_adm_opc_str(req->cmd.opcode));
> >
> > +    if (!(nvme_cse_acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
> > +        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
> > +        return NVME_INVALID_OPCODE | NVME_DNR;
> > +    }
> > +
> 
> This is the (redundant) check that effectively makes Abort an invalid
> command.

This check not redundant - I think it is a better alternative to checking
for CC.CSS == ADMIN_ONLY. This way, the actual set of supported
commands is always in sync with what is advertised in CSE log.
And this approach makes it easy to support i/o Command Set Specific
admin commands in the future. For that, ns->acs can be introduced along
the same lines as ns->iocs. For now, of course, this is not necessary.

> 
> >      switch (req->cmd.opcode) {
> >      case NVME_ADM_CMD_DELETE_SQ:
> >          return nvme_del_sq(n, req);
> > @@ -1942,8 +2004,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n,
> NvmeRequest *req)
> >      case NVME_ADM_CMD_ASYNC_EV_REQ:
> >          return nvme_aer(n, req);
> >      default:
> > -        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
> > -        return NVME_INVALID_OPCODE | NVME_DNR;
> > +        assert(false);
> >      }
> >  }
> >
> > @@ -2031,6 +2092,23 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
> >      n->bar.cc = 0;
> >  }
> >
> > +static void nvme_select_ns_iocs(NvmeCtrl *n)
> > +{
> > +    NvmeNamespace *ns;
> > +    int i;
> > +
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        ns = nvme_ns(n, i);
> > +        if (!ns) {
> > +            continue;
> > +        }
> > +        ns->iocs = nvme_cse_iocs_none;
> > +        if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
> > +            ns->iocs = nvme_cse_iocs_nvm;
> > +        }
> > +    }
> > +}
> > +
> >  static int nvme_start_ctrl(NvmeCtrl *n)
> >  {
> >      uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
> > @@ -2129,6 +2207,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
> >
> >      QTAILQ_INIT(&n->aer_queue);
> >
> > +    nvme_select_ns_iocs(n);
> > +
> >      return 0;
> >  }
> >
> > @@ -2737,7 +2817,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice
> *pci_dev)
> >      id->acl = 3;
> >      id->aerl = n->params.aerl;
> >      id->frmw = (NVME_NUM_FW_SLOTS << 1) | NVME_FRMW_SLOT1_RO;
> > -    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_EXTENDED;
> > +    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_CSE |
> NVME_LPA_EXTENDED;
> >
> >      /* recommended default value (~70 C) */
> >      id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index fac5995d94..0ae9cb0d35 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -85,6 +85,7 @@ pci_nvme_mmio_start_success(void) "setting
> controller enable bit succeeded"
> >  pci_nvme_mmio_stopped(void) "cleared controller enable bit"
> >  pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
> >  pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
> > +pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported
> and effects log read"
> >
> >  # nvme traces for error conditions
> >  pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
> > @@ -104,6 +105,7 @@ pci_nvme_err_invalid_prp(void) "invalid PRP"
> >  pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
> >  pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode
> 0x%"PRIx8""
> >  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t
> limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> > +pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands
> supported and effects log offset must be 0, got %"PRIu64""
> >  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue
> deletion, sid=%"PRIu16""
> >  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating
> submission queue, invalid cqid=%"PRIu16""
> >  pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating
> submission queue, invalid sqid=%"PRIu16""
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 6de2d5aa75..4779495b7d 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -744,10 +744,27 @@ enum NvmeSmartWarn {
> >      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
> >  };
> >
> > +typedef struct NvmeEffectsLog {
> > +    uint32_t    acs[256];
> > +    uint32_t    iocs[256];
> > +    uint8_t     resv[2048];
> > +} NvmeEffectsLog;
> > +
> > +enum {
> > +    NVME_CMD_EFF_CSUPP      = 1 << 0,
> > +    NVME_CMD_EFF_LBCC       = 1 << 1,
> > +    NVME_CMD_EFF_NCC        = 1 << 2,
> > +    NVME_CMD_EFF_NIC        = 1 << 3,
> > +    NVME_CMD_EFF_CCC        = 1 << 4,
> > +    NVME_CMD_EFF_CSE_MASK   = 3 << 16,
> > +    NVME_CMD_EFF_UUID_SEL   = 1 << 19,
> > +};
> > +
> >  enum NvmeLogIdentifier {
> >      NVME_LOG_ERROR_INFO     = 0x01,
> >      NVME_LOG_SMART_INFO     = 0x02,
> >      NVME_LOG_FW_SLOT_INFO   = 0x03,
> > +    NVME_LOG_CMD_EFFECTS    = 0x05,
> >  };
> >
> >  typedef struct QEMU_PACKED NvmePSD {
> > @@ -860,6 +877,7 @@ enum NvmeIdCtrlFrmw {
> >
> >  enum NvmeIdCtrlLpa {
> >      NVME_LPA_NS_SMART = 1 << 0,
> > +    NVME_LPA_CSE      = 1 << 1,
> >      NVME_LPA_EXTENDED = 1 << 2,
> >  };
> >
> > @@ -1059,6 +1077,7 @@ static inline void _nvme_check_size(void)
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
> > +    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
> > --
> > 2.21.0
> >
> >
> 
> --
> One of us - No more doubt, silence or taboo about mental illness.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v7 10/11] hw/block/nvme: Separate read and write handlers
  2020-10-20 12:36     ` Keith Busch
@ 2020-10-20 23:05       ` Dmitry Fomichev
  0 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-20 23:05 UTC (permalink / raw)
  To: Keith Busch, Klaus Jensen
  Cc: Kevin Wolf, Fam Zheng, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Keith Busch <kbusch@kernel.org>
> Sent: Tuesday, October 20, 2020 8:36 AM
> To: Klaus Jensen <its@irrelevant.dk>
> Cc: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v7 10/11] hw/block/nvme: Separate read and write
> handlers
> 
> On Tue, Oct 20, 2020 at 10:28:22AM +0200, Klaus Jensen wrote:
> > On Oct 19 11:17, Dmitry Fomichev wrote:
> > > With ZNS support in place, the majority of code in nvme_rw() has
> > > become read- or write-specific. Move these parts to two separate
> > > handlers, nvme_read() and nvme_write() to make the code more
> > > readable and to remove multiple is_write checks that so far existed
> > > in the i/o path.
> > >
> > > This is a refactoring patch, no change in functionality.
> > >
> >
> > This makes a lot of sense, totally Acked, but it might be better to move
> > it ahead as a preparation patch? It would make the zoned patch easier on
> > the eye.
> 
> I agree with the suggestion.

Ok, will move them to the front of the series.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants
  2020-10-20  8:21   ` Klaus Jensen
@ 2020-10-20 23:09     ` Dmitry Fomichev
  0 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-20 23:09 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Tuesday, October 20, 2020 4:21 AM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v7 04/11] hw/block/nvme: Support allocated CNS
> command variants
> 
> On Oct 19 11:17, Dmitry Fomichev wrote:
> 
> (snip)
> 
> > CAP.CSS (together with the I/O Command Set data structure) defines
> > what command sets are supported by the controller.
> >
> > CC.CSS (together with Set Profile) can be set to enable a subset of
> > the available command sets.
> >
> > Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
> > will still be attached (and thus marked as active).
> > Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
> > will still be attached (and thus marked as active).
> >
> > However, any operation from a disabled command set will result in a
> > Invalid Command Opcode.
> >
> 
> This part of the commit message seems irrelevant to the patch.
> 
> > Add a new Boolean namespace property, "attached", to provide the most
> > basic namespace attachment support. The default value for this new
> > property is true. Also, implement the logic in the new CNS values to
> > include/exclude namespaces based on this new property. The only thing
> > missing is hooking up the actual Namespace Attachment command opcode,
> > which will allow a user to toggle the "attached" flag per namespace.
> >
> 
> Without Namespace Attachment support, the sole purpose of this
> parameter
> is to allow unusable namespace IDs to be reported. I have no problems
> with adding support for the additional CNS values. They will return
> identical responses, but I think that is good enough for now.
> 
> When it is not really needed, we should be wary of adding a parameter
> that is really hard to get rid of again.
> 
> > The reason for not hooking up this command completely is because the
> > NVMe specification requires the namespace management command to be
> > supported if the namespace attachment command is supported.
> >
> 
> There are many ways to support Namespace Management, and there are a
> lot
> of quirks with each of them. Do we use a big blockdev and carve out
> namespaces? Then, what are the semantics of an image resize operation?
> 
> Do we dynamically create blockdev devices - thats sounds pretty nice,
> but might have other quirks and the attachment is not really persistent.
> 
> I think at least the "attached" parameter should be x-prefixed, but
> better, leave it out for now until we know how we want Namespace
> Attachment and Management to be implemented.

I don't mind leaving this property out. I used it for testing the patch and it
could, in theory, be manipulated by an external process doing NS
Management, but, as you said, there is no certainty about now NS
Management will be implemented and any related CLI interface should
better be added as a part of this future work, not now.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types
  2020-10-19 20:53   ` Klaus Jensen
@ 2020-10-21  1:50     ` Dmitry Fomichev
  0 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-21  1:50 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Monday, October 19, 2020 4:54 PM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v7 03/11] hw/block/nvme: Add support for Namespace
> Types
> 
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> >
> > Define the structures and constants required to implement
> > Namespace Types support.
> >
> > Namespace Types introduce a new command set, "I/O Command Sets",
> > that allows the host to retrieve the command sets associated with
> > a namespace. Introduce support for the command set and enable
> > detection for the NVM Command Set.
> >
> > The new workflows for identify commands rely heavily on zero-filled
> > identify structs. E.g., certain CNS commands are defined to return
> > a zero-filled identify struct when an inactive namespace NSID
> > is supplied.
> >
> > Add a helper function in order to avoid code duplication when
> > reporting zero-filled identify structures.
> >
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme-ns.c    |   2 +
> >  hw/block/nvme-ns.h    |   1 +
> >  hw/block/nvme.c       | 169 +++++++++++++++++++++++++++++++++++-------
> >  hw/block/trace-events |   7 ++
> >  include/block/nvme.h  |  65 ++++++++++++----
> >  5 files changed, 202 insertions(+), 42 deletions(-)
> >
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index de735eb9f3..c0362426cc 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -41,6 +41,8 @@ static void nvme_ns_init(NvmeNamespace *ns)
> >
> >      id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
> >
> > +    ns->csi = NVME_CSI_NVM;
> > +
> >      /* no thin provisioning */
> >      id_ns->ncap = id_ns->nsze;
> >      id_ns->nuse = id_ns->ncap;
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index a38071884a..d795e44bab 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -31,6 +31,7 @@ typedef struct NvmeNamespace {
> >      int64_t      size;
> >      NvmeIdNs     id_ns;
> >      const uint32_t *iocs;
> > +    uint8_t      csi;
> >
> >      NvmeNamespaceParams params;
> >  } NvmeNamespace;
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 29139d8a17..ca0d0abf5c 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1503,6 +1503,13 @@ static uint16_t nvme_create_cq(NvmeCtrl *n,
> NvmeRequest *req)
> >      return NVME_SUCCESS;
> >  }
> >
> > +static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest
> *req)
> > +{
> > +    uint8_t id[NVME_IDENTIFY_DATA_SIZE] = {};
> 
> [-pedantic] empty initializer list
> 
> > +
> > +    return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE,
> req);
> > +}
> > +
> >  static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      trace_pci_nvme_identify_ctrl();
> > @@ -1511,11 +1518,23 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl
> *n, NvmeRequest *req)
> >                      DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >
> > +static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > +
> > +    trace_pci_nvme_identify_ctrl_csi(c->csi);
> > +
> > +    if (c->csi == NVME_CSI_NVM) {
> > +        return nvme_rpt_empty_id_struct(n, req);
> > +    }
> > +
> > +    return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> >  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      NvmeNamespace *ns;
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > -    NvmeIdNs *id_ns, inactive = { 0 };
> >      uint32_t nsid = le32_to_cpu(c->nsid);
> >
> >      trace_pci_nvme_identify_ns(nsid);
> > @@ -1526,23 +1545,46 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n,
> NvmeRequest *req)
> >
> >      ns = nvme_ns(n, nsid);
> >      if (unlikely(!ns)) {
> > -        id_ns = &inactive;
> > -    } else {
> > -        id_ns = &ns->id_ns;
> > +        return nvme_rpt_empty_id_struct(n, req);
> >      }
> >
> > -    return nvme_dma(n, (uint8_t *)id_ns, sizeof(NvmeIdNs),
> > +    return nvme_dma(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs),
> >                      DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >
> > +static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    NvmeNamespace *ns;
> > +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > +    uint32_t nsid = le32_to_cpu(c->nsid);
> > +
> > +    trace_pci_nvme_identify_ns_csi(nsid, c->csi);
> > +
> > +    if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> > +        return NVME_INVALID_NSID | NVME_DNR;
> > +    }
> > +
> > +    ns = nvme_ns(n, nsid);
> > +    if (unlikely(!ns)) {
> > +        return nvme_rpt_empty_id_struct(n, req);
> > +    }
> > +
> > +    if (c->csi == NVME_CSI_NVM) {
> > +        return nvme_rpt_empty_id_struct(n, req);
> > +    }
> > +
> > +    return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> >  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
> >  {
> > +    NvmeNamespace *ns;
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > -    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> >      uint32_t min_nsid = le32_to_cpu(c->nsid);
> > -    uint32_t *list;
> > -    uint16_t ret;
> > -    int j = 0;
> > +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> 
> [-pedantic] empty initializer list
> 
> > +    static const int data_len = sizeof(list);
> > +    uint32_t *list_ptr = (uint32_t *)list;
> > +    int i, j = 0;
> >
> >      trace_pci_nvme_identify_nslist(min_nsid);
> >
> > @@ -1556,20 +1598,54 @@ static uint16_t nvme_identify_nslist(NvmeCtrl
> *n, NvmeRequest *req)
> >          return NVME_INVALID_NSID | NVME_DNR;
> >      }
> >
> > -    list = g_malloc0(data_len);
> > -    for (int i = 1; i <= n->num_namespaces; i++) {
> > -        if (i <= min_nsid || !nvme_ns(n, i)) {
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        ns = nvme_ns(n, i);
> > +        if (!ns) {
> >              continue;
> >          }
> > -        list[j++] = cpu_to_le32(i);
> > +        if (ns->params.nsid < min_nsid) {
> 
> Since i == ns->params.nsid, this should be '<=' like the code you
> removed. It really shouldn't be called min_nsid, but oh well.

Right, needs to be <=. We can rename min_nsid to start_nsid or similer
since we are touching this code anyway.
> 
> > +            continue;
> > +        }
> > +        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
> >          if (j == data_len / sizeof(uint32_t)) {
> >              break;
> >          }
> >      }
> > -    ret = nvme_dma(n, (uint8_t *)list, data_len,
> DMA_DIRECTION_FROM_DEVICE,
> > -                   req);
> > -    g_free(list);
> > -    return ret;
> > +
> > +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE,
> req);
> > +}
> > +
> > +static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    NvmeNamespace *ns;
> > +    NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > +    uint32_t min_nsid = le32_to_cpu(c->nsid);
> > +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> > +    static const int data_len = sizeof(list);
> > +    uint32_t *list_ptr = (uint32_t *)list;
> > +    int i, j = 0;
> > +
> > +    trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
> > +
> > +    if (c->csi != NVME_CSI_NVM) {
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> 
> This is missing the check for 0xffffffff and 0xfffffffe like above.

Will add the similar check here.

> 
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        ns = nvme_ns(n, i);
> > +        if (!ns) {
> > +            continue;
> > +        }
> > +        if (ns->params.nsid < min_nsid) {
> 
> Should be '<='.
> 
> > +            continue;
> > +        }
> > +        list_ptr[j++] = cpu_to_le32(ns->params.nsid);
> > +        if (j == data_len / sizeof(uint32_t)) {
> > +            break;
> > +        }
> > +    }
> > +
> > +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE,
> req);
> >  }
> >
> >  static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest
> *req)
> > @@ -1577,13 +1653,17 @@ static uint16_t
> nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
> >      NvmeNamespace *ns;
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> >      uint32_t nsid = le32_to_cpu(c->nsid);
> > -    uint8_t list[NVME_IDENTIFY_DATA_SIZE];
> > +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> 
> [-pedantic] empty initializer list
> 
> >
> >      struct data {
> >          struct {
> >              NvmeIdNsDescr hdr;
> > -            uint8_t v[16];
> > +            uint8_t v[NVME_NIDL_UUID];
> >          } uuid;
> > +        struct {
> > +            NvmeIdNsDescr hdr;
> > +            uint8_t v;
> > +        } csi;
> >      };
> >
> >      struct data *ns_descrs = (struct data *)list;
> > @@ -1599,19 +1679,31 @@ static uint16_t
> nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> >
> > -    memset(list, 0x0, sizeof(list));
> > -
> >      /*
> >       * Because the NGUID and EUI64 fields are 0 in the Identify Namespace
> data
> >       * structure, a Namespace UUID (nidt = 0x3) must be reported in the
> >       * Namespace Identification Descriptor. Add the namespace UUID here.
> >       */
> >      ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
> > -    ns_descrs->uuid.hdr.nidl = NVME_NIDT_UUID_LEN;
> > -    memcpy(&ns_descrs->uuid.v, ns->params.uuid.data,
> NVME_NIDT_UUID_LEN);
> > +    ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
> > +    memcpy(&ns_descrs->uuid.v, ns->params.uuid.data,
> NVME_NIDL_UUID);
> >
> > -    return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
> > -                    DMA_DIRECTION_FROM_DEVICE, req);
> > +    ns_descrs->csi.hdr.nidt = NVME_NIDT_CSI;
> > +    ns_descrs->csi.hdr.nidl = NVME_NIDL_CSI;
> > +    ns_descrs->csi.v = ns->csi;
> > +
> > +    return nvme_dma(n, list, sizeof(list), DMA_DIRECTION_FROM_DEVICE,
> req);
> > +}
> > +
> > +static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> 
> [-pedantic] empty initializer list
> 
> > +    static const int data_len = sizeof(list);
> > +
> > +    trace_pci_nvme_identify_cmd_set();
> > +
> > +    NVME_SET_CSI(*list, NVME_CSI_NVM);
> > +    return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE,
> req);
> >  }
> >
> 
> --
> One of us - No more doubt, silence or taboo about mental illness.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-19  2:17 ` [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
                     ` (2 preceding siblings ...)
  2020-10-20 11:08   ` Klaus Jensen
@ 2020-10-21 10:26   ` Klaus Jensen
  2020-10-21 23:19     ` Dmitry Fomichev
  3 siblings, 1 reply; 36+ messages in thread
From: Klaus Jensen @ 2020-10-21 10:26 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

[-- Attachment #1: Type: text/plain, Size: 1985 bytes --]

On Oct 19 11:17, Dmitry Fomichev wrote:
> +/*
> + * Close or finish all the zones that are currently open.
> + */
> +static void nvme_zoned_clear_ns(NvmeNamespace *ns)
> +{
> +    NvmeZone *zone;
> +    uint32_t set_state;
> +    int i;
> +
> +    zone = ns->zone_array;
> +    for (i = 0; i < ns->num_zones; i++, zone++) {
> +        switch (nvme_get_zone_state(zone)) {
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +            QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
> +            break;
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +            QTAILQ_REMOVE(&ns->exp_open_zones, zone, entry);
> +            break;
> +        case NVME_ZONE_STATE_CLOSED:
> +            /* fall through */
> +        default:
> +            continue;
> +        }
> +
> +        if (zone->d.wp == zone->d.zslba) {
> +            set_state = NVME_ZONE_STATE_EMPTY;
> +        } else {
> +            set_state = NVME_ZONE_STATE_CLOSED;
> +        }
> +
> +        switch (set_state) {
> +        case NVME_ZONE_STATE_CLOSED:
> +            trace_pci_nvme_clear_ns_close(nvme_get_zone_state(zone),
> +                                          zone->d.zslba);
> +            QTAILQ_INSERT_TAIL(&ns->closed_zones, zone, entry);
> +            break;
> +        case NVME_ZONE_STATE_EMPTY:
> +            trace_pci_nvme_clear_ns_reset(nvme_get_zone_state(zone),
> +                                          zone->d.zslba);
> +            break;
> +        case NVME_ZONE_STATE_FULL:
> +            trace_pci_nvme_clear_ns_full(nvme_get_zone_state(zone),
> +                                         zone->d.zslba);
> +            zone->d.wp = nvme_zone_wr_boundary(zone);
> +            QTAILQ_INSERT_TAIL(&ns->full_zones, zone, entry);
> +        }

No need for the switch here - just add to the closed list in the
conditional.

The NVME_ZONE_STATE_FULL case is unreachable.

> +
> +        zone->w_ptr = zone->d.wp;
> +        nvme_set_zone_state(zone, set_state);
> +    }
> +}

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set
  2020-10-21 10:26   ` Klaus Jensen
@ 2020-10-21 23:19     ` Dmitry Fomichev
  0 siblings, 0 replies; 36+ messages in thread
From: Dmitry Fomichev @ 2020-10-21 23:19 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, Damien Le Moal, qemu-block, Niklas Cassel,
	Klaus Jensen, qemu-devel, Maxim Levitsky, Alistair Francis,
	Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Wednesday, October 21, 2020 6:26 AM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace
> Command Set
> 
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > +/*
> > + * Close or finish all the zones that are currently open.
> > + */
> > +static void nvme_zoned_clear_ns(NvmeNamespace *ns)
> > +{
> > +    NvmeZone *zone;
> > +    uint32_t set_state;
> > +    int i;
> > +
> > +    zone = ns->zone_array;
> > +    for (i = 0; i < ns->num_zones; i++, zone++) {
> > +        switch (nvme_get_zone_state(zone)) {
> > +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> > +            QTAILQ_REMOVE(&ns->imp_open_zones, zone, entry);
> > +            break;
> > +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> > +            QTAILQ_REMOVE(&ns->exp_open_zones, zone, entry);
> > +            break;
> > +        case NVME_ZONE_STATE_CLOSED:
> > +            /* fall through */
> > +        default:
> > +            continue;
> > +        }
> > +
> > +        if (zone->d.wp == zone->d.zslba) {
> > +            set_state = NVME_ZONE_STATE_EMPTY;
> > +        } else {
> > +            set_state = NVME_ZONE_STATE_CLOSED;
> > +        }
> > +
> > +        switch (set_state) {
> > +        case NVME_ZONE_STATE_CLOSED:
> > +            trace_pci_nvme_clear_ns_close(nvme_get_zone_state(zone),
> > +                                          zone->d.zslba);
> > +            QTAILQ_INSERT_TAIL(&ns->closed_zones, zone, entry);
> > +            break;
> > +        case NVME_ZONE_STATE_EMPTY:
> > +            trace_pci_nvme_clear_ns_reset(nvme_get_zone_state(zone),
> > +                                          zone->d.zslba);
> > +            break;
> > +        case NVME_ZONE_STATE_FULL:
> > +            trace_pci_nvme_clear_ns_full(nvme_get_zone_state(zone),
> > +                                         zone->d.zslba);
> > +            zone->d.wp = nvme_zone_wr_boundary(zone);
> > +            QTAILQ_INSERT_TAIL(&ns->full_zones, zone, entry);
> > +        }
> 
> No need for the switch here - just add to the closed list in the
> conditional.

The switch becomes handy later in the series, particularly after adding
descriptor extensions. For easier reviewing, it makes sense to add it from
the beginning even though it is rudimentary at this point.

> 
> The NVME_ZONE_STATE_FULL case is unreachable.

Indeed. This should be introduced in the next patch.

Now, I've looked at this code again and the active/open counting in this
function ends up to be not quite right, I am fixing it.

> 
> > +
> > +        zone->w_ptr = zone->d.wp;
> > +        nvme_set_zone_state(zone, set_state);
> > +    }
> > +}

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2020-10-21 23:21 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-19  2:17 [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 01/11] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
2020-10-19 19:22   ` Keith Busch
2020-10-19 20:16   ` Klaus Jensen
2020-10-20 23:04     ` Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 02/11] hw/block/nvme: Generate namespace UUIDs Dmitry Fomichev
2020-10-19 19:24   ` Keith Busch
2020-10-19 19:30   ` Klaus Jensen
2020-10-19  2:17 ` [PATCH v7 03/11] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
2020-10-19 19:51   ` Keith Busch
2020-10-19 20:53   ` Klaus Jensen
2020-10-21  1:50     ` Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 04/11] hw/block/nvme: Support allocated CNS command variants Dmitry Fomichev
2020-10-19 20:07   ` Keith Busch
2020-10-20  8:21   ` Klaus Jensen
2020-10-20 23:09     ` Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
2020-10-19  9:50   ` Klaus Jensen
2020-10-19 15:55     ` Klaus Jensen
2020-10-19 12:33   ` Klaus Jensen
2020-10-20 11:08   ` Klaus Jensen
2020-10-21 10:26   ` Klaus Jensen
2020-10-21 23:19     ` Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 06/11] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 07/11] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 08/11] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
2020-10-19 11:42   ` Klaus Jensen
2020-10-20 23:01     ` Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 09/11] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 10/11] hw/block/nvme: Separate read and write handlers Dmitry Fomichev
2020-10-20  8:28   ` Klaus Jensen
2020-10-20 12:36     ` Keith Busch
2020-10-20 23:05       ` Dmitry Fomichev
2020-10-19  2:17 ` [PATCH v7 11/11] hw/block/nvme: Merge nvme_write_zeroes() with nvme_write() Dmitry Fomichev
2020-10-20  8:29   ` Klaus Jensen
2020-10-19  7:32 ` [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Niklas Cassel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.