* [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
From: Dmitry Fomichev @ 2020-06-17 21:33 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

v2: rebased on top of block-next/block branch

The Zoned Namespace (ZNS) Command Set is a newly introduced command set
published by the NVM Express, Inc. organization as TP 4053. The main
design goals of ZNS are to give hardware designers the means to reduce
NVMe controller complexity and to achieve better I/O latency and
throughput. SSDs that implement this interface are commonly known as
ZNS SSDs.

This command set implements a zoned storage model, similar to ZAC/ZBC.
As such, Linux already provides support for it, allowing one to perform
the majority of tasks needed for managing ZNS SSDs.

The Zoned Namespace Command Set relies on another TP, known as
Namespace Types (NVMe TP 4056), which introduces support for having
multiple command sets per namespace.

Both the ZNS and Namespace Types specifications can be downloaded from
the following link:

https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip

This patch series adds Namespace Types support and zoned namespace
emulation capability to the existing NVMe PCI device emulation.

The patchset is organized as follows:

The first several patches are preparatory and are added to allow for
an easier review of the subsequent commits. The group of patches that
follows adds NS Types support with only the NVM Command Set being
available. Finally, the last group of commits adds the definitions and
code needed to support the Zoned Namespace Command Set.
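
As a concrete illustration, once the whole series is applied, a zoned
namespace could be set up with an invocation along these lines (the
option names mirror the NvmeParams fields added later in the series;
treat the exact syntax as illustrative and defer to the usage text
documented in the final patch):

    qemu-system-x86_64 ... \
        -drive file=zns.img,id=nvme-drive,format=raw,if=none \
        -device nvme,drive=nvme-drive,serial=dev1,zoned=true,zone_size=128,zone_capacity=128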

Based-on: <20200609205944.3549240-1-eblake@redhat.com>

Ajay Joshi (1):
  hw/block/nvme: Define 64 bit cqe.result

Dmitry Fomichev (15):
  hw/block/nvme: Move NvmeRequest has_sg field to a bit flag
  hw/block/nvme: Clean up unused AER definitions
  hw/block/nvme: Add Commands Supported and Effects log
  hw/block/nvme: Define trace events related to NS Types
  hw/block/nvme: Make Zoned NS Command Set definitions
  hw/block/nvme: Define Zoned NS Command Set trace events
  hw/block/nvme: Support Zoned Namespace Command Set
  hw/block/nvme: Introduce max active and open zone limits
  hw/block/nvme: Simulate Zone Active excursions
  hw/block/nvme: Set Finish/Reset Zone Recommended attributes
  hw/block/nvme: Generate zone AENs
  hw/block/nvme: Support Zone Descriptor Extensions
  hw/block/nvme: Add injection of Offline/Read-Only zones
  hw/block/nvme: Use zone metadata file for persistence
  hw/block/nvme: Document zoned parameters in usage text

Niklas Cassel (2):
  hw/block/nvme: Introduce the Namespace Types definitions
  hw/block/nvme: Add support for Namespace Types

 block/nvme.c          |    2 +-
 block/trace-events    |    2 +-
 hw/block/nvme.c       | 2316 ++++++++++++++++++++++++++++++++++++++++-
 hw/block/nvme.h       |  228 +++-
 hw/block/trace-events |   56 +
 include/block/nvme.h  |  282 ++++-
 6 files changed, 2820 insertions(+), 66 deletions(-)

-- 
2.21.0




* [PATCH v2 01/18] hw/block/nvme: Move NvmeRequest has_sg field to a bit flag
From: Dmitry Fomichev @ 2020-06-17 21:33 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

In addition to the existing has_sg flag, a few more Boolean
NvmeRequest flags are going to be introduced in subsequent patches.
Convert the "has_sg" field into a "flags" field and define an
NvmeRequestFlags enum for the individual flag values.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 8 +++-----
 hw/block/nvme.h | 6 +++++-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1aee042d4c..3ed9f3d321 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -350,7 +350,7 @@ static void nvme_rw_cb(void *opaque, int ret)
         block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
         req->status = NVME_INTERNAL_DEV_ERROR;
     }
-    if (req->has_sg) {
+    if (req->flags & NVME_REQ_FLG_HAS_SG) {
         qemu_sglist_destroy(&req->qsg);
     }
     nvme_enqueue_req_completion(cq, req);
@@ -359,7 +359,6 @@ static void nvme_rw_cb(void *opaque, int ret)
 static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
     NvmeRequest *req)
 {
-    req->has_sg = false;
     block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
          BLOCK_ACCT_FLUSH);
     req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
@@ -383,7 +382,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
-    req->has_sg = false;
     block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
                      BLOCK_ACCT_WRITE);
     req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
@@ -422,14 +420,13 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
 
     dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
     if (req->qsg.nsg > 0) {
-        req->has_sg = true;
+        req->flags |= NVME_REQ_FLG_HAS_SG;
         req->aiocb = is_write ?
             dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
                           nvme_rw_cb, req) :
             dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
                          nvme_rw_cb, req);
     } else {
-        req->has_sg = false;
         req->aiocb = is_write ?
             blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
                             req) :
@@ -917,6 +914,7 @@ static void nvme_process_sq(void *opaque)
         QTAILQ_REMOVE(&sq->req_list, req, entry);
         QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
         memset(&req->cqe, 0, sizeof(req->cqe));
+        req->flags = 0;
         req->cqe.cid = cmd.cid;
 
         status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 1d30c0bca2..0460cc0e62 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -16,11 +16,15 @@ typedef struct NvmeAsyncEvent {
     NvmeAerResult result;
 } NvmeAsyncEvent;
 
+enum NvmeRequestFlags {
+    NVME_REQ_FLG_HAS_SG   = 1 << 0,
+};
+
 typedef struct NvmeRequest {
     struct NvmeSQueue       *sq;
     BlockAIOCB              *aiocb;
     uint16_t                status;
-    bool                    has_sg;
+    uint16_t                flags;
     NvmeCqe                 cqe;
     BlockAcctCookie         acct;
     QEMUSGList              qsg;
-- 
2.21.0




* [PATCH v2 02/18] hw/block/nvme: Define 64 bit cqe.result
From: Dmitry Fomichev @ 2020-06-17 21:33 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

From: Ajay Joshi <ajay.joshi@wdc.com>

A new write command, Zone Append, is added as a part of the Zoned
Namespace Command Set. Upon successful completion of this command,
the controller returns the start LBA of the performed write operation
in the cqe.result field. Therefore, the maximum size of this field
needs to be increased from 32 to 64 bits, consuming the reserved
32-bit field that follows the result in the CQE struct. Since the
existing commands are expected to return a 32-bit LE value, two
separate variables, result32 and result64, are now kept in a union.
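
As an illustration, a later Zone Append handler could report the start
LBA of the performed write through the new 64-bit member, while the
existing 32-bit users stay untouched (a sketch; "slba" is a placeholder
variable, not code from this patch):

    /* hypothetical Zone Append completion path */
    req->cqe.result64 = cpu_to_le64(slba);

    /* existing completions keep using the 32-bit view of the union */
    req->cqe.result32 = cpu_to_le32(result);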

Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 block/nvme.c         | 2 +-
 block/trace-events   | 2 +-
 hw/block/nvme.c      | 6 +++---
 include/block/nvme.h | 6 ++++--
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index eb2f54dd9d..ca245ec574 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -287,7 +287,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
 {
     uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
     if (status) {
-        trace_nvme_error(le32_to_cpu(c->result),
+        trace_nvme_error(le64_to_cpu(c->result64),
                          le16_to_cpu(c->sq_head),
                          le16_to_cpu(c->sq_id),
                          le16_to_cpu(c->cid),
diff --git a/block/trace-events b/block/trace-events
index 29dff8881c..05c1393943 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -156,7 +156,7 @@ vxhs_get_creds(const char *cacert, const char *client_key, const char *client_ce
 # nvme.c
 nvme_kick(void *s, int queue) "s %p queue %d"
 nvme_dma_flush_queue_wait(void *s) "s %p"
-nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
+nvme_error(uint64_t cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %"PRIu64" sq_head %d sqid %d cid %d status 0x%x"
 nvme_process_completion(void *s, int index, int inflight) "s %p queue %d inflight %d"
 nvme_process_completion_queue_busy(void *s, int index) "s %p queue %d"
 nvme_complete_command(void *s, int index, int cid) "s %p queue %d cid %d"
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3ed9f3d321..a1bbc9acde 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -823,7 +823,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    req->cqe.result = result;
+    req->cqe.result32 = result;
     return NVME_SUCCESS;
 }
 
@@ -859,8 +859,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
                                     ((dw11 >> 16) & 0xFFFF) + 1,
                                     n->params.max_ioqpairs,
                                     n->params.max_ioqpairs);
-        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-                                      ((n->params.max_ioqpairs - 1) << 16));
+        req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
+                                        ((n->params.max_ioqpairs - 1) << 16));
         break;
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, cmd);
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 1720ee1d51..9c3a04dcd7 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -577,8 +577,10 @@ typedef struct NvmeAerResult {
 } NvmeAerResult;
 
 typedef struct NvmeCqe {
-    uint32_t    result;
-    uint32_t    rsvd;
+    union {
+        uint64_t     result64;
+        uint32_t     result32;
+    };
     uint16_t    sq_head;
     uint16_t    sq_id;
     uint16_t    cid;
-- 
2.21.0




* [PATCH v2 03/18] hw/block/nvme: Clean up unused AER definitions
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Remove the unused NvmeAerResult struct and the SMART-related async
event codes. All other event codes are now categorized by their type,
which avoids having to define all of these values in a single enum,
NvmeAsyncEventRequest, which is now removed.

Later commits in this series will define additional values in some
of these enums. No functional change.
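
With the event codes split by type, Dword 0 of an AER completion can
be composed directly from these enums. Per the NVMe spec, bits 2:0
carry the event type, bits 15:8 the event information and bits 23:16
the log page identifier; a hypothetical helper (not part of this
patch) could look like:

    static inline uint32_t nvme_aer_result(uint8_t ev_type, uint8_t ev_info,
                                           uint8_t log_page)
    {
        return (log_page << 16) | (ev_info << 8) | ev_type;
    }

    /* e.g. a namespace-changed notice pointing at the Changed NS List log */
    uint32_t dw0 = nvme_aer_result(NVME_AER_TYPE_NOTICE,
                                   NVME_AER_NOTICE_NS_CHANGED, 0x04);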

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.h      |  1 -
 include/block/nvme.h | 43 ++++++++++++++++++++++---------------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 0460cc0e62..4f0dac39ae 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -13,7 +13,6 @@ typedef struct NvmeParams {
 
 typedef struct NvmeAsyncEvent {
     QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
-    NvmeAerResult result;
 } NvmeAsyncEvent;
 
 enum NvmeRequestFlags {
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 9c3a04dcd7..3099df99eb 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -553,28 +553,30 @@ typedef struct NvmeDsmRange {
     uint64_t    slba;
 } NvmeDsmRange;
 
-enum NvmeAsyncEventRequest {
-    NVME_AER_TYPE_ERROR                     = 0,
-    NVME_AER_TYPE_SMART                     = 1,
-    NVME_AER_TYPE_IO_SPECIFIC               = 6,
-    NVME_AER_TYPE_VENDOR_SPECIFIC           = 7,
-    NVME_AER_INFO_ERR_INVALID_SQ            = 0,
-    NVME_AER_INFO_ERR_INVALID_DB            = 1,
-    NVME_AER_INFO_ERR_DIAG_FAIL             = 2,
-    NVME_AER_INFO_ERR_PERS_INTERNAL_ERR     = 3,
-    NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR    = 4,
-    NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR       = 5,
-    NVME_AER_INFO_SMART_RELIABILITY         = 0,
-    NVME_AER_INFO_SMART_TEMP_THRESH         = 1,
-    NVME_AER_INFO_SMART_SPARE_THRESH        = 2,
+enum NvmeAsyncEventType {
+    NVME_AER_TYPE_ERROR                     = 0x00,
+    NVME_AER_TYPE_SMART                     = 0x01,
+    NVME_AER_TYPE_NOTICE                    = 0x02,
+    NVME_AER_TYPE_CMDSET_SPECIFIC           = 0x06,
+    NVME_AER_TYPE_VENDOR_SPECIFIC           = 0x07,
 };
 
-typedef struct NvmeAerResult {
-    uint8_t event_type;
-    uint8_t event_info;
-    uint8_t log_page;
-    uint8_t resv;
-} NvmeAerResult;
+enum NvmeAsyncErrorInfo {
+    NVME_AER_ERR_INVALID_SQ                 = 0x00,
+    NVME_AER_ERR_INVALID_DB                 = 0x01,
+    NVME_AER_ERR_DIAG_FAIL                  = 0x02,
+    NVME_AER_ERR_PERS_INTERNAL_ERR          = 0x03,
+    NVME_AER_ERR_TRANS_INTERNAL_ERR         = 0x04,
+    NVME_AER_ERR_FW_IMG_LOAD_ERR            = 0x05,
+};
+
+enum NvmeAsyncNoticeInfo {
+    NVME_AER_NOTICE_NS_CHANGED              = 0x00,
+};
+
+enum NvmeAsyncEventCfg {
+    NVME_AEN_CFG_NS_ATTR                    = 1 << 8,
+};
 
 typedef struct NvmeCqe {
     union {
@@ -881,7 +883,6 @@ enum NvmeIdNsDps {
 
 static inline void _nvme_check_size(void)
 {
-    QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64);
-- 
2.21.0




* [PATCH v2 04/18] hw/block/nvme: Add Commands Supported and Effects log
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Implementing this log page is necessary to allow the host to check
for Zone Append command support in the Zoned Namespace Command Set.

This commit adds the code to report this log page for the NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.
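
For reference, the transfer length of a Get Log Page command is
carried as a 0's based dword count split across NUMDL (CDW10 bits
31:16) and NUMDU (CDW11 bits 15:0), which is exactly what the new
nvme_get_log_page() below decodes. A host-side sketch of requesting
the whole 4096-byte effects log (structure and constant names are the
ones introduced by this series):

    NvmeCmd cmd = { .opcode = NVME_ADM_CMD_GET_LOG_PAGE };
    uint32_t numd = sizeof(NvmeEffectsLog) / 4 - 1;    /* 0's based */

    cmd.cdw10 = cpu_to_le32(((numd & 0xffff) << 16) | NVME_LOG_CMD_EFFECTS);
    cmd.cdw11 = cpu_to_le32(numd >> 16);
    /* cdw12/cdw13 (the log page offset) must stay 0 for this log page */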

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c       | 62 +++++++++++++++++++++++++++++++++++++++++++
 hw/block/trace-events |  4 +++
 include/block/nvme.h  | 18 +++++++++++++
 3 files changed, 84 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a1bbc9acde..03b8deee85 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -871,6 +871,66 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
+    uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len)
+{
+    NvmeEffectsLog cmd_eff_log = {};
+    uint32_t *iocs = cmd_eff_log.iocs;
+
+    trace_pci_nvme_cmd_supp_and_effects_log_read();
+
+    if (ofs != 0) {
+        trace_pci_nvme_err_invalid_effects_log_offset(ofs);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+    if (len != sizeof(cmd_eff_log)) {
+        trace_pci_nvme_err_invalid_effects_log_len(len);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    iocs[NVME_ADM_CMD_DELETE_SQ] = NVME_CMD_EFFECTS_CSUPP;
+    iocs[NVME_ADM_CMD_CREATE_SQ] = NVME_CMD_EFFECTS_CSUPP;
+    iocs[NVME_ADM_CMD_DELETE_CQ] = NVME_CMD_EFFECTS_CSUPP;
+    iocs[NVME_ADM_CMD_CREATE_CQ] = NVME_CMD_EFFECTS_CSUPP;
+    iocs[NVME_ADM_CMD_IDENTIFY] = NVME_CMD_EFFECTS_CSUPP;
+    iocs[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
+    iocs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
+    iocs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
+
+    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_WRITE_ZEROS] = NVME_CMD_EFFECTS_CSUPP |
+                                 NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+
+    return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log, len, prp1, prp2);
+}
+
+static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
+{
+    uint64_t prp1 = le64_to_cpu(cmd->prp1);
+    uint64_t prp2 = le64_to_cpu(cmd->prp2);
+    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
+    uint64_t dw12 = le32_to_cpu(cmd->cdw12);
+    uint64_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint64_t ofs = (dw13 << 32) | dw12;
+    uint32_t numdl, numdu, len;
+    uint16_t lid = dw10 & 0xff;
+
+    numdl = dw10 >> 16;
+    numdu = dw11 & 0xffff;
+    len = (((numdu << 16) | numdl) + 1) << 2;
+
+    switch (lid) {
+    case NVME_LOG_CMD_EFFECTS:
+        return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len);
+    }
+
+    trace_pci_nvme_unsupported_log_page(lid);
+    return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
     switch (cmd->opcode) {
@@ -888,6 +948,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return nvme_set_feature(n, cmd, req);
     case NVME_ADM_CMD_GET_FEATURES:
         return nvme_get_feature(n, cmd, req);
+    case NVME_ADM_CMD_GET_LOG_PAGE:
+        return nvme_get_log_page(n, cmd);
     default:
         trace_pci_nvme_err_invalid_admin_opc(cmd->opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 958fcc5508..423d491e27 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -58,6 +58,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
 pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
+pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
 
 # nvme traces for error conditions
 pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
@@ -69,6 +70,8 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not w
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
+pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and effects log size is 4096, got %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
@@ -123,6 +126,7 @@ pci_nvme_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for
 pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
+pci_nvme_unsupported_log_page(uint16_t lid) "unsupported log page 0x%"PRIx16""
 
 # xen-block.c
 xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 3099df99eb..6a58bac0c2 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -691,10 +691,27 @@ enum NvmeSmartWarn {
     NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
 };
 
+typedef struct NvmeEffectsLog {
+    uint32_t    acs[256];
+    uint32_t    iocs[256];
+    uint8_t     resv[2048];
+} NvmeEffectsLog;
+
+enum {
+    NVME_CMD_EFFECTS_CSUPP            = 1 << 0,
+    NVME_CMD_EFFECTS_LBCC             = 1 << 1,
+    NVME_CMD_EFFECTS_NCC              = 1 << 2,
+    NVME_CMD_EFFECTS_NIC              = 1 << 3,
+    NVME_CMD_EFFECTS_CCC              = 1 << 4,
+    NVME_CMD_EFFECTS_CSE_MASK         = 3 << 16,
+    NVME_CMD_EFFECTS_UUID_SEL         = 1 << 19,
+};
+
 enum LogIdentifier {
     NVME_LOG_ERROR_INFO     = 0x01,
     NVME_LOG_SMART_INFO     = 0x02,
     NVME_LOG_FW_SLOT_INFO   = 0x03,
+    NVME_LOG_CMD_EFFECTS    = 0x05,
 };
 
 typedef struct NvmePSD {
@@ -898,5 +915,6 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
 }
 #endif
-- 
2.21.0




* [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

Define the structures and constants required to implement
Namespace Types support.
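
As a quick example of how these definitions compose, a host that wants
all supported I/O command sets enabled programs CC.CSS to 110b before
setting CC.EN, which the new macros express as follows (a sketch, not
code contained in this patch):

    uint32_t cc = 0;

    NVME_SET_CC_CSS(cc, CSS_ALL_NSTYPES);  /* 110b: all supported command sets */
    NVME_SET_CC_EN(cc, 1);                 /* then enable the controller */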

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.h      |  3 ++
 include/block/nvme.h | 75 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 4f0dac39ae..4fd155c409 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -63,6 +63,9 @@ typedef struct NvmeCQueue {
 
 typedef struct NvmeNamespace {
     NvmeIdNs        id_ns;
+    uint32_t        nsid;
+    uint8_t         csi;
+    QemuUUID        uuid;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 6a58bac0c2..5a1e5e137c 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -50,6 +50,11 @@ enum NvmeCapMask {
     CAP_PMR_MASK       = 0x1,
 };
 
+enum NvmeCapCssBits {
+    CAP_CSS_NVM        = 0x01,
+    CAP_CSS_CSI_SUPP   = 0x40,
+};
+
 #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
 #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)    & CAP_CQR_MASK)
 #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)    & CAP_AMS_MASK)
@@ -101,6 +106,12 @@ enum NvmeCcMask {
     CC_IOCQES_MASK  = 0xf,
 };
 
+enum NvmeCcCss {
+    CSS_NVM_ONLY        = 0,
+    CSS_ALL_NSTYPES     = 6,
+    CSS_ADMIN_ONLY      = 7,
+};
+
 #define NVME_CC_EN(cc)     ((cc >> CC_EN_SHIFT)     & CC_EN_MASK)
 #define NVME_CC_CSS(cc)    ((cc >> CC_CSS_SHIFT)    & CC_CSS_MASK)
 #define NVME_CC_MPS(cc)    ((cc >> CC_MPS_SHIFT)    & CC_MPS_MASK)
@@ -109,6 +120,21 @@ enum NvmeCcMask {
 #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
 #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
 
+#define NVME_SET_CC_EN(cc, val)     \
+    (cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
+#define NVME_SET_CC_CSS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
+#define NVME_SET_CC_MPS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
+#define NVME_SET_CC_AMS(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
+#define NVME_SET_CC_SHN(cc, val)    \
+    (cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
+#define NVME_SET_CC_IOSQES(cc, val) \
+    (cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
+#define NVME_SET_CC_IOCQES(cc, val) \
+    (cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
+
 enum NvmeCstsShift {
     CSTS_RDY_SHIFT      = 0,
     CSTS_CFS_SHIFT      = 1,
@@ -482,10 +508,41 @@ typedef struct NvmeIdentify {
     uint64_t    rsvd2[2];
     uint64_t    prp1;
     uint64_t    prp2;
-    uint32_t    cns;
-    uint32_t    rsvd11[5];
+    uint8_t     cns;
+    uint8_t     rsvd4;
+    uint16_t    ctrlid;
+    uint16_t    nvmsetid;
+    uint8_t     rsvd3;
+    uint8_t     csi;
+    uint32_t    rsvd12[4];
 } NvmeIdentify;
 
+typedef struct NvmeNsIdDesc {
+    uint8_t     nidt;
+    uint8_t     nidl;
+    uint16_t    rsvd2;
+} NvmeNsIdDesc;
+
+enum NvmeNidType {
+    NVME_NIDT_EUI64             = 0x01,
+    NVME_NIDT_NGUID             = 0x02,
+    NVME_NIDT_UUID              = 0x03,
+    NVME_NIDT_CSI               = 0x04,
+};
+
+enum NvmeNidLength {
+    NVME_NIDL_EUI64             = 8,
+    NVME_NIDL_NGUID             = 16,
+    NVME_NIDL_UUID              = 16,
+    NVME_NIDL_CSI               = 1,
+};
+
+enum NvmeCsi {
+    NVME_CSI_NVM                = 0x00,
+};
+
+#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
+
 typedef struct NvmeRwCmd {
     uint8_t     opcode;
     uint8_t     flags;
@@ -603,6 +660,7 @@ enum NvmeStatusCodes {
     NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
     NVME_INVALID_NSID           = 0x000b,
     NVME_CMD_SEQ_ERROR          = 0x000c,
+    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
@@ -729,9 +787,14 @@ typedef struct NvmePSD {
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
 enum {
-    NVME_ID_CNS_NS             = 0x0,
-    NVME_ID_CNS_CTRL           = 0x1,
-    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
+    NVME_ID_CNS_NS                = 0x0,
+    NVME_ID_CNS_CTRL              = 0x1,
+    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x2,
+    NVME_ID_CNS_NS_DESC_LIST      = 0x03,
+    NVME_ID_CNS_CS_NS             = 0x05,
+    NVME_ID_CNS_CS_CTRL           = 0x06,
+    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
+    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
 };
 
 typedef struct NvmeIdCtrl {
@@ -825,6 +888,7 @@ enum NvmeFeatureIds {
     NVME_WRITE_ATOMICITY            = 0xa,
     NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
     NVME_TIMESTAMP                  = 0xe,
+    NVME_COMMAND_SET_PROFILE        = 0x19,
     NVME_SOFTWARE_PROGRESS_MARKER   = 0x80
 };
 
@@ -914,6 +978,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeNsIdDesc) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
 }
-- 
2.21.0




* [PATCH v2 06/18] hw/block/nvme: Define trace events related to NS Types
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Define a few trace events that are relevant to the implementation of
Namespace Types (NVMe TP 4056).

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/trace-events | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 423d491e27..3f3323fe38 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -39,8 +39,13 @@ pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size,
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
 pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, cqid=%"PRIu16""
 pci_nvme_identify_ctrl(void) "identify controller"
+pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
+pci_nvme_identify_ns_csi(uint16_t ns, uint8_t csi) "identify namespace, nsid=%"PRIu16", csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
+pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "identify namespace list, nsid=%"PRIu16", csi=0x%"PRIx8""
+pci_nvme_list_ns_descriptors(void) "identify namespace descriptors"
+pci_nvme_identify_cmd_set(void) "identify i/o command set"
 pci_nvme_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
 pci_nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
 pci_nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
@@ -59,6 +64,8 @@ pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
+pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
+pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
 
 # nvme traces for error conditions
 pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
@@ -72,6 +79,9 @@ pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and effects log size is 4096, got %"PRIu32""
+pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
+pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
+pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
@@ -127,6 +137,7 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion qu
 pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
 pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
 pci_nvme_unsupported_log_page(uint16_t lid) "unsupported log page 0x%"PRIx16""
+pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
 
 # xen-block.c
 xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
-- 
2.21.0




* [PATCH v2 07/18] hw/block/nvme: Add support for Namespace Types
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

From: Niklas Cassel <niklas.cassel@wdc.com>

Namespace Types introduce the "I/O Command Sets" concept, which
allows the host to retrieve the command sets associated with a
namespace. Introduce support for this concept and enable detection of
the NVM Command Set.
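
Concretely, the new Identify CNS values added in the previous patch
let the host discover command set support in two steps, sketched below
with the structures from this series (illustrative host-side code, not
part of this patch):

    NvmeIdentify c = { .opcode = NVME_ADM_CMD_IDENTIFY };

    c.cns = NVME_ID_CNS_IO_COMMAND_SET;   /* 0x1c: supported CS combinations */
    /* issue the command, pick a combination, then per active namespace: */
    c.cns = NVME_ID_CNS_NS_DESC_LIST;     /* 0x03: descriptors, incl. the */
    c.nsid = cpu_to_le32(1);              /* NVME_NIDT_CSI descriptor */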

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 210 ++++++++++++++++++++++++++++++++++++++++++++++--
 hw/block/nvme.h |  11 +++
 2 files changed, 216 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 03b8deee85..453f4747a5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -686,6 +686,26 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
         prp1, prp2);
 }
 
+static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeIdentify *c)
+{
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+    uint32_t *list;
+    uint16_t ret;
+
+    trace_pci_nvme_identify_ctrl_csi(c->csi);
+
+    if (c->csi == NVME_CSI_NVM) {
+        list = g_malloc0(data_len);
+        ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
+        g_free(list);
+        return ret;
+    } else {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+}
+
 static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
 {
     NvmeNamespace *ns;
@@ -701,11 +721,42 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
     }
 
     ns = &n->namespaces[nsid - 1];
+    assert(nsid == ns->nsid);
 
     return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
         prp1, prp2);
 }
 
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeIdentify *c)
+{
+    NvmeNamespace *ns;
+    uint32_t nsid = le32_to_cpu(c->nsid);
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+    uint32_t *list;
+    uint16_t ret;
+
+    trace_pci_nvme_identify_ns_csi(nsid, c->csi);
+
+    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+        trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
+        return NVME_INVALID_NSID | NVME_DNR;
+    }
+
+    ns = &n->namespaces[nsid - 1];
+    assert(nsid == ns->nsid);
+
+    if (c->csi == NVME_CSI_NVM) {
+        list = g_malloc0(data_len);
+        ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
+        g_free(list);
+        return ret;
+    } else {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+}
+
 static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
 {
     static const int data_len = NVME_IDENTIFY_DATA_SIZE;
@@ -733,6 +784,99 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
     return ret;
 }
 
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeIdentify *c)
+{
+    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+    uint32_t min_nsid = le32_to_cpu(c->nsid);
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+    uint32_t *list;
+    uint16_t ret;
+    int i, j = 0;
+
+    trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
+
+    if (c->csi != NVME_CSI_NVM) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    list = g_malloc0(data_len);
+    for (i = 0; i < n->num_namespaces; i++) {
+        if (i < min_nsid) {
+            continue;
+        }
+        list[j++] = cpu_to_le32(i + 1);
+        if (j == data_len / sizeof(uint32_t)) {
+            break;
+        }
+    }
+    ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
+    g_free(list);
+    return ret;
+}
+
+static uint16_t nvme_list_ns_descriptors(NvmeCtrl *n, NvmeIdentify *c)
+{
+    NvmeNamespace *ns;
+    uint32_t nsid = le32_to_cpu(c->nsid);
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+    void *buf_ptr;
+    NvmeNsIdDesc *desc;
+    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+    uint8_t *buf;
+    uint16_t status;
+
+    trace_pci_nvme_list_ns_descriptors();
+
+    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+        trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
+        return NVME_INVALID_NSID | NVME_DNR;
+    }
+
+    ns = &n->namespaces[nsid - 1];
+    assert(nsid == ns->nsid);
+
+    buf = g_malloc0(data_len);
+    buf_ptr = buf;
+
+    desc = buf_ptr;
+    desc->nidt = NVME_NIDT_UUID;
+    desc->nidl = NVME_NIDL_UUID;
+    buf_ptr += sizeof(*desc);
+    memcpy(buf_ptr, ns->uuid.data, NVME_NIDL_UUID);
+    buf_ptr += NVME_NIDL_UUID;
+
+    desc = buf_ptr;
+    desc->nidt = NVME_NIDT_CSI;
+    desc->nidl = NVME_NIDL_CSI;
+    buf_ptr += sizeof(*desc);
+    *(uint8_t *)buf_ptr = NVME_CSI_NVM;
+
+    status = nvme_dma_read_prp(n, buf, data_len, prp1, prp2);
+    g_free(buf);
+    return status;
+}
+
+static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeIdentify *c)
+{
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
+    uint32_t *list;
+    uint8_t *ptr;
+    uint16_t status;
+
+    trace_pci_nvme_identify_cmd_set();
+
+    list = g_malloc0(data_len);
+    ptr = (uint8_t *)list;
+    NVME_SET_CSI(*ptr, NVME_CSI_NVM);
+    status = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
+    g_free(list);
+    return status;
+}
+
 static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
 {
     NvmeIdentify *c = (NvmeIdentify *)cmd;
@@ -740,10 +884,20 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
     switch (le32_to_cpu(c->cns)) {
     case NVME_ID_CNS_NS:
         return nvme_identify_ns(n, c);
+    case NVME_ID_CNS_CS_NS:
+        return nvme_identify_ns_csi(n, c);
     case NVME_ID_CNS_CTRL:
         return nvme_identify_ctrl(n, c);
+    case NVME_ID_CNS_CS_CTRL:
+        return nvme_identify_ctrl_csi(n, c);
     case NVME_ID_CNS_NS_ACTIVE_LIST:
         return nvme_identify_nslist(n, c);
+    case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
+        return nvme_identify_nslist_csi(n, c);
+    case NVME_ID_CNS_NS_DESC_LIST:
+        return nvme_list_ns_descriptors(n, c);
+    case NVME_ID_CNS_IO_COMMAND_SET:
+        return nvme_identify_cmd_set(n, c);
     default:
         trace_pci_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -818,6 +972,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_get_feature_timestamp(n, cmd);
+    case NVME_COMMAND_SET_PROFILE:
+        result = 0;
+        break;
     default:
         trace_pci_nvme_err_invalid_getfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -864,6 +1021,15 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, cmd);
+
+    case NVME_COMMAND_SET_PROFILE:
+        if (dw11 & 0x1ff) {
+            trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
+            return NVME_CMD_SET_CMB_REJECTED | NVME_DNR;
+        }
+        break;
+
     default:
         trace_pci_nvme_err_invalid_setfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1149,6 +1315,29 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
         break;
     case 0x14:  /* CC */
         trace_pci_nvme_mmio_cfg(data & 0xffffffff);
+
+        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
+            if (NVME_CC_EN(n->bar.cc)) {
+                NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
+                               "changing selected command set when enabled");
+                break;
+            }
+            switch (NVME_CC_CSS(data)) {
+            case CSS_NVM_ONLY:
+                trace_pci_nvme_css_nvm_cset_selected_by_host(data & 0xffffffff);
+                break;
+            case CSS_ALL_NSTYPES:
+                NVME_SET_CC_CSS(n->bar.cc, CSS_ALL_NSTYPES);
+                trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
+                break;
+            case CSS_ADMIN_ONLY:
+                break;
+            default:
+                NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
+                               "unknown value in CC.CSS field");
+            }
+        }
+
         /* Windows first sends data, then sends enable bit */
         if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
             !NVME_CC_SHN(data) && !NVME_CC_SHN(n->bar.cc))
@@ -1496,6 +1685,7 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 {
     int64_t bs_size;
     NvmeIdNs *id_ns = &ns->id_ns;
+    int lba_index;
 
     bs_size = blk_getlength(n->conf.blk);
     if (bs_size < 0) {
@@ -1505,7 +1695,10 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 
     n->ns_size = bs_size;
 
-    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+    ns->csi = NVME_CSI_NVM;
+    qemu_uuid_generate(&ns->uuid); /* TODO make UUIDs persistent */
+    lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
+    id_ns->lbaf[lba_index].ds = nvme_ilog2(n->conf.logical_block_size);
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
 
     /* no thin provisioning */
@@ -1616,7 +1809,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
     id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
     strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
-    strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
+    strpadcpy((char *)id->fr, sizeof(id->fr), "2.0", ' ');
     strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
     id->rab = 6;
     id->ieee[0] = 0x00;
@@ -1640,7 +1833,11 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
     NVME_CAP_SET_CQR(n->bar.cap, 1);
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
-    NVME_CAP_SET_CSS(n->bar.cap, 1);
+    /*
+     * The device now always supports NS Types, but all commands that
+     * support the CSI field will only handle the NVM Command Set.
+     */
+    NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
     n->bar.vs = 0x00010200;
@@ -1650,6 +1847,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
     NvmeCtrl *n = NVME(pci_dev);
+    NvmeNamespace *ns;
     Error *local_err = NULL;
 
     int i;
@@ -1675,8 +1873,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 
     nvme_init_ctrl(n, pci_dev);
 
-    for (i = 0; i < n->num_namespaces; i++) {
-        nvme_init_namespace(n, &n->namespaces[i], &local_err);
+    ns = n->namespaces;
+    for (i = 0; i < n->num_namespaces; i++, ns++) {
+        ns->nsid = i + 1;
+        nvme_init_namespace(n, ns, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 4fd155c409..0d29f75475 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -121,4 +121,15 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
     return n->ns_size >> nvme_ns_lbads(ns);
 }
 
+static inline int nvme_ilog2(uint64_t i)
+{
+    int log = -1;
+
+    while (i) {
+        i >>= 1;
+        log++;
+    }
+    return log;
+}
+
 #endif /* HW_NVME_H */
-- 
2.21.0




* [PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Define the values and structures that are needed to support the Zoned
Namespace Command Set (NVMe TP 4053) in the PCI NVMe controller
emulator.

All new protocol definitions are located in include/block/nvme.h
and everything added that is specific to this implementation is kept
in hw/block/nvme.h.

In order to improve scalability, all open, closed and full zones
are organized in separate linked lists. Consequently, almost all
zone operations do not require scanning the entire zone array (which
can potentially be quite large); it is only necessary to enumerate
one or more zone lists. Zone lists are designed to be
position-independent as they can be persisted to the backing file
as a part of zone metadata. The NvmeZoneList struct defined in this
patch serves as the head of every zone list.

The NvmeZone structure encapsulates the NvmeZoneDescr struct defined
in the Zoned Command Set specification and adds a few more fields that
are internal to this implementation.
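
To make the position-independence concrete, here is a sketch of what
index-based tail insertion can look like (illustrative only; the
actual list manipulation helpers arrive in a later patch of the
series, and the "not in any list" encoding checked by
nvme_zone_not_in_list() is glossed over here):

    static void example_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl,
                                      NvmeZone *zone)
    {
        uint32_t idx = (uint32_t)(zone - ns->zone_array);

        if (zl->size == 0) {
            zl->head = idx;                        /* first element */
        } else {
            zone->prev = zl->tail;
            ns->zone_array[zl->tail].next = idx;   /* link via indices */
        }
        zone->next = NVME_ZONE_LIST_NIL;
        zl->tail = idx;
        zl->size++;
    }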

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.h      | 130 +++++++++++++++++++++++++++++++++++++++++++
 include/block/nvme.h | 119 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 248 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 0d29f75475..2c932b5e29 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -3,12 +3,22 @@
 
 #include "block/nvme.h"
 
+#define NVME_DEFAULT_ZONE_SIZE   128 /* MiB */
+#define NVME_DEFAULT_MAX_ZA_SIZE 128 /* KiB */
+
 typedef struct NvmeParams {
     char     *serial;
     uint32_t num_queues; /* deprecated since 5.1 */
     uint32_t max_ioqpairs;
     uint16_t msix_qsize;
     uint32_t cmb_size_mb;
+
+    bool        zoned;
+    bool        cross_zone_read;
+    uint8_t     fill_pattern;
+    uint32_t    zamds_bs;
+    uint64_t    zone_size;
+    uint64_t    zone_capacity;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -17,6 +27,8 @@ typedef struct NvmeAsyncEvent {
 
 enum NvmeRequestFlags {
     NVME_REQ_FLG_HAS_SG   = 1 << 0,
+    NVME_REQ_FLG_FILL     = 1 << 1,
+    NVME_REQ_FLG_APPEND   = 1 << 2,
 };
 
 typedef struct NvmeRequest {
@@ -24,6 +36,7 @@ typedef struct NvmeRequest {
     BlockAIOCB              *aiocb;
     uint16_t                status;
     uint16_t                flags;
+    uint64_t                fill_ofs;
     NvmeCqe                 cqe;
     BlockAcctCookie         acct;
     QEMUSGList              qsg;
@@ -61,11 +74,35 @@ typedef struct NvmeCQueue {
     QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
 
+typedef struct NvmeZone {
+    NvmeZoneDescr   d;
+    uint64_t        tstamp;
+    uint32_t        next;
+    uint32_t        prev;
+    uint8_t         rsvd80[8];
+} NvmeZone;
+
+#define NVME_ZONE_LIST_NIL    UINT_MAX
+
+typedef struct NvmeZoneList {
+    uint32_t        head;
+    uint32_t        tail;
+    uint32_t        size;
+    uint8_t         rsvd12[4];
+} NvmeZoneList;
+
 typedef struct NvmeNamespace {
     NvmeIdNs        id_ns;
     uint32_t        nsid;
     uint8_t         csi;
     QemuUUID        uuid;
+
+    NvmeIdNsZoned   *id_ns_zoned;
+    NvmeZone        *zone_array;
+    NvmeZoneList    *exp_open_zones;
+    NvmeZoneList    *imp_open_zones;
+    NvmeZoneList    *closed_zones;
+    NvmeZoneList    *full_zones;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
@@ -100,6 +137,7 @@ typedef struct NvmeCtrl {
     uint32_t    num_namespaces;
     uint32_t    max_q_ents;
     uint64_t    ns_size;
+
     uint8_t     *cmbuf;
     uint32_t    irq_status;
     uint64_t    host_timestamp;                 /* Timestamp sent by the host */
@@ -107,6 +145,12 @@ typedef struct NvmeCtrl {
 
     HostMemoryBackend *pmrdev;
 
+    int             zone_file_fd;
+    uint32_t        num_zones;
+    uint64_t        zone_size_bs;
+    uint64_t        zone_array_size;
+    uint8_t         zamds;
+
     NvmeNamespace   *namespaces;
     NvmeSQueue      **sq;
     NvmeCQueue      **cq;
@@ -121,6 +165,86 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
     return n->ns_size >> nvme_ns_lbads(ns);
 }
 
+static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
+{
+    return zone->d.zs >> 4;
+}
+
+static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState state)
+{
+    zone->d.zs = state << 4;
+}
+
+static inline uint64_t nvme_zone_rd_boundary(NvmeCtrl *n, NvmeZone *zone)
+{
+    return zone->d.zslba + n->params.zone_size;
+}
+
+static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
+{
+    return zone->d.zslba + zone->d.zcap;
+}
+
+static inline bool nvme_wp_is_valid(NvmeZone *zone)
+{
+    uint8_t st = nvme_get_zone_state(zone);
+
+    return st != NVME_ZONE_STATE_FULL &&
+           st != NVME_ZONE_STATE_READ_ONLY &&
+           st != NVME_ZONE_STATE_OFFLINE;
+}
+
+/*
+ * Initialize a zone list head.
+ */
+static inline void nvme_init_zone_list(NvmeZoneList *zl)
+{
+    zl->head = NVME_ZONE_LIST_NIL;
+    zl->tail = NVME_ZONE_LIST_NIL;
+    zl->size = 0;
+}
+
+/*
+ * Return the number of entries contained in a zone list.
+ */
+static inline uint32_t nvme_zone_list_size(NvmeZoneList *zl)
+{
+    return zl->size;
+}
+
+/*
+ * Check if the zone is not currently included in any zone list.
+ */
+static inline bool nvme_zone_not_in_list(NvmeZone *zone)
+{
+    return (bool)(zone->prev == 0 && zone->next == 0);
+}
+
+/*
+ * Return the zone at the head of the zone list or NULL if the list is empty.
+ */
+static inline NvmeZone *nvme_peek_zone_head(NvmeNamespace *ns, NvmeZoneList *zl)
+{
+    if (zl->head == NVME_ZONE_LIST_NIL) {
+        return NULL;
+    }
+    return &ns->zone_array[zl->head];
+}
+
+/*
+ * Return the next zone in the list.
+ */
+static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
+    NvmeZoneList *zl)
+{
+    assert(!nvme_zone_not_in_list(z));
+
+    if (z->next == NVME_ZONE_LIST_NIL) {
+        return NULL;
+    }
+    return &ns->zone_array[z->next];
+}
+
 static inline int nvme_ilog2(uint64_t i)
 {
     int log = -1;
@@ -132,4 +256,10 @@ static inline int nvme_ilog2(uint64_t i)
     return log;
 }
 
+static inline void _hw_nvme_check_size(void)
+{
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneList) != 16);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZone) != 88);
+}
+
 #endif /* HW_NVME_H */
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 5a1e5e137c..596c39162b 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -446,6 +446,9 @@ enum NvmeIoCommands {
     NVME_CMD_COMPARE            = 0x05,
     NVME_CMD_WRITE_ZEROS        = 0x08,
     NVME_CMD_DSM                = 0x09,
+    NVME_CMD_ZONE_MGMT_SEND     = 0x79,
+    NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
+    NVME_CMD_ZONE_APND          = 0x7d,
 };
 
 typedef struct NvmeDeleteQ {
@@ -539,6 +542,7 @@ enum NvmeNidLength {
 
 enum NvmeCsi {
     NVME_CSI_NVM                = 0x00,
+    NVME_CSI_ZONED              = 0x02,
 };
 
 #define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
@@ -661,6 +665,7 @@ enum NvmeStatusCodes {
     NVME_INVALID_NSID           = 0x000b,
     NVME_CMD_SEQ_ERROR          = 0x000c,
     NVME_CMD_SET_CMB_REJECTED   = 0x002b,
+    NVME_INVALID_CMD_SET        = 0x002c,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
@@ -684,6 +689,14 @@ enum NvmeStatusCodes {
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
+    NVME_ZONE_BOUNDARY_ERROR    = 0x01b8,
+    NVME_ZONE_FULL              = 0x01b9,
+    NVME_ZONE_READ_ONLY         = 0x01ba,
+    NVME_ZONE_OFFLINE           = 0x01bb,
+    NVME_ZONE_INVALID_WRITE     = 0x01bc,
+    NVME_ZONE_TOO_MANY_ACTIVE   = 0x01bd,
+    NVME_ZONE_TOO_MANY_OPEN     = 0x01be,
+    NVME_ZONE_INVAL_TRANSITION  = 0x01bf,
     NVME_WRITE_FAULT            = 0x0280,
     NVME_UNRECOVERED_READ       = 0x0281,
     NVME_E2E_GUARD_ERROR        = 0x0282,
@@ -807,7 +820,17 @@ typedef struct NvmeIdCtrl {
     uint8_t     ieee[3];
     uint8_t     cmic;
     uint8_t     mdts;
-    uint8_t     rsvd255[178];
+    uint16_t    cntlid;
+    uint32_t    ver;
+    uint32_t    rtd3r;
+    uint32_t    rtd3e;
+    uint32_t    oaes;
+    uint32_t    ctratt;
+    uint8_t     rsvd100[28];
+    uint16_t    crdt1;
+    uint16_t    crdt2;
+    uint16_t    crdt3;
+    uint8_t     rsvd134[122];
     uint16_t    oacs;
     uint8_t     acl;
     uint8_t     aerl;
@@ -832,6 +855,11 @@ typedef struct NvmeIdCtrl {
     uint8_t     vs[1024];
 } NvmeIdCtrl;
 
+typedef struct NvmeIdCtrlZoned {
+    uint8_t     zamds;
+    uint8_t     rsvd1[4095];
+} NvmeIdCtrlZoned;
+
 enum NvmeIdCtrlOacs {
     NVME_OACS_SECURITY  = 1 << 0,
     NVME_OACS_FORMAT    = 1 << 1,
@@ -908,6 +936,12 @@ typedef struct NvmeLBAF {
     uint8_t     rp;
 } NvmeLBAF;
 
+typedef struct NvmeLBAFE {
+    uint64_t    zsze;
+    uint8_t     zdes;
+    uint8_t     rsvd9[7];
+} NvmeLBAFE;
+
 typedef struct NvmeIdNs {
     uint64_t    nsze;
     uint64_t    ncap;
@@ -930,6 +964,19 @@ typedef struct NvmeIdNs {
     uint8_t     vs[3712];
 } NvmeIdNs;
 
+typedef struct NvmeIdNsZoned {
+    uint16_t    zoc;
+    uint16_t    ozcs;
+    uint32_t    mar;
+    uint32_t    mor;
+    uint32_t    rrl;
+    uint32_t    frl;
+    uint8_t     rsvd20[2796];
+    NvmeLBAFE   lbafe[16];
+    uint8_t     rsvd3072[768];
+    uint8_t     vs[256];
+} NvmeIdNsZoned;
+
 
 /*Deallocate Logical Block Features*/
 #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
@@ -962,6 +1009,71 @@ enum NvmeIdNsDps {
     DPS_FIRST_EIGHT = 8,
 };
 
+enum NvmeZoneAttr {
+    NVME_ZA_FINISHED_BY_CTLR         = 1 << 0,
+    NVME_ZA_FINISH_RECOMMENDED       = 1 << 1,
+    NVME_ZA_RESET_RECOMMENDED        = 1 << 2,
+    NVME_ZA_ZD_EXT_VALID             = 1 << 7,
+};
+
+typedef struct NvmeZoneReportHeader {
+    uint64_t    nr_zones;
+    uint8_t     rsvd[56];
+} NvmeZoneReportHeader;
+
+enum NvmeZoneReceiveAction {
+    NVME_ZONE_REPORT                 = 0,
+    NVME_ZONE_REPORT_EXTENDED        = 1,
+};
+
+enum NvmeZoneReportType {
+    NVME_ZONE_REPORT_ALL             = 0,
+    NVME_ZONE_REPORT_EMPTY           = 1,
+    NVME_ZONE_REPORT_IMPLICITLY_OPEN = 2,
+    NVME_ZONE_REPORT_EXPLICITLY_OPEN = 3,
+    NVME_ZONE_REPORT_CLOSED          = 4,
+    NVME_ZONE_REPORT_FULL            = 5,
+    NVME_ZONE_REPORT_READ_ONLY       = 6,
+    NVME_ZONE_REPORT_OFFLINE         = 7,
+};
+
+typedef struct NvmeZoneDescr {
+    uint8_t     zt;
+    uint8_t     zs;
+    uint8_t     za;
+    uint8_t     rsvd3[5];
+    uint64_t    zcap;
+    uint64_t    zslba;
+    uint64_t    wp;
+    uint8_t     rsvd32[32];
+} NvmeZoneDescr;
+
+enum NvmeZoneState {
+    NVME_ZONE_STATE_RESERVED         = 0x00,
+    NVME_ZONE_STATE_EMPTY            = 0x01,
+    NVME_ZONE_STATE_IMPLICITLY_OPEN  = 0x02,
+    NVME_ZONE_STATE_EXPLICITLY_OPEN  = 0x03,
+    NVME_ZONE_STATE_CLOSED           = 0x04,
+    NVME_ZONE_STATE_READ_ONLY        = 0x0D,
+    NVME_ZONE_STATE_FULL             = 0x0E,
+    NVME_ZONE_STATE_OFFLINE          = 0x0F,
+};
+
+enum NvmeZoneType {
+    NVME_ZONE_TYPE_RESERVED          = 0x00,
+    NVME_ZONE_TYPE_SEQ_WRITE         = 0x02,
+};
+
+enum NvmeZoneSendAction {
+    NVME_ZONE_ACTION_RSD             = 0x00,
+    NVME_ZONE_ACTION_CLOSE           = 0x01,
+    NVME_ZONE_ACTION_FINISH          = 0x02,
+    NVME_ZONE_ACTION_OPEN            = 0x03,
+    NVME_ZONE_ACTION_RESET           = 0x04,
+    NVME_ZONE_ACTION_OFFLINE         = 0x05,
+    NVME_ZONE_ACTION_SET_ZD_EXT      = 0x10,
+};
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
@@ -978,8 +1090,13 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrlZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeNsIdDesc) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAF) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAFE) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
 }
 #endif
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 09/18] hw/block/nvme: Define Zoned NS Command Set trace events
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (7 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-06-30 12:14   ` Klaus Jensen
  2020-06-17 21:34 ` [PATCH v2 10/18] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

The Zoned Namespace Command Set / Namespace Types implementation that
is being introduced in this series adds a good number of trace events.
Combine all tracepoint definitions into a separate patch to make
reviewing more convenient.
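
For reference, the new tracepoints can be enabled at run time with
QEMU's trace pattern matching, for instance (a hypothetical
invocation; the zoned device properties used here are introduced
later in the series):

    qemu-system-x86_64 -trace "pci_nvme_zone*" -trace "pci_nvme_err_zone*" \
        -device nvme,drive=nvme0,serial=zns0,zoned=true ...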

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/trace-events | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 3f3323fe38..984db8a20c 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -66,6 +66,31 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
 pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
 pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
+pci_nvme_open_zone(uint64_t slba, uint32_t zone_idx, int all) "open zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_close_zone(uint64_t slba, uint32_t zone_idx, int all) "close zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_finish_zone(uint64_t slba, uint32_t zone_idx, int all) "finish zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_reset_zone(uint64_t slba, uint32_t zone_idx, int all) "reset zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_offline_zone(uint64_t slba, uint32_t zone_idx, int all) "offline zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
+pci_nvme_set_descriptor_extension(uint64_t slba, uint32_t zone_idx) "set zone descriptor extension, slba=%"PRIu64", idx=%"PRIu32""
+pci_nvme_zone_reset_recommended(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zone_reset_internal_op(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zone_finish_recommended(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zone_finish_internal_op(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zone_finished_by_controller(uint64_t slba) "slba=%"PRIu64""
+pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_idx=%"PRIu32""
+pci_nvme_power_on_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
+pci_nvme_power_on_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
+pci_nvme_power_on_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Full state"
+pci_nvme_zone_ae_not_enabled(int info, int log_page, int nsid) "zone async event not enabled, info=0x%"PRIx32", lp=0x%"PRIx32", nsid=%"PRIu32""
+pci_nvme_zone_ae_not_cleared(int info, int log_page, int nsid) "zone async event not cleared, info=0x%"PRIx32", lp=0x%"PRIx32", nsid=%"PRIu32""
+pci_nvme_zone_aen_not_requested(uint32_t oaes) "zone descriptor AEN are not requested by host, oaes=0x%"PRIx32""
+pci_nvme_getfeat_aen_cfg(uint64_t res) "reporting async event config res=%"PRIu64""
+pci_nvme_setfeat_zone_info_aer_on(void) "zone info change notices enabled"
+pci_nvme_setfeat_zone_info_aer_off(void) "zone info change notices disabled"
+pci_nvme_changed_zone_log_read(uint16_t nsid) "changed zone list log of ns %"PRIu16""
+pci_nvme_reporting_changed_zone(uint64_t zslba, uint8_t za) "zslba=%"PRIu64", attr=0x%"PRIx8""
+pci_nvme_empty_changed_zone_list(void) "no changed zones to report"
+pci_nvme_mapped_zone_file(char *zfile_name, int ret) "mapped zone file %s, error %d"
 
 # nvme traces for error conditions
 pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
@@ -77,10 +102,25 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not w
 pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+pci_nvme_err_capacity_exceeded(uint64_t zone_id, uint64_t nr_zones) "zone capacity exceeded, zone_id=%"PRIu64", nr_zones=%"PRIu64""
+pci_nvme_err_unaligned_zone_cmd(uint8_t action, uint64_t slba, uint64_t zslba) "unaligned zone op 0x%"PRIx8", got slba=%"PRIu64", zslba=%"PRIu64""
+pci_nvme_err_invalid_zone_state_transition(uint8_t state, uint8_t action, uint64_t slba, uint8_t attrs) "0x%"PRIx8"->0x%"PRIx8", slba=%"PRIu64", attrs=0x%"PRIx8""
+pci_nvme_err_write_not_at_wp(uint64_t slba, uint64_t zone, uint64_t wp) "writing at slba=%"PRIu64", zone=%"PRIu64", but wp=%"PRIu64""
+pci_nvme_err_append_not_at_start(uint64_t slba, uint64_t zone) "appending at slba=%"PRIu64", but zone=%"PRIu64""
+pci_nvme_err_zone_write_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx32""
+pci_nvme_err_zone_read_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx32""
+pci_nvme_err_append_too_large(uint64_t slba, uint32_t nlb, uint8_t zamds) "slba=%"PRIu64", nlb=%"PRIu32", zamds=%"PRIu8""
+pci_nvme_err_insuff_active_res(uint32_t max_active) "max_active=%"PRIu32" zone limit exceeded"
+pci_nvme_err_insuff_open_res(uint32_t max_open) "max_open=%"PRIu32" zone limit exceeded"
+pci_nvme_err_zone_file_invalid(int error) "validation error=%"PRIi32""
+pci_nvme_err_zd_extension_map_error(uint32_t zone_idx) "can't map descriptor extension for zone_idx=%"PRIu32""
+pci_nvme_err_invalid_changed_zone_list_offset(uint64_t ofs) "changed zone list log offset must be 0, got %"PRIu64""
+pci_nvme_err_invalid_changed_zone_list_len(uint32_t len) "changed zone list log size is 4096, got %"PRIu32""
 pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
 pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and effects log size is 4096, got %"PRIu32""
 pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
 pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
+pci_nvme_err_only_zoned_cmd_set_avail(void) "setting 000b CC.CSS, but only ZONED+NVM command set is enabled"
 pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
 pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
 pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
@@ -113,6 +153,7 @@ pci_nvme_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_
 pci_nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
 pci_nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
 pci_nvme_err_startfail(void) "setting controller enable bit failed"
+pci_nvme_err_invalid_mgmt_action(int action) "action=0x%"PRIx32""
 
 # Traces for undefined behavior
 pci_nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 10/18] hw/block/nvme: Support Zoned Namespace Command Set
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (8 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 09/18] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-06-30 13:31   ` Klaus Jensen
  2020-06-17 21:34 ` [PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

The driver has been changed to advertise the NVM Command Set when the
"zoned" device property is not set (the default) and the Zoned
Namespace Command Set otherwise.

Handlers for the three new NVMe commands introduced in the Zoned
Namespace Command Set specification are added, namely Zone Management
Receive, Zone Management Send and Zone Append.

Driver initialization code has been extended to create a proper
configuration for zoned operation using driver properties.

The Read/Write command handler is modified to only allow writes at
the write pointer if the namespace is zoned. For the Zone Append
command, writes implicitly happen at the write pointer and the
starting write pointer value is returned as the result of the command.
The Write Zeroes handler is modified to add zoned checks that are
identical to those done as a part of the Write flow.
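
To make the Zone Append semantics concrete, a minimal sketch follows
(zone_append_lba() is a hypothetical helper, not code from this patch;
it only mirrors the checks done in the nvme_rw() handler below):

    /*
     * Zone Append must target the zone start LBA (zslba). The data is
     * written at the current write pointer and the pre-write value of
     * the write pointer is returned as the 64-bit command result.
     */
    static uint64_t zone_append_lba(NvmeZone *zone, uint64_t slba,
                                    uint64_t *result64)
    {
        assert(slba == zone->d.zslba);       /* else Zone Invalid Write */
        *result64 = cpu_to_le64(zone->d.wp); /* reported in cqe.result64 */
        return zone->d.wp;                   /* actual write location */
    }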

The code to support Zone Descriptor Extensions is not included in
this commit and the driver always reports ZDES 0. A later commit in
this series will add ZDE support.

This commit doesn't yet include checks for active and open zone
limits. It is assumed that there are no limits on either active or
open zones.
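
With the new properties in place, a zoned device can be configured on
the command line, for example (hypothetical values; per the
initialization code below, zone_size and zone_capacity are given in
MiB and zone_append_max_size in KiB):

    -drive file=zns.img,id=nvme0,format=raw,if=none \
    -device nvme,drive=nvme0,serial=zns0,zoned=true,zone_size=128,zone_capacity=128,fill_pattern=0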

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 963 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 933 insertions(+), 30 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 453f4747a5..2e03b0b6ed 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -37,6 +37,7 @@
 #include "qemu/osdep.h"
 #include "qemu/units.h"
 #include "qemu/error-report.h"
+#include "crypto/random.h"
 #include "hw/block/block.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
@@ -69,6 +70,98 @@
 
 static void nvme_process_sq(void *opaque);
 
+/*
+ * Add a zone to the tail of a zone list.
+ */
+static void nvme_add_zone_tail(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
+    NvmeZone *zone)
+{
+    uint32_t idx = (uint32_t)(zone - ns->zone_array);
+
+    assert(nvme_zone_not_in_list(zone));
+
+    if (!zl->size) {
+        zl->head = zl->tail = idx;
+        zone->next = zone->prev = NVME_ZONE_LIST_NIL;
+    } else {
+        ns->zone_array[zl->tail].next = idx;
+        zone->prev = zl->tail;
+        zone->next = NVME_ZONE_LIST_NIL;
+        zl->tail = idx;
+    }
+    zl->size++;
+}
+
+/*
+ * Remove a zone from a zone list. The zone must be linked in the list.
+ */
+static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
+    NvmeZone *zone)
+{
+    uint32_t idx = (uint32_t)(zone - ns->zone_array);
+
+    assert(!nvme_zone_not_in_list(zone));
+
+    --zl->size;
+    if (zl->size == 0) {
+        zl->head = NVME_ZONE_LIST_NIL;
+        zl->tail = NVME_ZONE_LIST_NIL;
+    } else if (idx == zl->head) {
+        zl->head = zone->next;
+        ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+    } else if (idx == zl->tail) {
+        zl->tail = zone->prev;
+        ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
+    } else {
+        ns->zone_array[zone->next].prev = zone->prev;
+        ns->zone_array[zone->prev].next = zone->next;
+    }
+
+    zone->prev = zone->next = 0;
+}
+
+static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state)
+{
+    if (!nvme_zone_not_in_list(zone)) {
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_remove_zone(n, ns, ns->exp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            nvme_remove_zone(n, ns, ns->imp_open_zones, zone);
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_remove_zone(n, ns, ns->closed_zones, zone);
+            break;
+        case NVME_ZONE_STATE_FULL:
+            nvme_remove_zone(n, ns, ns->full_zones, zone);
+        }
+    }
+
+    nvme_set_zone_state(zone, state);
+
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        nvme_add_zone_tail(n, ns, ns->exp_open_zones, zone);
+        break;
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_add_zone_tail(n, ns, ns->imp_open_zones, zone);
+        break;
+    case NVME_ZONE_STATE_CLOSED:
+        nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
+        break;
+    case NVME_ZONE_STATE_FULL:
+        nvme_add_zone_tail(n, ns, ns->full_zones, zone);
+        break;
+    default:
+        zone->d.za = 0;
+        /* fall through */
+    case NVME_ZONE_STATE_READ_ONLY:
+        zone->tstamp = 0;
+    }
+}
+
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
     hwaddr low = n->ctrl_mem.addr;
@@ -314,6 +407,7 @@ static void nvme_post_cqes(void *opaque)
 
         QTAILQ_REMOVE(&cq->req_list, req, entry);
         sq = req->sq;
+
         req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
         req->cqe.sq_id = cpu_to_le16(sq->sqid);
         req->cqe.sq_head = cpu_to_le16(sq->head);
@@ -328,6 +422,30 @@ static void nvme_post_cqes(void *opaque)
     }
 }
 
+static void nvme_fill_data(QEMUSGList *qsg, QEMUIOVector *iov,
+    uint64_t offset, uint8_t pattern)
+{
+    ScatterGatherEntry *entry;
+    uint32_t len, ent_len;
+
+    if (qsg->nsg > 0) {
+        entry = qsg->sg;
+        for (len = qsg->size; len > 0; len -= ent_len) {
+            ent_len = MIN(len, entry->len);
+            if (offset > ent_len) {
+                offset -= ent_len;
+            } else if (offset != 0) {
+                dma_memory_set(qsg->as, entry->base + offset,
+                               pattern, ent_len - offset);
+                offset = 0;
+            } else {
+                dma_memory_set(qsg->as, entry->base, pattern, ent_len);
+            }
+            entry++;
+        }
+    }
+}
+
 static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 {
     assert(cq->cqid == req->sq->cqid);
@@ -336,6 +454,114 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
     timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
+    uint32_t nlb)
+{
+    uint16_t status;
+
+    if (unlikely((slba + nlb) > nvme_zone_wr_boundary(zone))) {
+        return NVME_ZONE_BOUNDARY_ERROR;
+    }
+
+    switch (nvme_get_zone_state(zone)) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+        status = NVME_SUCCESS;
+        break;
+    case NVME_ZONE_STATE_FULL:
+        status = NVME_ZONE_FULL;
+        break;
+    case NVME_ZONE_STATE_OFFLINE:
+        status = NVME_ZONE_OFFLINE;
+        break;
+    case NVME_ZONE_STATE_READ_ONLY:
+        status = NVME_ZONE_READ_ONLY;
+        break;
+    default:
+        assert(false);
+    }
+    return status;
+}
+
+static uint16_t nvme_check_zone_read(NvmeCtrl *n, NvmeZone *zone, uint64_t slba,
+    uint32_t nlb, bool zone_x_ok)
+{
+    uint64_t lba = slba, count;
+    uint16_t status;
+    uint8_t zs;
+
+    do {
+        if (!zone_x_ok && (lba + nlb > nvme_zone_rd_boundary(n, zone))) {
+            return NVME_ZONE_BOUNDARY_ERROR | NVME_DNR;
+        }
+
+        zs = nvme_get_zone_state(zone);
+        switch (zs) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        case NVME_ZONE_STATE_FULL:
+        case NVME_ZONE_STATE_CLOSED:
+        case NVME_ZONE_STATE_READ_ONLY:
+            status = NVME_SUCCESS;
+            break;
+        case NVME_ZONE_STATE_OFFLINE:
+            status = NVME_ZONE_OFFLINE | NVME_DNR;
+            break;
+        default:
+            assert(false);
+        }
+        if (status != NVME_SUCCESS) {
+            break;
+        }
+
+        if (lba + nlb > nvme_zone_rd_boundary(n, zone)) {
+            count = nvme_zone_rd_boundary(n, zone) - lba;
+        } else {
+            count = nlb;
+        }
+
+        lba += count;
+        nlb -= count;
+        zone++;
+    } while (nlb);
+
+    return status;
+}
+
+static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint32_t nlb)
+{
+    uint64_t result = cpu_to_le64(zone->d.wp);
+    uint8_t zs = nvme_get_zone_state(zone);
+
+    zone->d.wp += nlb;
+
+    if (zone->d.wp == nvme_zone_wr_boundary(zone)) {
+        switch (zs) {
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        case NVME_ZONE_STATE_CLOSED:
+        case NVME_ZONE_STATE_EMPTY:
+            break;
+        default:
+            assert(false);
+        }
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
+    } else {
+        switch (zs) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_assign_zone_state(n, ns, zone,
+                                   NVME_ZONE_STATE_IMPLICITLY_OPEN);
+        }
+    }
+
+    return result;
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
@@ -344,6 +570,10 @@ static void nvme_rw_cb(void *opaque, int ret)
     NvmeCQueue *cq = n->cq[sq->cqid];
 
     if (!ret) {
+        if (req->flags & NVME_REQ_FLG_FILL) {
+            nvme_fill_data(&req->qsg, &req->iov, req->fill_ofs,
+                           n->params.fill_pattern);
+        }
         block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
         req->status = NVME_SUCCESS;
     } else {
@@ -370,22 +600,53 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
     NvmeRequest *req)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
+    NvmeZone *zone = NULL;
     const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
     const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
     uint64_t slba = le64_to_cpu(rw->slba);
     uint32_t nlb  = le16_to_cpu(rw->nlb) + 1;
+    uint64_t zone_idx;
     uint64_t offset = slba << data_shift;
     uint32_t count = nlb << data_shift;
+    uint16_t status;
 
     if (unlikely(slba + nlb > ns->id_ns.nsze)) {
         trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
+    if (n->params.zoned) {
+        zone_idx = slba / n->params.zone_size;
+        if (unlikely(zone_idx >= n->num_zones)) {
+            trace_pci_nvme_err_capacity_exceeded(zone_idx, n->num_zones);
+            return NVME_CAP_EXCEEDED | NVME_DNR;
+        }
+
+        zone = &ns->zone_array[zone_idx];
+
+        status = nvme_check_zone_write(zone, slba, nlb);
+        if (status != NVME_SUCCESS) {
+            trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
+            return status | NVME_DNR;
+        }
+
+        assert(nvme_wp_is_valid(zone));
+        if (unlikely(slba != zone->d.wp)) {
+            trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
+                                               zone->d.wp);
+            return NVME_ZONE_INVALID_WRITE | NVME_DNR;
+        }
+    }
+
     block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
                      BLOCK_ACCT_WRITE);
     req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
                                         BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
+
+    if (n->params.zoned) {
+        req->cqe.result64 = nvme_finalize_zone_write(n, ns, zone, nlb);
+    }
+
     return NVME_NO_COMPLETE;
 }
 
@@ -393,16 +654,19 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
     NvmeRequest *req)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
+    NvmeZone *zone = NULL;
     uint32_t nlb  = le32_to_cpu(rw->nlb) + 1;
     uint64_t slba = le64_to_cpu(rw->slba);
     uint64_t prp1 = le64_to_cpu(rw->prp1);
     uint64_t prp2 = le64_to_cpu(rw->prp2);
-
+    uint64_t zone_idx = 0;
+    uint16_t status;
     uint8_t lba_index  = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
     uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
     uint64_t data_size = (uint64_t)nlb << data_shift;
-    uint64_t data_offset = slba << data_shift;
-    int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
+    uint64_t data_offset;
+    bool is_write = rw->opcode == NVME_CMD_WRITE ||
+                    (req->flags & NVME_REQ_FLG_APPEND);
     enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
 
     trace_pci_nvme_rw(is_write ? "write" : "read", nlb, data_size, slba);
@@ -413,11 +677,79 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
+    if (n->params.zoned) {
+        zone_idx = slba / n->params.zone_size;
+        if (unlikely(zone_idx >= n->num_zones)) {
+            trace_pci_nvme_err_capacity_exceeded(zone_idx, n->num_zones);
+            return NVME_CAP_EXCEEDED | NVME_DNR;
+        }
+
+        zone = &ns->zone_array[zone_idx];
+
+        if (is_write) {
+            status = nvme_check_zone_write(zone, slba, nlb);
+            if (status != NVME_SUCCESS) {
+                trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
+                return status | NVME_DNR;
+            }
+
+            assert(nvme_wp_is_valid(zone));
+            if (req->flags & NVME_REQ_FLG_APPEND) {
+                if (unlikely(slba != zone->d.zslba)) {
+                    trace_pci_nvme_err_append_not_at_start(slba, zone->d.zslba);
+                    return NVME_ZONE_INVALID_WRITE | NVME_DNR;
+                }
+                if (data_size > (n->page_size << n->zamds)) {
+                    trace_pci_nvme_err_append_too_large(slba, nlb, n->zamds);
+                    return NVME_INVALID_FIELD | NVME_DNR;
+                }
+                slba = zone->d.wp;
+            } else if (unlikely(slba != zone->d.wp)) {
+                trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
+                                                   zone->d.wp);
+                return NVME_ZONE_INVALID_WRITE | NVME_DNR;
+            }
+        } else {
+            status = nvme_check_zone_read(n, zone, slba, nlb,
+                                          n->params.cross_zone_read);
+            if (status != NVME_SUCCESS) {
+                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
+                return status | NVME_DNR;
+            }
+
+            if (slba + nlb > zone->d.wp) {
+                /*
+                 * All or some data is read above the WP. Need to
+                 * fill out the buffer area that has no backing data
+                 * with a predefined data pattern (zeros by default)
+                 */
+                req->flags |= NVME_REQ_FLG_FILL;
+                if (slba >= zone->d.wp) {
+                    req->fill_ofs = 0;
+                } else {
+                    req->fill_ofs = ((zone->d.wp - slba) << data_shift);
+                }
+            }
+        }
+    } else if (req->flags & NVME_REQ_FLG_APPEND) {
+        trace_pci_nvme_err_invalid_opc(cmd->opcode);
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
     if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
         block_acct_invalid(blk_get_stats(n->conf.blk), acct);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
+    if (unlikely(!is_write && (req->flags & NVME_REQ_FLG_FILL) &&
+                 (req->fill_ofs == 0))) {
+        /* No backend I/O necessary, only need to fill the buffer */
+        nvme_fill_data(&req->qsg, &req->iov, 0, n->params.fill_pattern);
+        req->status = NVME_SUCCESS;
+        return NVME_SUCCESS;
+    }
+
+    data_offset = slba << data_shift;
     dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
     if (req->qsg.nsg > 0) {
         req->flags |= NVME_REQ_FLG_HAS_SG;
@@ -434,9 +766,383 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
                            req);
     }
 
+    if (is_write && n->params.zoned) {
+        req->cqe.result64 = nvme_finalize_zone_write(n, ns, zone, nlb);
+    }
+
     return NVME_NO_COMPLETE;
 }
 
+static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeCmd *c, uint64_t *slba, uint64_t *zone_idx)
+{
+    uint32_t dw10 = le32_to_cpu(c->cdw10);
+    uint32_t dw11 = le32_to_cpu(c->cdw11);
+
+    if (!n->params.zoned) {
+        trace_pci_nvme_err_invalid_opc(c->opcode);
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
+    *slba = ((uint64_t)dw11) << 32 | dw10;
+    if (unlikely(*slba >= ns->id_ns.nsze)) {
+        trace_pci_nvme_err_invalid_lba_range(*slba, 0, ns->id_ns.nsze);
+        *slba = 0;
+        return NVME_LBA_RANGE | NVME_DNR;
+    }
+
+    *zone_idx = *slba / n->params.zone_size;
+    if (unlikely(*zone_idx >= n->num_zones)) {
+        trace_pci_nvme_err_capacity_exceeded(*zone_idx, n->num_zones);
+        *zone_idx = 0;
+        return NVME_CAP_EXCEEDED | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EMPTY:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
+        /* fall through */
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_open_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_CLOSED;
+}
+
+static uint16_t nvme_close_zone(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
+        /* fall through */
+    case NVME_ZONE_STATE_CLOSED:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_close_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN;
+}
+
+static uint16_t nvme_finish_zone(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_EMPTY:
+        zone->d.wp = nvme_zone_wr_boundary(zone);
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
+        /* fall through */
+    case NVME_ZONE_STATE_FULL:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_finish_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_CLOSED;
+}
+
+static uint16_t nvme_reset_zone(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+    case NVME_ZONE_STATE_CLOSED:
+    case NVME_ZONE_STATE_FULL:
+        zone->d.wp = zone->d.zslba;
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EMPTY);
+        /* fall through */
+    case NVME_ZONE_STATE_EMPTY:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_reset_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
+           state == NVME_ZONE_STATE_CLOSED ||
+           state == NVME_ZONE_STATE_FULL;
+}
+
+static uint16_t nvme_offline_zone(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state)
+{
+    switch (state) {
+    case NVME_ZONE_STATE_READ_ONLY:
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_OFFLINE);
+        /* fall through */
+    case NVME_ZONE_STATE_OFFLINE:
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
+static bool nvme_cond_offline_all(uint8_t state)
+{
+    return state == NVME_ZONE_STATE_READ_ONLY;
+}
+
+static uint16_t nvme_do_zone_op(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state, bool all,
+    uint16_t (*op_hndlr)(NvmeCtrl *, NvmeNamespace *, NvmeZone *,
+                         uint8_t), bool (*proc_zone)(uint8_t))
+{
+    int i;
+    uint16_t status = 0;
+
+    if (!all) {
+        status = op_hndlr(n, ns, zone, state);
+    } else {
+        for (i = 0; i < n->num_zones; i++, zone++) {
+            state = nvme_get_zone_state(zone);
+            if (proc_zone(state)) {
+                status = op_hndlr(n, ns, zone, state);
+                if (status != NVME_SUCCESS) {
+                    break;
+                }
+            }
+        }
+    }
+
+    return status;
+}
+
+static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeCmd *cmd, NvmeRequest *req)
+{
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint64_t slba = 0;
+    uint64_t zone_idx = 0;
+    uint16_t status;
+    uint8_t action, state;
+    bool all;
+    NvmeZone *zone;
+
+    action = dw13 & 0xff;
+    all = dw13 & 0x100;
+
+    req->status = NVME_SUCCESS;
+
+    if (!all) {
+        status = nvme_get_mgmt_zone_slba_idx(n, ns, cmd, &slba, &zone_idx);
+        if (status) {
+            return status;
+        }
+    }
+
+    zone = &ns->zone_array[zone_idx];
+    if (slba != zone->d.zslba) {
+        trace_pci_nvme_err_unaligned_zone_cmd(action, slba, zone->d.zslba);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+    state = nvme_get_zone_state(zone);
+
+    switch (action) {
+
+    case NVME_ZONE_ACTION_OPEN:
+        trace_pci_nvme_open_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_open_zone, nvme_cond_open_all);
+        break;
+
+    case NVME_ZONE_ACTION_CLOSE:
+        trace_pci_nvme_close_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_close_zone, nvme_cond_close_all);
+        break;
+
+    case NVME_ZONE_ACTION_FINISH:
+        trace_pci_nvme_finish_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_finish_zone, nvme_cond_finish_all);
+        break;
+
+    case NVME_ZONE_ACTION_RESET:
+        trace_pci_nvme_reset_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_reset_zone, nvme_cond_reset_all);
+        break;
+
+    case NVME_ZONE_ACTION_OFFLINE:
+        trace_pci_nvme_offline_zone(slba, zone_idx, all);
+        status = nvme_do_zone_op(n, ns, zone, state, all,
+                                 nvme_offline_zone, nvme_cond_offline_all);
+        break;
+
+    case NVME_ZONE_ACTION_SET_ZD_EXT:
+        trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
+        return NVME_INVALID_FIELD | NVME_DNR;
+        break;
+
+    default:
+        trace_pci_nvme_err_invalid_mgmt_action(action);
+        status = NVME_INVALID_FIELD;
+    }
+
+    if (status == NVME_ZONE_INVAL_TRANSITION) {
+        trace_pci_nvme_err_invalid_zone_state_transition(state, action, slba,
+                                                         zone->d.za);
+    }
+    if (status) {
+        status |= NVME_DNR;
+    }
+
+    return status;
+}
+
+static bool nvme_zone_matches_filter(uint32_t zafs, NvmeZone *zl)
+{
+    int zs = nvme_get_zone_state(zl);
+
+    switch (zafs) {
+    case NVME_ZONE_REPORT_ALL:
+        return true;
+    case NVME_ZONE_REPORT_EMPTY:
+        return (zs == NVME_ZONE_STATE_EMPTY);
+    case NVME_ZONE_REPORT_IMPLICITLY_OPEN:
+        return (zs == NVME_ZONE_STATE_IMPLICITLY_OPEN);
+    case NVME_ZONE_REPORT_EXPLICITLY_OPEN:
+        return (zs == NVME_ZONE_STATE_EXPLICITLY_OPEN);
+    case NVME_ZONE_REPORT_CLOSED:
+        return (zs == NVME_ZONE_STATE_CLOSED);
+    case NVME_ZONE_REPORT_FULL:
+        return (zs == NVME_ZONE_STATE_FULL);
+    case NVME_ZONE_REPORT_READ_ONLY:
+        return (zs == NVME_ZONE_STATE_READ_ONLY);
+    case NVME_ZONE_REPORT_OFFLINE:
+        return (zs == NVME_ZONE_STATE_OFFLINE);
+    default:
+        return false;
+    }
+}
+
+static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeCmd *cmd, NvmeRequest *req)
+{
+    uint64_t prp1 = le64_to_cpu(cmd->prp1);
+    uint64_t prp2 = le64_to_cpu(cmd->prp2);
+    /* cdw12 is the zero-based number of dwords to return. Convert to bytes */
+    uint32_t len = (le32_to_cpu(cmd->cdw12) + 1) << 2;
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint32_t zra, zrasf, partial;
+    uint64_t max_zones, zone_index, nr_zones = 0;
+    uint16_t ret;
+    uint64_t slba;
+    NvmeZoneDescr *z;
+    NvmeZone *zs;
+    NvmeZoneReportHeader *header;
+    void *buf, *buf_p;
+    size_t zone_entry_sz;
+
+    req->status = NVME_SUCCESS;
+
+    ret = nvme_get_mgmt_zone_slba_idx(n, ns, cmd, &slba, &zone_index);
+    if (ret) {
+        return ret;
+    }
+
+    if (len < sizeof(NvmeZoneReportHeader)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zra = dw13 & 0xff;
+    if (!(zra == NVME_ZONE_REPORT || zra == NVME_ZONE_REPORT_EXTENDED)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zrasf = (dw13 >> 8) & 0xff;
+    if (zrasf > NVME_ZONE_REPORT_OFFLINE) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    partial = (dw13 >> 16) & 0x01;
+
+    zone_entry_sz = sizeof(NvmeZoneDescr);
+
+    max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
+    buf = g_malloc0(len);
+
+    header = (NvmeZoneReportHeader *)buf;
+    buf_p = buf + sizeof(NvmeZoneReportHeader);
+
+    while (zone_index < n->num_zones && nr_zones < max_zones) {
+        zs = &ns->zone_array[zone_index];
+
+        if (!nvme_zone_matches_filter(zrasf, zs)) {
+            zone_index++;
+            continue;
+        }
+
+        z = (NvmeZoneDescr *)buf_p;
+        buf_p += sizeof(NvmeZoneDescr);
+        nr_zones++;
+
+        z->zt = zs->d.zt;
+        z->zs = zs->d.zs;
+        z->zcap = cpu_to_le64(zs->d.zcap);
+        z->zslba = cpu_to_le64(zs->d.zslba);
+        z->za = zs->d.za;
+
+        if (nvme_wp_is_valid(zs)) {
+            z->wp = cpu_to_le64(zs->d.wp);
+        } else {
+            z->wp = cpu_to_le64(~0ULL);
+        }
+
+        zone_index++;
+    }
+
+    if (!partial) {
+        for (; zone_index < n->num_zones; zone_index++) {
+            zs = &ns->zone_array[zone_index];
+            if (nvme_zone_matches_filter(zrasf, zs)) {
+                nr_zones++;
+            }
+        }
+    }
+    header->nr_zones = cpu_to_le64(nr_zones);
+
+    ret = nvme_dma_read_prp(n, (uint8_t *)buf, len, prp1, prp2);
+    g_free(buf);
+
+    return ret;
+}
+
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
     NvmeNamespace *ns;
@@ -453,9 +1159,16 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return nvme_flush(n, ns, cmd, req);
     case NVME_CMD_WRITE_ZEROS:
         return nvme_write_zeros(n, ns, cmd, req);
+    case NVME_CMD_ZONE_APND:
+        req->flags |= NVME_REQ_FLG_APPEND;
+        /* fall through */
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
         return nvme_rw(n, ns, cmd, req);
+    case NVME_CMD_ZONE_MGMT_SEND:
+        return nvme_zone_mgmt_send(n, ns, cmd, req);
+    case NVME_CMD_ZONE_MGMT_RECV:
+        return nvme_zone_mgmt_recv(n, ns, cmd, req);
     default:
         trace_pci_nvme_err_invalid_opc(cmd->opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -675,6 +1388,16 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
     return NVME_SUCCESS;
 }
 
+static inline bool nvme_csi_has_nvm_support(NvmeNamespace *ns)
+{
+    switch (ns->csi) {
+    case NVME_CSI_NVM:
+    case NVME_CSI_ZONED:
+        return true;
+    }
+    return false;
+}
+
 static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
 {
     uint64_t prp1 = le64_to_cpu(c->prp1);
@@ -701,6 +1424,12 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeIdentify *c)
         ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
         g_free(list);
         return ret;
+    } else if (c->csi == NVME_CSI_ZONED && n->params.zoned) {
+        NvmeIdCtrlZoned *id = g_malloc0(sizeof(*id));
+        id->zamds = n->zamds;
+        ret = nvme_dma_read_prp(n, (uint8_t *)id, sizeof(*id), prp1, prp2);
+        g_free(id);
+        return ret;
     } else {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
@@ -723,8 +1452,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
     ns = &n->namespaces[nsid - 1];
     assert(nsid == ns->nsid);
 
-    return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
-        prp1, prp2);
+    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
+        return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
+            prp1, prp2);
+    }
+
+    return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeIdentify *c)
@@ -747,14 +1480,17 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeIdentify *c)
     ns = &n->namespaces[nsid - 1];
     assert(nsid == ns->nsid);
 
-    if (c->csi == NVME_CSI_NVM) {
+    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
         list = g_malloc0(data_len);
         ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
         g_free(list);
         return ret;
-    } else {
-        return NVME_INVALID_FIELD | NVME_DNR;
+    } else if (c->csi == NVME_CSI_ZONED && ns->csi == NVME_CSI_ZONED) {
+        return nvme_dma_read_prp(n, (uint8_t *)ns->id_ns_zoned,
+                                 sizeof(*ns->id_ns_zoned), prp1, prp2);
     }
+
+    return NVME_INVALID_FIELD | NVME_DNR;
 }
 
 static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
@@ -796,13 +1532,13 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeIdentify *c)
 
     trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
 
-    if (c->csi != NVME_CSI_NVM) {
+    if (c->csi != NVME_CSI_NVM && c->csi != NVME_CSI_ZONED) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
     list = g_malloc0(data_len);
     for (i = 0; i < n->num_namespaces; i++) {
-        if (i < min_nsid) {
+        if (i < min_nsid || c->csi != n->namespaces[i].csi) {
             continue;
         }
         list[j++] = cpu_to_le32(i + 1);
@@ -851,7 +1587,7 @@ static uint16_t nvme_list_ns_descriptors(NvmeCtrl *n, NvmeIdentify *c)
     desc->nidt = NVME_NIDT_CSI;
     desc->nidl = NVME_NIDL_CSI;
     buf_ptr += sizeof(*desc);
-    *(uint8_t *)buf_ptr = NVME_CSI_NVM;
+    *(uint8_t *)buf_ptr = ns->csi;
 
     status = nvme_dma_read_prp(n, buf, data_len, prp1, prp2);
     g_free(buf);
@@ -872,6 +1608,9 @@ static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeIdentify *c)
     list = g_malloc0(data_len);
     ptr = (uint8_t *)list;
     NVME_SET_CSI(*ptr, NVME_CSI_NVM);
+    if (n->params.zoned) {
+        NVME_SET_CSI(*ptr, NVME_CSI_ZONED);
+    }
     status = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
     g_free(list);
     return status;
@@ -1038,7 +1777,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 }
 
 static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
-    uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len)
+    uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len, uint8_t csi)
 {
    NvmeEffectsLog cmd_eff_log = {};
    uint32_t *iocs = cmd_eff_log.iocs;
@@ -1063,11 +1802,19 @@ static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
     iocs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
     iocs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
 
-    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_WRITE_ZEROS] = NVME_CMD_EFFECTS_CSUPP |
-                                 NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
-    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+    if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
+        iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_WRITE_ZEROS] = NVME_CMD_EFFECTS_CSUPP |
+                                     NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
+    }
+    if (csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_ALL_NSTYPES) {
+        iocs[NVME_CMD_ZONE_APND] = NVME_CMD_EFFECTS_CSUPP |
+                                   NVME_CMD_EFFECTS_LBCC;
+        iocs[NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFFECTS_CSUPP;
+        iocs[NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFFECTS_CSUPP;
+    }
 
     return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log, len, prp1, prp2);
 }
@@ -1083,6 +1830,7 @@ static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
     uint64_t ofs = (dw13 << 32) | dw12;
     uint32_t numdl, numdu, len;
     uint16_t lid = dw10 & 0xff;
+    uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
 
     numdl = dw10 >> 16;
     numdu = dw11 & 0xffff;
@@ -1090,8 +1838,8 @@ static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
 
     switch (lid) {
     case NVME_LOG_CMD_EFFECTS:
-        return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len);
-    }
+        return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len, csi);
+    }
 
     trace_pci_nvme_unsupported_log_page(lid);
     return NVME_INVALID_FIELD | NVME_DNR;
@@ -1255,6 +2003,14 @@ static int nvme_start_ctrl(NvmeCtrl *n)
         return -1;
     }
 
+    if (n->params.zoned) {
+        if (!n->params.zamds_bs) {
+            n->params.zamds_bs = NVME_DEFAULT_MAX_ZA_SIZE;
+        }
+        /* zamds_bs is given in KiB; don't scale params in place, it
+           would compound across controller resets */
+        n->zamds = nvme_ilog2(n->params.zamds_bs * KiB / page_size);
+    }
+
     n->page_bits = page_bits;
     n->page_size = page_size;
     n->max_prp_ents = n->page_size / sizeof(uint64_t);
@@ -1324,6 +2080,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
             }
             switch (NVME_CC_CSS(data)) {
             case CSS_NVM_ONLY:
+                if (n->params.zoned) {
+                    NVME_GUEST_ERR(pci_nvme_err_only_zoned_cmd_set_avail,
+                                   "only NVM+ZONED command set can be selected");
+                    break;
+                }
                 trace_pci_nvme_css_nvm_cset_selected_by_host(data & 0xffffffff);
                 break;
             case CSS_ALL_NSTYPES:
@@ -1609,6 +2370,120 @@ static const MemoryRegionOps nvme_cmb_ops = {
     },
 };
 
+static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
+    uint64_t capacity)
+{
+    NvmeZone *zone;
+    uint64_t start = 0, zone_size = n->params.zone_size;
+    int i;
+
+    ns->zone_array = g_malloc0(n->zone_array_size);
+    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+    zone = ns->zone_array;
+
+    nvme_init_zone_list(ns->exp_open_zones);
+    nvme_init_zone_list(ns->imp_open_zones);
+    nvme_init_zone_list(ns->closed_zones);
+    nvme_init_zone_list(ns->full_zones);
+
+    for (i = 0; i < n->num_zones; i++, zone++) {
+        if (start + zone_size > capacity) {
+            zone_size = capacity - start;
+        }
+        zone->d.zt = NVME_ZONE_TYPE_SEQ_WRITE;
+        nvme_set_zone_state(zone, NVME_ZONE_STATE_EMPTY);
+        zone->d.za = 0;
+        zone->d.zcap = n->params.zone_capacity;
+        zone->d.zslba = start;
+        zone->d.wp = start;
+        zone->prev = 0;
+        zone->next = 0;
+        start += zone_size;
+    }
+
+    return 0;
+}
+
+static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
+{
+    uint64_t zone_size = 0, capacity;
+    uint32_t nz;
+
+    if (n->params.zone_size) {
+        zone_size = n->params.zone_size;
+    } else {
+        zone_size = NVME_DEFAULT_ZONE_SIZE;
+    }
+    if (!n->params.zone_capacity) {
+        n->params.zone_capacity = zone_size;
+    }
+    n->zone_size_bs = zone_size * MiB;
+    n->params.zone_size = n->zone_size_bs / n->conf.logical_block_size;
+    capacity = n->params.zone_capacity * MiB;
+    n->params.zone_capacity = capacity / n->conf.logical_block_size;
+    if (n->params.zone_capacity > n->params.zone_size) {
+        error_setg(errp, "zone capacity exceeds zone size");
+        return;
+    }
+    zone_size = n->params.zone_size;
+
+    capacity = n->ns_size / n->conf.logical_block_size;
+    nz = DIV_ROUND_UP(capacity, zone_size);
+    n->num_zones = nz;
+    n->zone_array_size = sizeof(NvmeZone) * nz;
+
+    return;
+}
+
+static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
+    Error **errp)
+{
+    int ret;
+
+    ret = nvme_init_zone_meta(n, ns, n->num_zones * n->params.zone_size);
+    if (ret) {
+        error_setg(errp, "could not init zone metadata");
+        return -1;
+    }
+
+    ns->id_ns_zoned = g_malloc0(sizeof(*ns->id_ns_zoned));
+
+    /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
+    ns->id_ns_zoned->mar = 0xffffffff;
+    ns->id_ns_zoned->mor = 0xffffffff;
+    ns->id_ns_zoned->zoc = 0;
+    ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
+
+    ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->params.zone_size);
+    ns->id_ns_zoned->lbafe[lba_index].zdes = 0;
+
+    if (n->params.fill_pattern == 0) {
+        ns->id_ns.dlfeat = 0x01;
+    } else if (n->params.fill_pattern == 0xff) {
+        ns->id_ns.dlfeat = 0x02;
+    }
+
+    return 0;
+}
+
+static void nvme_zoned_clear(NvmeCtrl *n)
+{
+    int i;
+
+    for (i = 0; i < n->num_namespaces; i++) {
+        NvmeNamespace *ns = &n->namespaces[i];
+        g_free(ns->id_ns_zoned);
+        g_free(ns->zone_array);
+        g_free(ns->exp_open_zones);
+        g_free(ns->imp_open_zones);
+        g_free(ns->closed_zones);
+        g_free(ns->full_zones);
+    }
+}
+
 static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
     NvmeParams *params = &n->params;
@@ -1674,18 +2549,13 @@ static void nvme_init_state(NvmeCtrl *n)
 
 static void nvme_init_blk(NvmeCtrl *n, Error **errp)
 {
+    int64_t bs_size;
+
     if (!blkconf_blocksizes(&n->conf, errp)) {
         return;
     }
     blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
                                   false, errp);
-}
-
-static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
-{
-    int64_t bs_size;
-    NvmeIdNs *id_ns = &ns->id_ns;
-    int lba_index;
 
     bs_size = blk_getlength(n->conf.blk);
     if (bs_size < 0) {
@@ -1694,6 +2564,12 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     }
 
     n->ns_size = bs_size;
+}
+
+static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+{
+    NvmeIdNs *id_ns = &ns->id_ns;
+    int lba_index;
 
     ns->csi = NVME_CSI_NVM;
     qemu_uuid_generate(&ns->uuid); /* TODO make UUIDs persistent */
@@ -1701,8 +2577,18 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     id_ns->lbaf[lba_index].ds = nvme_ilog2(n->conf.logical_block_size);
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
 
+    if (n->params.zoned) {
+        ns->csi = NVME_CSI_ZONED;
+        id_ns->ncap = cpu_to_le64(n->params.zone_capacity * n->num_zones);
+        if (nvme_zoned_init_ns(n, ns, lba_index, errp) != 0) {
+            return;
+        }
+    } else {
+        ns->csi = NVME_CSI_NVM;
+        id_ns->ncap = id_ns->nsze;
+    }
+
     /* no thin provisioning */
-    id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
 }
 
@@ -1817,7 +2703,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->ieee[2] = 0xb3;
     id->oacs = cpu_to_le16(0);
     id->frmw = 7 << 1;
-    id->lpa = 1 << 0;
+    id->lpa = 1 << 1;
     id->sqes = (0x6 << 4) | 0x6;
     id->cqes = (0x4 << 4) | 0x4;
     id->nn = cpu_to_le32(n->num_namespaces);
@@ -1834,8 +2720,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_CQR(n->bar.cap, 1);
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
     /*
-     * The driver now always supports NS Types, but all commands that
-     * support CSI field will only handle NVM Command Set.
+     * The driver now always supports NS Types, even when the "zoned"
+     * property is not set. In that case, all commands that support the
+     * CSI field only handle the NVM Command Set.
      */
     NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
@@ -1871,6 +2758,13 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
         return;
     }
 
+    if (n->params.zoned) {
+        nvme_zoned_init_ctrl(n, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
     nvme_init_ctrl(n, pci_dev);
 
     ns = n->namespaces;
@@ -1889,6 +2783,9 @@ static void nvme_exit(PCIDevice *pci_dev)
     NvmeCtrl *n = NVME(pci_dev);
 
     nvme_clear_ctrl(n);
+    if (n->params.zoned) {
+        nvme_zoned_clear(n);
+    }
     g_free(n->namespaces);
     g_free(n->cq);
     g_free(n->sq);
@@ -1912,6 +2809,12 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT32("num_queues", NvmeCtrl, params.num_queues, 0),
     DEFINE_PROP_UINT32("max_ioqpairs", NvmeCtrl, params.max_ioqpairs, 64),
     DEFINE_PROP_UINT16("msix_qsize", NvmeCtrl, params.msix_qsize, 65),
+    DEFINE_PROP_BOOL("zoned", NvmeCtrl, params.zoned, false),
+    DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
+    DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),
+    DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
+    DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
+    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (9 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 10/18] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-07-01  0:26   ` Alistair Francis
  2020-07-01  6:41   ` Klaus Jensen
  2020-06-17 21:34 ` [PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions Dmitry Fomichev
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Added two module properties, "max_active" and "max_open", to control
the maximum number of zones that can be active or open. Once these
properties are set to non-zero values, the driver checks the limits
during I/O and returns Too Many Active Zones or Too Many Open Zones
status if a command would exceed them.
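
For illustration, a guest could be started with these limits as
follows (a sketch only; the drive definition, serial and the chosen
counts are placeholders, while "zoned", "max_active" and "max_open"
are the properties added by this series):

    -drive file=zns.img,id=nvme0,format=raw,if=none \
    -device nvme,drive=nvme0,serial=foo,zoned=true,max_active=16,max_open=8

With both properties left at their default of 0, no limit is enforced
and the zeroes-based MAR/MOR fields are reported as 0xffffffff.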

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++-
 hw/block/nvme.h |   4 ++
 2 files changed, 185 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 2e03b0b6ed..05a7cbcfcc 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -120,6 +120,87 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
     zone->prev = zone->next = 0;
 }
 
+/*
+ * Take the first zone out from a list, return NULL if the list is empty.
+ */
+static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZoneList *zl)
+{
+    NvmeZone *zone = nvme_peek_zone_head(ns, zl);
+
+    if (zone) {
+        --zl->size;
+        if (zl->size == 0) {
+            zl->head = NVME_ZONE_LIST_NIL;
+            zl->tail = NVME_ZONE_LIST_NIL;
+        } else {
+            zl->head = zone->next;
+            ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+        }
+        zone->prev = zone->next = 0;
+    }
+
+    return zone;
+}
+
+/*
+ * Check if we can open a zone without exceeding open/active limits.
+ * AOR stands for "Active and Open Resources" (see TP 4053 section 2.5).
+ */
+static int nvme_aor_check(NvmeCtrl *n, NvmeNamespace *ns,
+     uint32_t act, uint32_t opn)
+{
+    if (n->params.max_active_zones != 0 &&
+        ns->nr_active_zones + act > n->params.max_active_zones) {
+        trace_pci_nvme_err_insuff_active_res(n->params.max_active_zones);
+        return NVME_ZONE_TOO_MANY_ACTIVE | NVME_DNR;
+    }
+    if (n->params.max_open_zones != 0 &&
+        ns->nr_open_zones + opn > n->params.max_open_zones) {
+        trace_pci_nvme_err_insuff_open_res(n->params.max_open_zones);
+        return NVME_ZONE_TOO_MANY_OPEN | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static inline void nvme_aor_inc_open(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    assert(ns->nr_open_zones >= 0);
+    if (n->params.max_open_zones) {
+        ns->nr_open_zones++;
+        assert(ns->nr_open_zones <= n->params.max_open_zones);
+    }
+}
+
+static inline void nvme_aor_dec_open(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    if (n->params.max_open_zones) {
+        assert(ns->nr_open_zones > 0);
+        ns->nr_open_zones--;
+    }
+    assert(ns->nr_open_zones >= 0);
+}
+
+static inline void nvme_aor_inc_active(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    assert(ns->nr_active_zones >= 0);
+    if (n->params.max_active_zones) {
+        ns->nr_active_zones++;
+        assert(ns->nr_active_zones <= n->params.max_active_zones);
+    }
+}
+
+static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    if (n->params.max_active_zones) {
+        assert(ns->nr_active_zones > 0);
+        ns->nr_active_zones--;
+        assert(ns->nr_active_zones >= ns->nr_open_zones);
+    }
+    assert(ns->nr_active_zones >= 0);
+}
+
 static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeZone *zone, uint8_t state)
 {
@@ -454,6 +535,24 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
     timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
+    bool implicit, bool adding_active)
+{
+    NvmeZone *zone;
+
+    if (implicit && n->params.max_open_zones &&
+        ns->nr_open_zones == n->params.max_open_zones) {
+        zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
+        if (zone) {
+            /*
+             * Automatically close this implicitly open zone.
+             */
+            nvme_aor_dec_open(n, ns);
+            nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
+        }
+    }
+}
+
 static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
     uint32_t nlb)
 {
@@ -531,6 +630,23 @@ static uint16_t nvme_check_zone_read(NvmeCtrl *n, NvmeZone *zone, uint64_t slba,
     return status;
 }
 
+static uint16_t nvme_auto_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone)
+{
+    uint16_t status = NVME_SUCCESS;
+    uint8_t zs = nvme_get_zone_state(zone);
+
+    if (zs == NVME_ZONE_STATE_EMPTY) {
+        nvme_auto_transition_zone(n, ns, true, true);
+        status = nvme_aor_check(n, ns, 1, 1);
+    } else if (zs == NVME_ZONE_STATE_CLOSED) {
+        nvme_auto_transition_zone(n, ns, true, false);
+        status = nvme_aor_check(n, ns, 0, 1);
+    }
+
+    return status;
+}
+
 static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeZone *zone, uint32_t nlb)
 {
@@ -543,7 +659,11 @@ static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
         switch (zs) {
         case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            nvme_aor_dec_open(n, ns);
+            /* fall through */
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_dec_active(n, ns);
+            /* fall through */
         case NVME_ZONE_STATE_EMPTY:
             break;
         default:
@@ -553,7 +673,10 @@ static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
     } else {
         switch (zs) {
         case NVME_ZONE_STATE_EMPTY:
+            nvme_aor_inc_active(n, ns);
+            /* fall through */
         case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_inc_open(n, ns);
             nvme_assign_zone_state(n, ns, zone,
                                    NVME_ZONE_STATE_IMPLICITLY_OPEN);
         }
@@ -636,6 +759,11 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
                                                zone->d.wp);
             return NVME_ZONE_INVALID_WRITE | NVME_DNR;
         }
+
+        status = nvme_auto_open_zone(n, ns, zone);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
     }
 
     block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
@@ -709,6 +837,11 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
                                                    zone->d.wp);
                 return NVME_ZONE_INVALID_WRITE | NVME_DNR;
             }
+
+            status = nvme_auto_open_zone(n, ns, zone);
+            if (status != NVME_SUCCESS) {
+                return status;
+            }
         } else {
             status = nvme_check_zone_read(n, zone, slba, nlb,
                                           n->params.cross_zone_read);
@@ -804,9 +937,27 @@ static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeCtrl *n, NvmeNamespace *ns,
 static uint16_t nvme_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeZone *zone, uint8_t state)
 {
+    uint16_t status;
+
     switch (state) {
     case NVME_ZONE_STATE_EMPTY:
+        nvme_auto_transition_zone(n, ns, false, true);
+        status = nvme_aor_check(n, ns, 1, 0);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+        nvme_aor_inc_active(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        status = nvme_aor_check(n, ns, 0, 1);
+        if (status != NVME_SUCCESS) {
+            if (state == NVME_ZONE_STATE_EMPTY) {
+                nvme_aor_dec_active(n, ns);
+            }
+            return status;
+        }
+        nvme_aor_inc_open(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
         /* fall through */
@@ -828,6 +979,7 @@ static uint16_t nvme_close_zone(NvmeCtrl *n,  NvmeNamespace *ns,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(n, ns);
         nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
         /* fall through */
     case NVME_ZONE_STATE_CLOSED:
@@ -849,7 +1001,11 @@ static uint16_t nvme_finish_zone(NvmeCtrl *n, NvmeNamespace *ns,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        nvme_aor_dec_active(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_EMPTY:
         zone->d.wp = nvme_zone_wr_boundary(zone);
         nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
@@ -874,7 +1030,11 @@ static uint16_t nvme_reset_zone(NvmeCtrl *n, NvmeNamespace *ns,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        nvme_aor_dec_open(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_CLOSED:
+        nvme_aor_dec_active(n, ns);
+        /* fall through */
     case NVME_ZONE_STATE_FULL:
         zone->d.wp = zone->d.zslba;
         nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EMPTY);
@@ -2412,6 +2572,15 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
     uint64_t zone_size = 0, capacity;
     uint32_t nz;
 
+    if (n->params.max_open_zones < 0) {
+        error_setg(errp, "invalid max_open_zones value");
+        return;
+    }
+    if (n->params.max_active_zones < 0) {
+        error_setg(errp, "invalid max_active_zones value");
+        return;
+    }
+
     if (n->params.zone_size) {
         zone_size = n->params.zone_size;
     } else {
@@ -2435,6 +2604,14 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
     n->num_zones = nz;
     n->zone_array_size = sizeof(NvmeZone) * nz;
 
+    /* Make sure that the values of all Zoned Command Set properties are sane */
+    if (n->params.max_open_zones > nz) {
+        n->params.max_open_zones = nz;
+    }
+    if (n->params.max_active_zones > nz) {
+        n->params.max_active_zones = nz;
+    }
+
     return;
 }
 
@@ -2452,8 +2629,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     ns->id_ns_zoned = g_malloc0(sizeof(*ns->id_ns_zoned));
 
     /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
-    ns->id_ns_zoned->mar = 0xffffffff;
-    ns->id_ns_zoned->mor = 0xffffffff;
+    ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
+    ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
     ns->id_ns_zoned->zoc = 0;
     ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
 
@@ -2813,6 +2990,8 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
     DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),
     DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
+    DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
+    DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
     DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
     DEFINE_PROP_END_OF_LIST(),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 2c932b5e29..f5a4679702 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -19,6 +19,8 @@ typedef struct NvmeParams {
     uint32_t    zamds_bs;
     uint64_t    zone_size;
     uint64_t    zone_capacity;
+    int32_t     max_active_zones;
+    int32_t     max_open_zones;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -103,6 +105,8 @@ typedef struct NvmeNamespace {
     NvmeZoneList    *imp_open_zones;
     NvmeZoneList    *closed_zones;
     NvmeZoneList    *full_zones;
+    int32_t         nr_open_zones;
+    int32_t         nr_active_zones;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (10 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-07-01  0:30   ` Alistair Francis
  2020-07-01  6:12   ` Klaus Jensen
  2020-06-17 21:34 ` [PATCH v2 13/18] hw/block/nvme: Set Finish/Reset Zone Recommended attributes Dmitry Fomichev
                   ` (6 subsequent siblings)
  18 siblings, 2 replies; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Added a Boolean flag to turn on simulation of Zone Active Excursions.
If the flag, "active_excursions", is set to true, the driver tries to
finish one of the currently active zones when the maximum active zone
limit is about to be exceeded.
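
For example, excursions can be exercised with a command line along
these lines (a sketch; the drive definition and other properties are
omitted, and the values are arbitrary):

    -device nvme,drive=nvme0,serial=foo,zoned=true,max_active=4,active_excursions=true

With this configuration, activating a fifth zone makes the controller
pick a Closed zone, transition it to Full and mark it Finished By
Controller, as implemented in nvme_auto_transition_zone() below.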

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 24 +++++++++++++++++++++++-
 hw/block/nvme.h |  1 +
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 05a7cbcfcc..a29cbfcc96 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -540,6 +540,26 @@ static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
 {
     NvmeZone *zone;
 
+    if (n->params.active_excursions && adding_active &&
+        n->params.max_active_zones &&
+        ns->nr_active_zones == n->params.max_active_zones) {
+        zone = nvme_peek_zone_head(ns, ns->closed_zones);
+        if (zone) {
+            /*
+             * The namespace is at the limit of active zones.
+             * Try to finish one of the currently active zones
+             * to make the needed active zone resource available.
+             */
+            nvme_aor_dec_active(n, ns);
+            nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
+            zone->d.za &= ~(NVME_ZA_FINISH_RECOMMENDED |
+                            NVME_ZA_RESET_RECOMMENDED);
+            zone->d.za |= NVME_ZA_FINISHED_BY_CTLR;
+            zone->tstamp = 0;
+            trace_pci_nvme_zone_finished_by_controller(zone->d.zslba);
+        }
+    }
+
     if (implicit && n->params.max_open_zones &&
         ns->nr_open_zones == n->params.max_open_zones) {
         zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
@@ -2631,7 +2651,7 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
     ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
     ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
-    ns->id_ns_zoned->zoc = 0;
+    ns->id_ns_zoned->zoc = cpu_to_le16(n->params.active_excursions ? 0x2 : 0);
     ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
 
     ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->params.zone_size);
@@ -2993,6 +3013,8 @@ static Property nvme_props[] = {
     DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
     DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
+    DEFINE_PROP_BOOL("active_excursions", NvmeCtrl, params.active_excursions,
+                     false),
     DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index f5a4679702..8a0aaeb09a 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -15,6 +15,7 @@ typedef struct NvmeParams {
 
     bool        zoned;
     bool        cross_zone_read;
+    bool        active_excursions;
     uint8_t     fill_pattern;
     uint32_t    zamds_bs;
     uint64_t    zone_size;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 13/18] hw/block/nvme: Set Finish/Reset Zone Recommended attributes
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (11 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-07-01 16:23   ` Klaus Jensen
  2020-06-17 21:34 ` [PATCH v2 14/18] hw/block/nvme: Generate zone AENs Dmitry Fomichev
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Added logic to set and reset the FZR and RZR zone attributes. Four new
driver properties are added to control the timing of setting and
resetting these attributes. The FZR/RZR delay is the interval between
the zone operation and the moment the corresponding zone attribute is
set. The FZR/RZR limit is the period between setting the FZR or RZR
attribute and resetting it, simulating the internal controller action
on that zone.
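
A command line exercising these knobs could look as follows (a sketch;
the values are placeholders and, judging by the SCALE_MS conversion in
nvme_zoned_init_ctrl(), are interpreted as milliseconds):

    -device nvme,drive=nvme0,serial=foo,zoned=true,reset_rcmnd_delay=5000,reset_rcmnd_limit=10000,finish_rcmnd_delay=5000,finish_rcmnd_limit=10000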

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme.h | 13 ++++++-
 2 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a29cbfcc96..c3898448c7 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -201,6 +201,84 @@ static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
     assert(ns->nr_active_zones >= 0);
 }
 
+static void nvme_set_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+    assert(zone->flags & NVME_ZFLAGS_SET_RZR);
+    zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+    zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
+    zone->d.za |= NVME_ZA_RESET_RECOMMENDED;
+    zone->flags &= ~NVME_ZFLAGS_SET_RZR;
+    trace_pci_nvme_zone_reset_recommended(zone->d.zslba);
+}
+
+static void nvme_clear_rzr(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, bool notify)
+{
+    if (n->params.rrl_usec) {
+        zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
+        notify = notify && (zone->d.za & NVME_ZA_RESET_RECOMMENDED);
+        zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
+        zone->tstamp = 0;
+    }
+}
+
+static void nvme_set_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+    assert(zone->flags & NVME_ZFLAGS_SET_FZR);
+    zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+    zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
+    zone->d.za |= NVME_ZA_FINISH_RECOMMENDED;
+    zone->flags &= ~NVME_ZFLAGS_SET_FZR;
+    trace_pci_nvme_zone_finish_recommended(zone->d.zslba);
+}
+
+static void nvme_clear_fzr(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, bool notify)
+{
+    if (n->params.frl_usec) {
+        zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
+        notify = notify && (zone->d.za & NVME_ZA_FINISH_RECOMMENDED);
+        zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
+        zone->tstamp = 0;
+    }
+}
+
+static void nvme_schedule_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+    if (n->params.frl_usec) {
+        zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
+        zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
+        zone->tstamp = 0;
+    }
+    if (n->params.rrl_usec) {
+        zone->flags |= NVME_ZFLAGS_SET_RZR;
+        if (n->params.rzr_delay_usec) {
+            zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+            zone->flags |= NVME_ZFLAGS_TS_DELAY;
+        } else {
+            nvme_set_rzr(n, ns, zone);
+        }
+    }
+}
+
+static void nvme_schedule_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+    if (n->params.rrl_usec) {
+        zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
+        zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
+        zone->tstamp = 0;
+    }
+    if (n->params.frl_usec) {
+        zone->flags |= NVME_ZFLAGS_SET_FZR;
+        if (n->params.fzr_delay_usec) {
+            zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+            zone->flags |= NVME_ZFLAGS_TS_DELAY;
+        } else {
+            nvme_set_fzr(n, ns, zone);
+        }
+    }
+}
+
 static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeZone *zone, uint8_t state)
 {
@@ -208,15 +286,19 @@ static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
         switch (nvme_get_zone_state(zone)) {
         case NVME_ZONE_STATE_EXPLICITLY_OPEN:
             nvme_remove_zone(n, ns, ns->exp_open_zones, zone);
+            nvme_clear_fzr(n, ns, zone, false);
             break;
         case NVME_ZONE_STATE_IMPLICITLY_OPEN:
             nvme_remove_zone(n, ns, ns->imp_open_zones, zone);
+            nvme_clear_fzr(n, ns, zone, false);
             break;
         case NVME_ZONE_STATE_CLOSED:
             nvme_remove_zone(n, ns, ns->closed_zones, zone);
+            nvme_clear_fzr(n, ns, zone, false);
             break;
         case NVME_ZONE_STATE_FULL:
             nvme_remove_zone(n, ns, ns->full_zones, zone);
+            nvme_clear_rzr(n, ns, zone, false);
         }
    }
 
@@ -225,15 +307,19 @@ static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
     switch (state) {
     case NVME_ZONE_STATE_EXPLICITLY_OPEN:
         nvme_add_zone_tail(n, ns, ns->exp_open_zones, zone);
+        nvme_schedule_fzr(n, ns, zone);
         break;
     case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         nvme_add_zone_tail(n, ns, ns->imp_open_zones, zone);
+        nvme_schedule_fzr(n, ns, zone);
         break;
     case NVME_ZONE_STATE_CLOSED:
         nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
+        nvme_schedule_fzr(n, ns, zone);
         break;
     case NVME_ZONE_STATE_FULL:
         nvme_add_zone_tail(n, ns, ns->full_zones, zone);
+        nvme_schedule_rzr(n, ns, zone);
         break;
     default:
         zone->d.za = 0;
@@ -555,6 +641,7 @@ static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
             zone->d.za &= ~(NVME_ZA_FINISH_RECOMMENDED |
                             NVME_ZA_RESET_RECOMMENDED);
             zone->d.za |= NVME_ZA_FINISHED_BY_CTLR;
+            zone->flags = 0;
             zone->tstamp = 0;
             trace_pci_nvme_zone_finished_by_controller(zone->d.zslba);
         }
@@ -2624,6 +2711,11 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
     n->num_zones = nz;
     n->zone_array_size = sizeof(NvmeZone) * nz;
 
+    n->params.rzr_delay_usec *= SCALE_MS;
+    n->params.rrl_usec *= SCALE_MS;
+    n->params.fzr_delay_usec *= SCALE_MS;
+    n->params.frl_usec *= SCALE_MS;
+
     /* Make sure that the values of all Zoned Command Set properties are sane */
     if (n->params.max_open_zones > nz) {
         n->params.max_open_zones = nz;
@@ -2651,6 +2743,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
     ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
     ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
+    ns->id_ns_zoned->rrl = cpu_to_le32(n->params.rrl_usec / (1000 * SCALE_MS));
+    ns->id_ns_zoned->frl = cpu_to_le32(n->params.frl_usec / (1000 * SCALE_MS));
     ns->id_ns_zoned->zoc = cpu_to_le16(n->params.active_excursions ? 0x2 : 0);
     ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
 
@@ -3012,6 +3106,11 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
     DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
     DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
+    DEFINE_PROP_UINT64("reset_rcmnd_delay", NvmeCtrl, params.rzr_delay_usec, 0),
+    DEFINE_PROP_UINT64("reset_rcmnd_limit", NvmeCtrl, params.rrl_usec, 0),
+    DEFINE_PROP_UINT64("finish_rcmnd_delay", NvmeCtrl,
+                       params.fzr_delay_usec, 0),
+    DEFINE_PROP_UINT64("finish_rcmnd_limit", NvmeCtrl, params.frl_usec, 0),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
     DEFINE_PROP_BOOL("active_excursions", NvmeCtrl, params.active_excursions,
                      false),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 8a0aaeb09a..be1920f1ef 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -22,6 +22,10 @@ typedef struct NvmeParams {
     uint64_t    zone_capacity;
     int32_t     max_active_zones;
     int32_t     max_open_zones;
+    uint64_t    rzr_delay_usec;
+    uint64_t    rrl_usec;
+    uint64_t    fzr_delay_usec;
+    uint64_t    frl_usec;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -77,12 +81,19 @@ typedef struct NvmeCQueue {
     QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
 
+enum NvmeZoneFlags {
+    NVME_ZFLAGS_TS_DELAY = 1 << 0,
+    NVME_ZFLAGS_SET_RZR  = 1 << 1,
+    NVME_ZFLAGS_SET_FZR  = 1 << 2,
+};
+
 typedef struct NvmeZone {
     NvmeZoneDescr   d;
     uint64_t        tstamp;
+    uint32_t        flags;
     uint32_t        next;
     uint32_t        prev;
-    uint8_t         rsvd80[8];
+    uint8_t         rsvd84[4];
 } NvmeZone;
 
 #define NVME_ZONE_LIST_NIL    UINT_MAX
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 14/18] hw/block/nvme: Generate zone AENs
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (12 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 13/18] hw/block/nvme: Set Finish/Reset Zone Recommended attributes Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-07-01 11:44   ` Klaus Jensen
  2020-06-17 21:34 ` [PATCH v2 15/18] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Added an optional Boolean "zone_async_events" property to the driver.
Once it is turned on, the namespace sends "Zone Descriptor Changed"
asynchronous events to the host in the situations defined by the
protocol. In order to clear these AENs, the host needs to read the
newly added Changed Zones log page.
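
Assuming a Linux guest with nvme-cli, the host side could look roughly
like this (a sketch, not part of this series; 0x0b is the Asynchronous
Event Configuration feature identifier, bit 27 is the Zone Descriptor
Changed notice enable added below, and 0xbf is the new Changed Zones
log identifier):

    # enable Zone Descriptor Changed notices
    nvme set-feature /dev/nvme0 --feature-id=0x0b --value=0x08000000
    # read (and thereby clear) the 4 KiB Changed Zones log page
    nvme get-log /dev/nvme0 --log-id=0xbf --log-len=4096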

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c      | 300 ++++++++++++++++++++++++++++++++++++++++++-
 hw/block/nvme.h      |  13 +-
 include/block/nvme.h |  23 +++-
 3 files changed, 328 insertions(+), 8 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c3898448c7..b9135a6b1f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -201,12 +201,66 @@ static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
     assert(ns->nr_active_zones >= 0);
 }
 
+static bool nvme_complete_async_req(NvmeCtrl *n, NvmeNamespace *ns,
+    enum NvmeAsyncEventType type, uint8_t info)
+{
+    NvmeAsyncEvent *ae;
+    uint32_t nsid = 0;
+    uint8_t log_page = 0;
+
+    switch (type) {
+    case NVME_AER_TYPE_ERROR:
+    case NVME_AER_TYPE_SMART:
+        break;
+    case NVME_AER_TYPE_NOTICE:
+        switch (info) {
+        case NVME_AER_NOTICE_ZONE_DESCR_CHANGED:
+            log_page = NVME_LOG_ZONE_CHANGED_LIST;
+            nsid = ns->nsid;
+            if (!(n->ae_cfg & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES)) {
+                trace_pci_nvme_zone_ae_not_enabled(info, log_page, nsid);
+                return false;
+            }
+            if (ns->aen_pending) {
+                trace_pci_nvme_zone_ae_not_cleared(info, log_page, nsid);
+                return false;
+            }
+            ns->aen_pending = true;
+        }
+        break;
+    case NVME_AER_TYPE_CMDSET_SPECIFIC:
+    case NVME_AER_TYPE_VENDOR_SPECIFIC:
+        break;
+    }
+
+    ae = g_malloc0(sizeof(*ae));
+    ae->res = type;
+    ae->res |= (info << 8) & 0xff00;
+    ae->res |= (log_page << 16) & 0xff0000;
+    ae->nsid = nsid;
+
+    QTAILQ_INSERT_TAIL(&n->async_reqs, ae, entry);
+    timer_mod(n->admin_cq.timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+    return true;
+}
+
+static inline void nvme_notify_zone_changed(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone)
+{
+    if (n->ae_cfg) {
+        zone->flags |= NVME_ZFLAGS_AEN_PEND;
+        nvme_complete_async_req(n, ns, NVME_AER_TYPE_NOTICE,
+                                NVME_AER_NOTICE_ZONE_DESCR_CHANGED);
+    }
+}
+
 static void nvme_set_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
 {
     assert(zone->flags & NVME_ZFLAGS_SET_RZR);
     zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
     zone->d.za |= NVME_ZA_RESET_RECOMMENDED;
+    nvme_notify_zone_changed(n, ns, zone);
     zone->flags &= ~NVME_ZFLAGS_SET_RZR;
     trace_pci_nvme_zone_reset_recommended(zone->d.zslba);
 }
@@ -215,10 +269,14 @@ static void nvme_clear_rzr(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeZone *zone, bool notify)
 {
     if (n->params.rrl_usec) {
-        zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
+        zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY |
+                         NVME_ZFLAGS_AEN_PEND);
         notify = notify && (zone->d.za & NVME_ZA_RESET_RECOMMENDED);
         zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
         zone->tstamp = 0;
+        if (notify) {
+            nvme_notify_zone_changed(n, ns, zone);
+        }
     }
 }
 
@@ -228,6 +286,7 @@ static void nvme_set_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
     zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
     zone->d.za |= NVME_ZA_FINISH_RECOMMENDED;
+    nvme_notify_zone_changed(n, ns, zone);
     zone->flags &= ~NVME_ZFLAGS_SET_FZR;
     trace_pci_nvme_zone_finish_recommended(zone->d.zslba);
 }
@@ -236,13 +295,61 @@ static void nvme_clear_fzr(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeZone *zone, bool notify)
 {
     if (n->params.frl_usec) {
-        zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
+        zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY |
+                         NVME_ZFLAGS_AEN_PEND);
         notify = notify && (zone->d.za & NVME_ZA_FINISH_RECOMMENDED);
         zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
         zone->tstamp = 0;
+        if (notify) {
+            nvme_notify_zone_changed(n, ns, zone);
+        }
     }
 }
 
+static bool nvme_process_rrl(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+    if (zone->flags & NVME_ZFLAGS_SET_RZR) {
+        if (zone->flags & NVME_ZFLAGS_TS_DELAY) {
+            assert(!(zone->d.za & NVME_ZA_RESET_RECOMMENDED));
+            if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
+                n->params.rzr_delay_usec) {
+                nvme_set_rzr(n, ns, zone);
+                return true;
+            }
+        } else if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
+                   n->params.rrl_usec) {
+            assert(zone->d.za & NVME_ZA_RESET_RECOMMENDED);
+            nvme_clear_rzr(n, ns, zone, true);
+            trace_pci_nvme_zone_reset_internal_op(zone->d.zslba);
+            return true;
+        }
+    }
+
+    return false;
+}
+
+static bool nvme_process_frl(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
+{
+    if (zone->flags & NVME_ZFLAGS_SET_FZR) {
+        if (zone->flags & NVME_ZFLAGS_TS_DELAY) {
+            assert(!(zone->d.za & NVME_ZA_FINISH_RECOMMENDED));
+            if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
+                n->params.fzr_delay_usec) {
+                nvme_set_fzr(n, ns, zone);
+                return true;
+            }
+        } else if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
+                   n->params.frl_usec) {
+            assert(zone->d.za & NVME_ZA_FINISH_RECOMMENDED);
+            nvme_clear_fzr(n, ns, zone, true);
+            trace_pci_nvme_zone_finish_internal_op(zone->d.zslba);
+            return true;
+        }
+    }
+
+    return false;
+}
+
 static void nvme_schedule_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
 {
     if (n->params.frl_usec) {
@@ -279,6 +386,48 @@ static void nvme_schedule_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
     }
 }
 
+static void nvme_observe_ns_zone_time_limits(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    NvmeZone *zone;
+
+    if (n->params.frl_usec) {
+        for (zone = nvme_peek_zone_head(ns, ns->closed_zones);
+             zone;
+             zone = nvme_next_zone_in_list(ns, zone, ns->closed_zones)) {
+            nvme_process_frl(n, ns, zone);
+        }
+
+        for (zone = nvme_peek_zone_head(ns, ns->imp_open_zones);
+             zone;
+             zone = nvme_next_zone_in_list(ns, zone, ns->imp_open_zones)) {
+            nvme_process_frl(n, ns, zone);
+        }
+
+        for (zone = nvme_peek_zone_head(ns, ns->exp_open_zones);
+             zone;
+             zone = nvme_next_zone_in_list(ns, zone, ns->exp_open_zones)) {
+            nvme_process_frl(n, ns, zone);
+        }
+    }
+
+    if (n->params.rrl_usec) {
+        for (zone = nvme_peek_zone_head(ns, ns->full_zones);
+             zone;
+             zone = nvme_next_zone_in_list(ns, zone, ns->full_zones)) {
+            nvme_process_rrl(n, ns, zone);
+        }
+    }
+}
+
+static void nvme_observe_zone_time_limits(NvmeCtrl *n)
+{
+    int i;
+
+    for (i = 0; i < n->num_namespaces; i++) {
+        nvme_observe_ns_zone_time_limits(n, &n->namespaces[i]);
+    }
+}
+
 static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeZone *zone, uint8_t state)
 {
@@ -563,6 +712,7 @@ static void nvme_post_cqes(void *opaque)
     NvmeCQueue *cq = opaque;
     NvmeCtrl *n = cq->ctrl;
     NvmeRequest *req, *next;
+    NvmeAsyncEvent *ae;
 
     QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
         NvmeSQueue *sq;
@@ -572,8 +722,26 @@ static void nvme_post_cqes(void *opaque)
             break;
         }
 
+        ae = NULL;
+        if (req->flags & NVME_REQ_FLG_AER) {
+            if (likely(QTAILQ_EMPTY(&n->async_reqs))) {
+                continue;
+            } else {
+                ae = QTAILQ_FIRST(&n->async_reqs);
+                QTAILQ_REMOVE(&n->async_reqs, ae, entry);
+            }
+        }
+
         QTAILQ_REMOVE(&cq->req_list, req, entry);
         sq = req->sq;
+        if (unlikely(ae)) {
+            assert(!sq->sqid);
+            req->cqe.ae.info = cpu_to_le32(ae->res);
+            req->cqe.ae.nsid = cpu_to_le32(ae->nsid);
+            g_free(ae);
+            assert(n->nr_aers);
+            n->nr_aers--;
+        }
 
         req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
         req->cqe.sq_id = cpu_to_le16(sq->sqid);
@@ -587,6 +755,15 @@ static void nvme_post_cqes(void *opaque)
     if (cq->tail != cq->head) {
         nvme_irq_assert(n, cq);
     }
+
+    if (cq == &n->admin_cq &&
+        n->params.zoned && n->params.zone_async_events) {
+        nvme_observe_zone_time_limits(n);
+        if (timer_expired(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL))) {
+            timer_mod(cq->timer,
+                      qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 10 * SCALE_MS);
+        }
+    }
 }
 
 static void nvme_fill_data(QEMUSGList *qsg, QEMUIOVector *iov,
@@ -618,7 +795,9 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
     assert(cq->cqid == req->sq->cqid);
     QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
     QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
-    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+    if (!(req->flags & NVME_REQ_FLG_AER)) {
+        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
+    }
 }
 
 static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
@@ -643,6 +822,7 @@ static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
             zone->d.za |= NVME_ZA_FINISHED_BY_CTLR;
             zone->flags = 0;
             zone->tstamp = 0;
+            nvme_notify_zone_changed(n, ns, zone);
             trace_pci_nvme_zone_finished_by_controller(zone->d.zslba);
         }
     }
@@ -1978,6 +2158,10 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_get_feature_timestamp(n, cmd);
+    case NVME_ASYNCHRONOUS_EVENT_CONF:
+        result = cpu_to_le32(n->ae_cfg);
+        trace_pci_nvme_getfeat_aen_cfg(result);
+        break;
     case NVME_COMMAND_SET_PROFILE:
         result = 0;
         break;
@@ -2029,6 +2213,19 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return nvme_set_feature_timestamp(n, cmd);
         break;
 
+    case NVME_ASYNCHRONOUS_EVENT_CONF:
+        if (dw11 & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES) {
+            if (!(n->ae_cfg & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES)) {
+                trace_pci_nvme_zone_aen_not_requested(dw11);
+            } else {
+                trace_pci_nvme_setfeat_zone_info_aer_on();
+            }
+        } else if (n->ae_cfg & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES) {
+            trace_pci_nvme_setfeat_zone_info_aer_off();
+            n->ae_cfg &= ~NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES;
+        }
+        break;
+
     case NVME_COMMAND_SET_PROFILE:
         if (dw11 & 0x1ff) {
             trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
@@ -2043,6 +2240,18 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_async_req(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+    if (n->nr_aers >= NVME_MAX_ASYNC_EVENTS) {
+        return NVME_AER_LIMIT_EXCEEDED | NVME_DNR;
+    }
+
+    assert(!(req->flags & NVME_REQ_FLG_AER));
+    req->flags |= NVME_REQ_FLG_AER;
+    n->nr_aers++;
+    return NVME_SUCCESS;
+}
+
 static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
     uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len, uint8_t csi)
 {
@@ -2068,6 +2277,7 @@ static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
     iocs[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
     iocs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
     iocs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
+    iocs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
 
     if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
         iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
@@ -2086,6 +2296,67 @@ static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
     return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log, len, prp1, prp2);
 }
 
+static uint16_t nvme_handle_changed_zone_log(NvmeCtrl *n, NvmeCmd *cmd,
+    uint64_t prp1, uint64_t prp2, uint16_t nsid, uint64_t ofs, uint32_t len,
+    uint8_t csi, bool rae)
+{
+    NvmeNamespace *ns;
+    NvmeChangedZoneLog zc_log = {};
+    NvmeZone *zone;
+    uint64_t *zid_ptr = &zc_log.zone_ids[0];
+    uint64_t *zid_end = zid_ptr + ARRAY_SIZE(zc_log.zone_ids);
+    int i, nids = 0, num_aen_zones = 0;
+
+    trace_pci_nvme_changed_zone_log_read(nsid);
+
+    if (!n->params.zoned || !n->params.zone_async_events) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+        trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+    ns = &n->namespaces[nsid - 1];
+    if (csi != ns->csi) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    if (ofs != 0) {
+        trace_pci_nvme_err_invalid_changed_zone_list_offset(ofs);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+    if (len != sizeof(zc_log)) {
+        trace_pci_nvme_err_invalid_changed_zone_list_len(len);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zone = ns->zone_array;
+    for (i = 0; i < n->num_zones && zid_ptr < zid_end; i++, zone++) {
+        if (!(zone->flags & NVME_ZFLAGS_AEN_PEND)) {
+            continue;
+        }
+        num_aen_zones++;
+        if (zone->d.za) {
+            trace_pci_nvme_reporting_changed_zone(zone->d.zslba, zone->d.za);
+            *zid_ptr++ = cpu_to_le64(zone->d.zslba);
+            nids++;
+        }
+        if (!rae) {
+            zone->flags &= ~NVME_ZFLAGS_AEN_PEND;
+        }
+    }
+
+    if (num_aen_zones && !nids) {
+        trace_pci_nvme_empty_changed_zone_list();
+        nids = 0xffff;
+    }
+    zc_log.nr_zone_ids = cpu_to_le16(nids);
+    ns->aen_pending = false;
+
+    return nvme_dma_read_prp(n, (uint8_t *)&zc_log, len, prp1, prp2);
+}
+
 static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
 {
     uint64_t prp1 = le64_to_cpu(cmd->prp1);
@@ -2095,9 +2366,11 @@ static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
     uint64_t dw12 = le32_to_cpu(cmd->cdw12);
     uint64_t dw13 = le32_to_cpu(cmd->cdw13);
     uint64_t ofs = (dw13 << 32) | dw12;
+    uint32_t nsid = le32_to_cpu(cmd->nsid);
     uint32_t numdl, numdu, len;
     uint16_t lid = dw10 & 0xff;
     uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
+    bool rae = !!(dw10 & (1 << 15));
 
     numdl = dw10 >> 16;
     numdu = dw11 & 0xffff;
@@ -2106,6 +2379,9 @@ static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
     switch (lid) {
     case NVME_LOG_CMD_EFFECTS:
         return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len, csi);
+    case NVME_LOG_ZONE_CHANGED_LIST:
+        return nvme_handle_changed_zone_log(n, cmd, prp1, prp2, nsid,
+                                            ofs, len, csi, rae);
      }
 
     trace_pci_nvme_unsupported_log_page(lid);
@@ -2131,6 +2407,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return nvme_get_feature(n, cmd, req);
     case NVME_ADM_CMD_GET_LOG_PAGE:
         return nvme_get_log_page(n, cmd);
+    case NVME_ADM_CMD_ASYNC_EV_REQ:
+        return nvme_async_req(n, cmd, req);
     default:
         trace_pci_nvme_err_invalid_admin_opc(cmd->opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -2171,6 +2449,7 @@ static void nvme_process_sq(void *opaque)
 
 static void nvme_clear_ctrl(NvmeCtrl *n)
 {
+    NvmeAsyncEvent *ae_entry, *next;
     int i;
 
     blk_drain(n->conf.blk);
@@ -2186,6 +2465,11 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
         }
     }
 
+    QTAILQ_FOREACH_SAFE(ae_entry, &n->async_reqs, entry, next) {
+        g_free(ae_entry);
+    }
+    n->nr_aers = 0;
+
     blk_flush(n->conf.blk);
     n->bar.cc = 0;
 }
@@ -2290,6 +2574,9 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
     nvme_set_timestamp(n, 0ULL);
 
+    QTAILQ_INIT(&n->async_reqs);
+    n->nr_aers = 0;
+
     return 0;
 }
 
@@ -2724,6 +3011,10 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
         n->params.max_active_zones = nz;
     }
 
+    if (n->params.zone_async_events) {
+        n->ae_cfg |= NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES;
+    }
+
     return;
 }
 
@@ -2993,6 +3284,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->ieee[1] = 0x02;
     id->ieee[2] = 0xb3;
     id->oacs = cpu_to_le16(0);
+    id->oaes = cpu_to_le32(n->ae_cfg);
     id->frmw = 7 << 1;
     id->lpa = 1 << 1;
     id->sqes = (0x6 << 4) | 0x6;
@@ -3111,6 +3403,8 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT64("finish_rcmnd_delay", NvmeCtrl,
                        params.fzr_delay_usec, 0),
     DEFINE_PROP_UINT64("finish_rcmnd_limit", NvmeCtrl, params.frl_usec, 0),
+    DEFINE_PROP_BOOL("zone_async_events", NvmeCtrl, params.zone_async_events,
+                     true),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
     DEFINE_PROP_BOOL("active_excursions", NvmeCtrl, params.active_excursions,
                      false),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index be1920f1ef..e63f7736d7 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -3,6 +3,7 @@
 
 #include "block/nvme.h"
 
+#define NVME_MAX_ASYNC_EVENTS    16
 #define NVME_DEFAULT_ZONE_SIZE   128 /* MiB */
 #define NVME_DEFAULT_MAX_ZA_SIZE 128 /* KiB */
 
@@ -15,6 +16,7 @@ typedef struct NvmeParams {
 
     bool        zoned;
     bool        cross_zone_read;
+    bool        zone_async_events;
     bool        active_excursions;
     uint8_t     fill_pattern;
     uint32_t    zamds_bs;
@@ -29,13 +31,16 @@ typedef struct NvmeParams {
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
-    QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
+    QTAILQ_ENTRY(NvmeAsyncEvent) entry;
+    uint32_t                     res;
+    uint32_t                     nsid;
 } NvmeAsyncEvent;
 
 enum NvmeRequestFlags {
     NVME_REQ_FLG_HAS_SG   = 1 << 0,
     NVME_REQ_FLG_FILL     = 1 << 1,
     NVME_REQ_FLG_APPEND   = 1 << 2,
+    NVME_REQ_FLG_AER      = 1 << 3,
 };
 
 typedef struct NvmeRequest {
@@ -85,6 +90,7 @@ enum NvmeZoneFlags {
     NVME_ZFLAGS_TS_DELAY = 1 << 0,
     NVME_ZFLAGS_SET_RZR  = 1 << 1,
     NVME_ZFLAGS_SET_FZR  = 1 << 2,
+    NVME_ZFLAGS_AEN_PEND = 1 << 3,
 };
 
 typedef struct NvmeZone {
@@ -119,6 +125,7 @@ typedef struct NvmeNamespace {
     NvmeZoneList    *full_zones;
     int32_t         nr_open_zones;
     int32_t         nr_active_zones;
+    bool            aen_pending;
 } NvmeNamespace;
 
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
@@ -173,6 +180,10 @@ typedef struct NvmeCtrl {
     NvmeSQueue      admin_sq;
     NvmeCQueue      admin_cq;
     NvmeIdCtrl      id_ctrl;
+
+    QTAILQ_HEAD(, NvmeAsyncEvent) async_reqs;
+    uint32_t        nr_aers;
+    uint32_t        ae_cfg;
 } NvmeCtrl;
 
 /* calculate the number of LBAs that the namespace can accomodate */
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 596c39162b..e06fb97337 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -633,16 +633,22 @@ enum NvmeAsyncErrorInfo {
 
 enum NvmeAsyncNoticeInfo {
     NVME_AER_NOTICE_NS_CHANGED              = 0x00,
+    NVME_AER_NOTICE_ZONE_DESCR_CHANGED      = 0xef,
 };
 
 enum NvmeAsyncEventCfg {
     NVME_AEN_CFG_NS_ATTR                    = 1 << 8,
+    NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES   = 1 << 27,
 };
 
 typedef struct NvmeCqe {
     union {
         uint64_t     result64;
         uint32_t     result32;
+        struct {
+            uint32_t info;
+            uint32_t nsid;
+        } ae;
     };
     uint16_t    sq_head;
     uint16_t    sq_id;
@@ -778,11 +784,19 @@ enum {
    NVME_CMD_EFFECTS_UUID_SEL          = 1 << 19,
 };
 
+typedef struct NvmeChangedZoneLog {
+    uint16_t    nr_zone_ids;
+    uint8_t     rsvd2[6];
+    uint64_t    zone_ids[511];
+} NvmeChangedZoneLog;
+
 enum LogIdentifier {
-    NVME_LOG_ERROR_INFO     = 0x01,
-    NVME_LOG_SMART_INFO     = 0x02,
-    NVME_LOG_FW_SLOT_INFO   = 0x03,
-    NVME_LOG_CMD_EFFECTS    = 0x05,
+    NVME_LOG_ERROR_INFO               = 0x01,
+    NVME_LOG_SMART_INFO               = 0x02,
+    NVME_LOG_FW_SLOT_INFO             = 0x03,
+    NVME_LOG_CHANGED_NS_LIST          = 0x04,
+    NVME_LOG_CMD_EFFECTS              = 0x05,
+    NVME_LOG_ZONE_CHANGED_LIST        = 0xbf,
 };
 
 typedef struct NvmePSD {
@@ -1097,6 +1111,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZoned) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeChangedZoneLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
 }
 #endif
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 15/18] hw/block/nvme: Support Zone Descriptor Extensions
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (13 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 14/18] hw/block/nvme: Generate zone AENs Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-07-01 16:32   ` Klaus Jensen
  2020-06-17 21:34 ` [PATCH v2 16/18] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

A Zone Descriptor Extension is a label that can be assigned to a zone.
It can be set on an Empty zone and stays assigned until the zone is
reset.

This commit adds a new optional property, "zone_descr_ext_size", to
the driver. Its value must be a multiple of 64 bytes. If the value is
non-zero, it becomes possible to assign extensions of that size to any
Empty zone. The default value of this property is 0, so setting
extensions is disabled by default.
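
A minimal sketch enabling 64-byte extensions (the drive definition is
omitted):

    -device nvme,drive=nvme0,serial=foo,zoned=true,zone_descr_ext_size=64

A host can then attach an extension to an Empty zone with the Set Zone
Descriptor Extension zone send action, which also transitions the zone
to Closed, as implemented in nvme_set_zd_ext() below.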

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++---
 hw/block/nvme.h |  8 ++++++
 2 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b9135a6b1f..eb41081627 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1360,6 +1360,26 @@ static bool nvme_cond_offline_all(uint8_t state)
     return state == NVME_ZONE_STATE_READ_ONLY;
 }
 
+static uint16_t nvme_set_zd_ext(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, uint8_t state)
+{
+    uint16_t status;
+
+    if (state == NVME_ZONE_STATE_EMPTY) {
+        nvme_auto_transition_zone(n, ns, false, true);
+        status = nvme_aor_check(n, ns, 1, 0);
+        if (status != NVME_SUCCESS) {
+            return status;
+        }
+        nvme_aor_inc_active(n, ns);
+        zone->d.za |= NVME_ZA_ZD_EXT_VALID;
+        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
+        return NVME_SUCCESS;
+    }
+
+    return NVME_ZONE_INVAL_TRANSITION;
+}
+
 static uint16_t name_do_zone_op(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeZone *zone, uint8_t state, bool all,
     uint16_t (*op_hndlr)(NvmeCtrl *, NvmeNamespace *, NvmeZone *,
@@ -1388,13 +1408,16 @@ static uint16_t name_do_zone_op(NvmeCtrl *n, NvmeNamespace *ns,
 static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeNamespace *ns,
     NvmeCmd *cmd, NvmeRequest *req)
 {
+    NvmeRwCmd *rw;
     uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint64_t prp1, prp2;
     uint64_t slba = 0;
     uint64_t zone_idx = 0;
     uint16_t status;
     uint8_t action, state;
     bool all;
     NvmeZone *zone;
+    uint8_t *zd_ext;
 
     action = dw13 & 0xff;
     all = dw13 & 0x100;
@@ -1449,7 +1472,25 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeNamespace *ns,
 
     case NVME_ZONE_ACTION_SET_ZD_EXT:
         trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
-        return NVME_INVALID_FIELD | NVME_DNR;
+        if (all || !n->params.zd_extension_size) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+        zd_ext = nvme_get_zd_extension(n, ns, zone_idx);
+        rw = (NvmeRwCmd *)cmd;
+        prp1 = le64_to_cpu(rw->prp1);
+        prp2 = le64_to_cpu(rw->prp2);
+        status = nvme_dma_write_prp(n, zd_ext, n->params.zd_extension_size,
+                                    prp1, prp2);
+        if (status) {
+            trace_pci_nvme_err_zd_extension_map_error(zone_idx);
+            return status;
+        }
+
+        status = nvme_set_zd_ext(n, ns, zone, state);
+        if (status == NVME_SUCCESS) {
+            trace_pci_nvme_zd_extension_set(zone_idx);
+            return status;
+        }
         break;
 
     default:
@@ -1528,7 +1569,7 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeNamespace *ns,
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+    if (zra == NVME_ZONE_REPORT_EXTENDED && !n->params.zd_extension_size) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1540,6 +1581,9 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeNamespace *ns,
     partial = (dw13 >> 16) & 0x01;
 
     zone_entry_sz = sizeof(NvmeZoneDescr);
+    if (zra == NVME_ZONE_REPORT_EXTENDED) {
+        zone_entry_sz += n->params.zd_extension_size;
+    }
 
     max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
     buf = g_malloc0(len);
@@ -1571,6 +1615,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeNamespace *ns,
             z->wp = cpu_to_le64(~0ULL);
         }
 
+        if (zra == NVME_ZONE_REPORT_EXTENDED) {
+            if (zs->d.za & NVME_ZA_ZD_EXT_VALID) {
+                memcpy(buf_p, nvme_get_zd_extension(n, ns, zone_index),
+                       n->params.zd_extension_size);
+            }
+            buf_p += n->params.zd_extension_size;
+        }
+
         zone_index++;
     }
 
@@ -2337,7 +2389,7 @@ static uint16_t nvme_handle_changed_zone_log(NvmeCtrl *n, NvmeCmd *cmd,
             continue;
         }
         num_aen_zones++;
-        if (zone->d.za) {
+        if (zone->d.za & ~NVME_ZA_ZD_EXT_VALID) {
             trace_pci_nvme_reporting_changed_zone(zone->d.zslba, zone->d.za);
             *zid_ptr++ = cpu_to_le64(zone->d.zslba);
             nids++;
@@ -2936,6 +2988,7 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
     ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
     ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
     ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+    ns->zd_extensions = g_malloc0(n->params.zd_extension_size * n->num_zones);
     zone = ns->zone_array;
 
     nvme_init_zone_list(ns->exp_open_zones);
@@ -3010,6 +3063,17 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
     if (n->params.max_active_zones > nz) {
         n->params.max_active_zones = nz;
     }
+    if (n->params.zd_extension_size) {
+        if (n->params.zd_extension_size & 0x3f) {
+            error_setg(errp,
+                "zone descriptor extension size must be a multiple of 64B");
+            return;
+        }
+        if ((n->params.zd_extension_size >> 6) > 0xff) {
+            error_setg(errp, "zone descriptor extension size is too large");
+            return;
+        }
+    }
 
     if (n->params.zone_async_events) {
         n->ae_cfg |= NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES;
@@ -3040,7 +3104,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
     ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
 
     ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->params.zone_size);
-    ns->id_ns_zoned->lbafe[lba_index].zdes = 0;
+    ns->id_ns_zoned->lbafe[lba_index].zdes =
+        n->params.zd_extension_size >> 6; /* Units of 64B */
 
     if (n->params.fill_pattern == 0) {
         ns->id_ns.dlfeat = 0x01;
@@ -3063,6 +3128,7 @@ static void nvme_zoned_clear(NvmeCtrl *n)
         g_free(ns->imp_open_zones);
         g_free(ns->closed_zones);
         g_free(ns->full_zones);
+        g_free(ns->zd_extensions);
     }
 }
 
@@ -3396,6 +3462,8 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
     DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),
     DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
+    DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeCtrl,
+                       params.zd_extension_size, 0),
     DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
     DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
     DEFINE_PROP_UINT64("reset_rcmnd_delay", NvmeCtrl, params.rzr_delay_usec, 0),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e63f7736d7..4251295917 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -24,6 +24,7 @@ typedef struct NvmeParams {
     uint64_t    zone_capacity;
     int32_t     max_active_zones;
     int32_t     max_open_zones;
+    uint32_t    zd_extension_size;
     uint64_t    rzr_delay_usec;
     uint64_t    rrl_usec;
     uint64_t    fzr_delay_usec;
@@ -123,6 +124,7 @@ typedef struct NvmeNamespace {
     NvmeZoneList    *imp_open_zones;
     NvmeZoneList    *closed_zones;
     NvmeZoneList    *full_zones;
+    uint8_t         *zd_extensions;
     int32_t         nr_open_zones;
     int32_t         nr_active_zones;
     bool            aen_pending;
@@ -221,6 +223,12 @@ static inline bool nvme_wp_is_valid(NvmeZone *zone)
            st != NVME_ZONE_STATE_OFFLINE;
 }
 
+static inline uint8_t *nvme_get_zd_extension(NvmeCtrl *n,
+    NvmeNamespace *ns, uint32_t zone_idx)
+{
+    return &ns->zd_extensions[zone_idx * n->params.zd_extension_size];
+}
+
 /*
  * Initialize a zone list head.
  */
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 16/18] hw/block/nvme: Add injection of Offline/Read-Only zones
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (14 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 15/18] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-06-17 21:34 ` [PATCH v2 17/18] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

The ZNS specification defines two zone conditions for zones that can
no longer function properly, possibly because of flash wear or other
internal faults. It is useful to be able to "inject" a small number of
such zones for testing purposes.

This commit defines two optional driver properties, "offline_zones"
and "rdonly_zones". Users can assign non-zero values to these
properties to specify the number of zones to be initialized as Offline
or Read-Only. The actual number of injected zones may be smaller than
the requested amount since the Read-Only and Offline counts are
expected to be much smaller than the total number of drive zones.
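
As a quick illustration, a hypothetical command line (the drive id,
serial and zone counts below are made-up values) could inject such
zones with:

  -drive file=zns.img,id=nvme0,format=raw,if=none \
  -device nvme,drive=nvme0,serial=dead0001,zoned=true, \
          offline_zones=2,rdonly_zones=3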

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme.h |  2 ++
 2 files changed, 48 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index eb41081627..14d5f1d155 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -2980,8 +2980,11 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
     uint64_t capacity)
 {
     NvmeZone *zone;
+    Error *err = NULL;
     uint64_t start = 0, zone_size = n->params.zone_size;
+    uint32_t rnd;
     int i;
+    uint16_t zs;
 
     ns->zone_array = g_malloc0(n->zone_array_size);
     ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
@@ -3011,6 +3014,37 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
         start += zone_size;
     }
 
+    /* If required, make some zones Offline or Read-Only */
+
+    for (i = 0; i < n->params.nr_offline_zones; i++) {
+        do {
+            qcrypto_random_bytes(&rnd, sizeof(rnd), &err);
+            rnd %= n->num_zones;
+        } while (rnd < n->params.max_open_zones);
+        zone = &ns->zone_array[rnd];
+        zs = nvme_get_zone_state(zone);
+        if (zs != NVME_ZONE_STATE_OFFLINE) {
+            nvme_set_zone_state(zone, NVME_ZONE_STATE_OFFLINE);
+        } else {
+            i--;
+        }
+    }
+
+    for (i = 0; i < n->params.nr_rdonly_zones; i++) {
+        do {
+            qcrypto_random_bytes(&rnd, sizeof(rnd), &err);
+            rnd %= n->num_zones;
+        } while (rnd < n->params.max_open_zones);
+        zone = &ns->zone_array[rnd];
+        zs = nvme_get_zone_state(zone);
+        if (zs != NVME_ZONE_STATE_OFFLINE &&
+            zs != NVME_ZONE_STATE_READ_ONLY) {
+            nvme_set_zone_state(zone, NVME_ZONE_STATE_READ_ONLY);
+        } else {
+            i--;
+        }
+    }
+
     return 0;
 }
 
@@ -3063,6 +3097,16 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
     if (n->params.max_active_zones > nz) {
         n->params.max_active_zones = nz;
     }
+    if (n->params.max_open_zones < nz) {
+        if (n->params.nr_offline_zones > nz - n->params.max_open_zones) {
+            n->params.nr_offline_zones = nz - n->params.max_open_zones;
+        }
+        if (n->params.nr_rdonly_zones >
+            nz - n->params.max_open_zones - n->params.nr_offline_zones) {
+            n->params.nr_rdonly_zones =
+                nz - n->params.max_open_zones - n->params.nr_offline_zones;
+        }
+    }
     if (n->params.zd_extension_size) {
         if (n->params.zd_extension_size & 0x3f) {
             error_setg(errp,
@@ -3471,6 +3515,8 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT64("finish_rcmnd_delay", NvmeCtrl,
                        params.fzr_delay_usec, 0),
     DEFINE_PROP_UINT64("finish_rcmnd_limit", NvmeCtrl, params.frl_usec, 0),
+    DEFINE_PROP_UINT32("offline_zones", NvmeCtrl, params.nr_offline_zones, 0),
+    DEFINE_PROP_UINT32("rdonly_zones", NvmeCtrl, params.nr_rdonly_zones, 0),
     DEFINE_PROP_BOOL("zone_async_events", NvmeCtrl, params.zone_async_events,
                      true),
     DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 4251295917..900fc54809 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -24,6 +24,8 @@ typedef struct NvmeParams {
     uint64_t    zone_capacity;
     int32_t     max_active_zones;
     int32_t     max_open_zones;
+    uint32_t    nr_offline_zones;
+    uint32_t    nr_rdonly_zones;
     uint32_t    zd_extension_size;
     uint64_t    rzr_delay_usec;
     uint64_t    rrl_usec;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 17/18] hw/block/nvme: Use zone metadata file for persistence
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (15 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 16/18] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-07-01 17:26   ` Klaus Jensen
  2020-06-17 21:34 ` [PATCH v2 18/18] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
  2020-06-29 20:26 ` [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
  18 siblings, 1 reply; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

A ZNS drive that is emulated by this driver is currently initialized
with all zones Empty upon startup. However, actual ZNS SSDs save the
state and condition of all zones in their internal NVRAM in the event
of power loss. When such a drive is powered up again, it closes or
finishes all zones that were open at the moment of shutdown. Besides
that, the write pointer position as well as the state and condition
of all zones is preserved across power-downs.

This commit adds the capability to persist zone metadata in the
driver. A new optional driver property, "zone_file", is introduced.
If added to the command line, this property specifies the name of the
file that stores the zone metadata. If "zone_file" is omitted, the
driver initializes with all zones Empty, the same as before.

If zone metadata is configured to be persistent, then zone descriptor
extensions also persist across controller shutdowns.
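
As a sketch of the intended usage (the file and drive names here are
made-up placeholders), persistence is enabled by pointing "zone_file"
at a writable path:

  -drive file=zns.img,id=nvme0,format=raw,if=none \
  -device nvme,drive=nvme0,serial=dead0001,zoned=true, \
          zone_file=zns-meta.bin

Restarting the guest with the same "zone_file" should then bring the
zones up in the closed/finished states described above rather than
all Empty.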

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 371 +++++++++++++++++++++++++++++++++++++++++++++---
 hw/block/nvme.h |  38 +++++
 2 files changed, 388 insertions(+), 21 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 14d5f1d155..63e7a6352e 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -69,6 +69,8 @@
     } while (0)
 
 static void nvme_process_sq(void *opaque);
+static void nvme_sync_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, int len);
 
 /*
  * Add a zone to the tail of a zone list.
@@ -90,6 +92,7 @@ static void nvme_add_zone_tail(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
         zl->tail = idx;
     }
     zl->size++;
+    nvme_set_zone_meta_dirty(n, ns, true);
 }
 
 /*
@@ -106,12 +109,15 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
     if (zl->size == 0) {
         zl->head = NVME_ZONE_LIST_NIL;
         zl->tail = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(n, ns, true);
     } else if (idx == zl->head) {
         zl->head = zone->next;
         ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(n, ns, true);
     } else if (idx == zl->tail) {
         zl->tail = zone->prev;
         ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
+        nvme_set_zone_meta_dirty(n, ns, true);
     } else {
         ns->zone_array[zone->next].prev = zone->prev;
         ns->zone_array[zone->prev].next = zone->next;
@@ -138,6 +144,7 @@ static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, NvmeNamespace *ns,
             ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
         }
         zone->prev = zone->next = 0;
+        nvme_set_zone_meta_dirty(n, ns, true);
     }
 
     return zone;
@@ -476,6 +483,7 @@ static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
     case NVME_ZONE_STATE_READ_ONLY:
         zone->tstamp = 0;
     }
+    nvme_sync_zone_file(n, ns, zone, sizeof(NvmeZone));
 }
 
 static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
@@ -2976,9 +2984,114 @@ static const MemoryRegionOps nvme_cmb_ops = {
     },
 };
 
-static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
+static int nvme_validate_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
     uint64_t capacity)
 {
+    NvmeZoneMeta *meta = ns->zone_meta;
+    NvmeZone *zone = ns->zone_array;
+    uint64_t start = 0, zone_size = n->params.zone_size;
+    int i, n_imp_open = 0, n_exp_open = 0, n_closed = 0, n_full = 0;
+
+    if (meta->magic != NVME_ZONE_META_MAGIC) {
+        return 1;
+    }
+    if (meta->version != NVME_ZONE_META_VER) {
+        return 2;
+    }
+    if (meta->zone_size != zone_size) {
+        return 3;
+    }
+    if (meta->zone_capacity != n->params.zone_capacity) {
+        return 4;
+    }
+    if (meta->nr_offline_zones != n->params.nr_offline_zones) {
+        return 5;
+    }
+    if (meta->nr_rdonly_zones != n->params.nr_rdonly_zones) {
+        return 6;
+    }
+    if (meta->lba_size != n->conf.logical_block_size) {
+        return 7;
+    }
+    if (meta->zd_extension_size != n->params.zd_extension_size) {
+        return 8;
+    }
+
+    for (i = 0; i < n->num_zones; i++, zone++) {
+        if (start + zone_size > capacity) {
+            zone_size = capacity - start;
+        }
+        if (zone->d.zt != NVME_ZONE_TYPE_SEQ_WRITE) {
+            return 9;
+        }
+        if (zone->d.zcap != n->params.zone_capacity) {
+            return 10;
+        }
+        if (zone->d.zslba != start) {
+            return 11;
+        }
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_EMPTY:
+        case NVME_ZONE_STATE_OFFLINE:
+        case NVME_ZONE_STATE_READ_ONLY:
+            if (zone->d.wp != start) {
+                return 12;
+            }
+            break;
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_imp_open++;
+            break;
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_exp_open++;
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            if (zone->d.wp < start ||
+                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
+                return 13;
+            }
+            n_closed++;
+            break;
+        case NVME_ZONE_STATE_FULL:
+            if (zone->d.wp != zone->d.zslba + zone->d.zcap) {
+                return 14;
+            }
+            n_full++;
+            break;
+        default:
+            return 15;
+        }
+
+        start += zone_size;
+    }
+
+    if (n_imp_open != nvme_zone_list_size(ns->imp_open_zones)) {
+        return 16;
+    }
+    if (n_exp_open != nvme_zone_list_size(ns->exp_open_zones)) {
+        return 17;
+    }
+    if (n_closed != nvme_zone_list_size(ns->closed_zones)) {
+        return 18;
+    }
+    if (n_full != nvme_zone_list_size(ns->full_zones)) {
+        return 19;
+    }
+
+    return 0;
+}
+
+static int nvme_init_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
+    uint64_t capacity)
+{
+    NvmeZoneMeta *meta = ns->zone_meta;
     NvmeZone *zone;
     Error *err = NULL;
     uint64_t start = 0, zone_size = n->params.zone_size;
@@ -2986,18 +3099,33 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
     int i;
     uint16_t zs;
 
-    ns->zone_array = g_malloc0(n->zone_array_size);
-    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
-    ns->zd_extensions = g_malloc0(n->params.zd_extension_size * n->num_zones);
+    if (n->params.zone_file) {
+        meta->magic = NVME_ZONE_META_MAGIC;
+        meta->version = NVME_ZONE_META_VER;
+        meta->zone_size = zone_size;
+        meta->zone_capacity = n->params.zone_capacity;
+        meta->lba_size = n->conf.logical_block_size;
+        meta->nr_offline_zones = n->params.nr_offline_zones;
+        meta->nr_rdonly_zones = n->params.nr_rdonly_zones;
+        meta->zd_extension_size = n->params.zd_extension_size;
+    } else {
+        ns->zone_array = g_malloc0(n->zone_array_size);
+        ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
+        ns->zd_extensions =
+            g_malloc0(n->params.zd_extension_size * n->num_zones);
+    }
     zone = ns->zone_array;
 
     nvme_init_zone_list(ns->exp_open_zones);
     nvme_init_zone_list(ns->imp_open_zones);
     nvme_init_zone_list(ns->closed_zones);
     nvme_init_zone_list(ns->full_zones);
+    if (n->params.zone_file) {
+        nvme_set_zone_meta_dirty(n, ns, true);
+    }
 
     for (i = 0; i < n->num_zones; i++, zone++) {
         if (start + zone_size > capacity) {
@@ -3048,7 +3176,189 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
     return 0;
 }
 
-static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
+static int nvme_open_zone_file(NvmeCtrl *n, bool *init_meta)
+{
+    struct stat statbuf;
+    size_t fsize;
+    int ret;
+
+    ret = stat(n->params.zone_file, &statbuf);
+    if (ret && errno == ENOENT) {
+        *init_meta = true;
+    } else if (ret) {
+        fprintf(stderr, "can't stat zone file %s, err %s\n",
+                n->params.zone_file, strerror(errno));
+        return -1;
+    } else if (!S_ISREG(statbuf.st_mode)) {
+        fprintf(stderr, "%s is not a regular file\n", n->params.zone_file);
+        return -1;
+    }
+
+    n->zone_file_fd = open(n->params.zone_file,
+                           O_RDWR | O_LARGEFILE | O_BINARY | O_CREAT, 0644);
+    if (n->zone_file_fd < 0) {
+        fprintf(stderr, "failed to create zone file %s, err %s\n",
+                n->params.zone_file, strerror(errno));
+        return -1;
+    }
+
+    fsize = n->meta_size * n->num_namespaces;
+
+    if (stat(n->params.zone_file, &statbuf)) {
+        fprintf(stderr, "can't stat zone file %s, err %s\n",
+                n->params.zone_file, strerror(errno));
+        return -1;
+    }
+    if (statbuf.st_size != fsize) {
+        ret = ftruncate(n->zone_file_fd, fsize);
+        if (ret < 0) {
+            fprintf(stderr, "can't truncate zone file %s, err %s\n",
+                    n->params.zone_file, strerror(errno));
+            return -1;
+        }
+        *init_meta = true;
+    }
+
+    return 0;
+}
+
+static int nvme_map_zone_file(NvmeCtrl *n, NvmeNamespace *ns, bool *init_meta)
+{
+    off_t meta_ofs = n->meta_size * (ns->nsid - 1);
+
+    ns->zone_meta = mmap(0, n->meta_size, PROT_READ | PROT_WRITE,
+                         MAP_SHARED, n->zone_file_fd, meta_ofs);
+    if (ns->zone_meta == MAP_FAILED) {
+        fprintf(stderr, "failed to map zone file %s, ofs %lu, err %s\n",
+                n->params.zone_file, meta_ofs, strerror(errno));
+        return -1;
+    }
+
+    ns->zone_array = (NvmeZone *)(ns->zone_meta + 1);
+    ns->exp_open_zones = &ns->zone_meta->exp_open_zones;
+    ns->imp_open_zones = &ns->zone_meta->imp_open_zones;
+    ns->closed_zones = &ns->zone_meta->closed_zones;
+    ns->full_zones = &ns->zone_meta->full_zones;
+
+    if (n->params.zd_extension_size) {
+        ns->zd_extensions = (uint8_t *)(ns->zone_meta + 1);
+        ns->zd_extensions += n->zone_array_size;
+    }
+
+    return 0;
+}
+
+static void nvme_sync_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
+    NvmeZone *zone, int len)
+{
+    uintptr_t addr, zd = (uintptr_t)zone;
+
+    addr = zd & qemu_real_host_page_mask;
+    len += zd - addr;
+    if (msync((void *)addr, len, MS_ASYNC) < 0) {
+        fprintf(stderr, "msync: failed to sync zone descriptors, err %s\n",
+                strerror(errno));
+    }
+
+    if (nvme_zone_meta_dirty(n, ns)) {
+        nvme_set_zone_meta_dirty(n, ns, false);
+        if (msync(ns->zone_meta, sizeof(NvmeZoneMeta), MS_ASYNC) < 0) {
+            fprintf(stderr, "msync: failed to sync zone meta, err %s\n",
+                    strerror(errno));
+        }
+    }
+}
+
+/*
+ * Close or finish all the zones that might still be open after power-down.
+ */
+static void nvme_prepare_zones(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    NvmeZone *zone;
+    uint32_t set_state;
+    int i;
+
+    assert(!ns->nr_active_zones);
+    assert(!ns->nr_open_zones);
+
+    zone = ns->zone_array;
+    for (i = 0; i < n->num_zones; i++, zone++) {
+        zone->flags = 0;
+        zone->tstamp = 0;
+
+        switch (nvme_get_zone_state(zone)) {
+        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
+        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
+            break;
+        case NVME_ZONE_STATE_CLOSED:
+            nvme_aor_inc_active(n, ns);
+            /* fall through */
+        default:
+            continue;
+        }
+
+        if (zone->d.za & NVME_ZA_ZD_EXT_VALID) {
+            set_state = NVME_ZONE_STATE_CLOSED;
+        } else if (zone->d.wp == zone->d.zslba) {
+            set_state = NVME_ZONE_STATE_EMPTY;
+        } else if (n->params.max_active_zones == 0 ||
+                   ns->nr_active_zones < n->params.max_active_zones) {
+            set_state = NVME_ZONE_STATE_CLOSED;
+        } else {
+            set_state = NVME_ZONE_STATE_FULL;
+        }
+
+        switch (set_state) {
+        case NVME_ZONE_STATE_CLOSED:
+            trace_pci_nvme_power_on_close(nvme_get_zone_state(zone),
+                                          zone->d.zslba);
+            nvme_aor_inc_active(n, ns);
+            nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
+            break;
+        case NVME_ZONE_STATE_EMPTY:
+            trace_pci_nvme_power_on_reset(nvme_get_zone_state(zone),
+                                          zone->d.zslba);
+            break;
+        case NVME_ZONE_STATE_FULL:
+            trace_pci_nvme_power_on_full(nvme_get_zone_state(zone),
+                                         zone->d.zslba);
+            zone->d.wp = nvme_zone_wr_boundary(zone);
+        }
+
+        nvme_set_zone_state(zone, set_state);
+    }
+}
+
+static int nvme_load_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
+    uint64_t capacity, bool init_meta)
+{
+    int ret = 0;
+
+    if (n->params.zone_file) {
+        ret = nvme_map_zone_file(n, ns, &init_meta);
+        trace_pci_nvme_mapped_zone_file(n->params.zone_file, ret);
+        if (ret < 0) {
+            return ret;
+        }
+
+        if (!init_meta) {
+            ret = nvme_validate_zone_file(n, ns, capacity);
+            if (ret) {
+                trace_pci_nvme_err_zone_file_invalid(ret);
+                init_meta = true;
+            }
+        }
+    } else {
+        init_meta = true;
+    }
+
+    if (init_meta) {
+        ret = nvme_init_zone_file(n, ns, capacity);
+    } else {
+        nvme_prepare_zones(n, ns);
+    }
+    if (!ret && n->params.zone_file) {
+        nvme_sync_zone_file(n, ns, ns->zone_array, n->zone_array_size);
+    }
+
+    return ret;
+}
+
+static void nvme_zoned_init_ctrl(NvmeCtrl *n, bool *init_meta, Error **errp)
 {
     uint64_t zone_size = 0, capacity;
     uint32_t nz;
@@ -3084,6 +3394,9 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
     nz = DIV_ROUND_UP(capacity, zone_size);
     n->num_zones = nz;
     n->zone_array_size = sizeof(NvmeZone) * nz;
+    n->meta_size = sizeof(NvmeZoneMeta) + n->zone_array_size +
+                          nz * n->params.zd_extension_size;
+    n->meta_size = ROUND_UP(n->meta_size, qemu_real_host_page_size);
 
     n->params.rzr_delay_usec *= SCALE_MS;
     n->params.rrl_usec *= SCALE_MS;
@@ -3119,6 +3432,13 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
         }
     }
 
+    if (n->params.zone_file) {
+        if (nvme_open_zone_file(n, init_meta) < 0) {
+            error_setg(errp, "cannot open zone metadata file");
+            return;
+        }
+    }
+
     if (n->params.zone_async_events) {
         n->ae_cfg |= NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES;
     }
@@ -3127,13 +3447,14 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
 }
 
 static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
-    Error **errp)
+    bool init_meta, Error **errp)
 {
     int ret;
 
-    ret = nvme_init_zone_meta(n, ns, n->num_zones * n->params.zone_size);
+    ret = nvme_load_zone_meta(n, ns, n->num_zones * n->params.zone_size,
+                              init_meta);
     if (ret) {
-        error_setg(errp, "could not init zone metadata");
+        error_setg(errp, "could not load/init zone metadata");
         return -1;
     }
 
@@ -3164,15 +3485,20 @@ static void nvme_zoned_clear(NvmeCtrl *n)
 {
     int i;
 
+    if (n->params.zone_file) {
+        close(n->zone_file_fd);
+    }
     for (i = 0; i < n->num_namespaces; i++) {
         NvmeNamespace *ns = &n->namespaces[i];
         g_free(ns->id_ns_zoned);
-        g_free(ns->zone_array);
-        g_free(ns->exp_open_zones);
-        g_free(ns->imp_open_zones);
-        g_free(ns->closed_zones);
-        g_free(ns->full_zones);
-        g_free(ns->zd_extensions);
+        if (!n->params.zone_file) {
+            g_free(ns->zone_array);
+            g_free(ns->exp_open_zones);
+            g_free(ns->imp_open_zones);
+            g_free(ns->closed_zones);
+            g_free(ns->full_zones);
+            g_free(ns->zd_extensions);
+        }
     }
 }
 
@@ -3258,7 +3584,8 @@ static void nvme_init_blk(NvmeCtrl *n, Error **errp)
     n->ns_size = bs_size;
 }
 
-static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, bool init_meta,
+    Error **errp)
 {
     NvmeIdNs *id_ns = &ns->id_ns;
     int lba_index;
@@ -3272,7 +3599,7 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     if (n->params.zoned) {
         ns->csi = NVME_CSI_ZONED;
         id_ns->ncap = cpu_to_le64(n->params.zone_capacity * n->num_zones);
-        if (nvme_zoned_init_ns(n, ns, lba_index, errp) != 0) {
+        if (nvme_zoned_init_ns(n, ns, lba_index, init_meta, errp) != 0) {
             return;
         }
     } else {
@@ -3429,6 +3756,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     NvmeCtrl *n = NVME(pci_dev);
     NvmeNamespace *ns;
     Error *local_err = NULL;
+    bool init_meta = false;
 
     int i;
 
@@ -3452,7 +3780,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     }
 
     if (n->params.zoned) {
-        nvme_zoned_init_ctrl(n, &local_err);
+        nvme_zoned_init_ctrl(n, &init_meta, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -3463,7 +3791,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     ns = n->namespaces;
     for (i = 0; i < n->num_namespaces; i++, ns++) {
         ns->nsid = i + 1;
-        nvme_init_namespace(n, ns, &local_err);
+        nvme_init_namespace(n, ns, init_meta, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -3506,6 +3834,7 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
     DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),
     DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
+    DEFINE_PROP_STRING("zone_file", NvmeCtrl, params.zone_file),
     DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeCtrl,
                        params.zd_extension_size, 0),
     DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 900fc54809..5e9a3a62f7 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -14,6 +14,7 @@ typedef struct NvmeParams {
     uint16_t msix_qsize;
     uint32_t cmb_size_mb;
 
+    char        *zone_file;
     bool        zoned;
     bool        cross_zone_read;
     bool        zone_async_events;
@@ -114,6 +115,27 @@ typedef struct NvmeZoneList {
     uint8_t         rsvd12[4];
 } NvmeZoneList;
 
+#define NVME_ZONE_META_MAGIC 0x3aebaa70
+#define NVME_ZONE_META_VER  1
+
+typedef struct NvmeZoneMeta {
+    uint32_t        magic;
+    uint32_t        version;
+    uint64_t        zone_size;
+    uint64_t        zone_capacity;
+    uint32_t        nr_offline_zones;
+    uint32_t        nr_rdonly_zones;
+    uint32_t        lba_size;
+    uint32_t        rsvd40;
+    NvmeZoneList    exp_open_zones;
+    NvmeZoneList    imp_open_zones;
+    NvmeZoneList    closed_zones;
+    NvmeZoneList    full_zones;
+    uint32_t        zd_extension_size;
+    uint8_t         dirty;
+    uint8_t         rsvd[3987];
+} NvmeZoneMeta;
+
 typedef struct NvmeNamespace {
     NvmeIdNs        id_ns;
     uint32_t        nsid;
@@ -122,6 +144,7 @@ typedef struct NvmeNamespace {
 
     NvmeIdNsZoned   *id_ns_zoned;
     NvmeZone        *zone_array;
+    NvmeZoneMeta    *zone_meta;
     NvmeZoneList    *exp_open_zones;
     NvmeZoneList    *imp_open_zones;
     NvmeZoneList    *closed_zones;
@@ -174,6 +197,7 @@ typedef struct NvmeCtrl {
 
     int             zone_file_fd;
     uint32_t        num_zones;
+    size_t          meta_size;
     uint64_t        zone_size_bs;
     uint64_t        zone_array_size;
     uint8_t         zamds;
@@ -282,6 +306,19 @@ static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
     return &ns->zone_array[z->next];
 }
 
+static inline bool nvme_zone_meta_dirty(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    return n->params.zone_file ? ns->zone_meta->dirty : false;
+}
+
+static inline void nvme_set_zone_meta_dirty(NvmeCtrl *n, NvmeNamespace *ns,
+    bool yesno)
+{
+    if (n->params.zone_file) {
+        ns->zone_meta->dirty = yesno;
+    }
+}
+
 static inline int nvme_ilog2(uint64_t i)
 {
     int log = -1;
@@ -295,6 +332,7 @@ static inline int nvme_ilog2(uint64_t i)
 
 static inline void _hw_nvme_check_size(void)
 {
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneMeta) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeZoneList) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeZone) != 88);
 }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v2 18/18] hw/block/nvme: Document zoned parameters in usage text
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (16 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 17/18] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
@ 2020-06-17 21:34 ` Dmitry Fomichev
  2020-06-29 20:26 ` [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
  18 siblings, 0 replies; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-17 21:34 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Add brief descriptions of the new driver properties that are now
available to users for configuring the features of the Zoned
Namespace Command Set in the driver.

This patch is documentation only; there is no functional change.
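
For a concrete picture, a hypothetical invocation (all values below
are made-up) that exercises several of the documented options might
look like:

  -device nvme,drive=nvme0,serial=dead0001,zoned=true, \
          zone_size=128,zone_capacity=96,max_open=16,max_active=32, \
          offline_zones=2,rdonly_zones=2,zone_file=zns-meta.bin

As a worked example of the documented ZAMDS formula, a 128 KiB zone
append limit with a 4 KiB page size gives
ZAMDS = log2(128 KiB / 4 KiB) = 5.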

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 63e7a6352e..90b1ae24b5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specs: http://www.nvmexpress.org, 1.4, 1.3, 1.2, 1.1, 1.0e
  *
  *  http://www.nvmexpress.org/resources/
  */
@@ -20,7 +20,8 @@
  *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
  *              cmb_size_mb=<cmb_size_mb[optional]>, \
  *              [pmrdev=<mem_backend_file_id>,] \
- *              max_ioqpairs=<N[optional]>
+ *              max_ioqpairs=<N[optional]> \
+ *              zoned=<true|false[optional]>
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -32,6 +33,63 @@
  * For example:
  * -object memory-backend-file,id=<mem_id>,share=on,mem-path=<file_path>, \
  *  size=<size> .... -device nvme,...,pmrdev=<mem_id>
+ *
+ * Setting "zoned" to true makes the driver emulate zoned namespaces.
+ * In this case, the following options are available to configure zoned
+ * operation:
+ *     zone_size=<zone size in MiB>
+ *
+ *     zone_capacity=<zone capacity in MiB, default: zone_size>
+ *
+ *     zone_file=<zone metadata file name, default: none>
+ *         Zone metadata file, if specified, allows zone information
+ *         to be persistent across shutdowns and restarts.
+ *
+ *     zone_descr_ext_size=<zone descriptor extension size, default: 0>
+ *         This value needs to be specified in 64B units. If it is zero,
+ *         namespace(s) will not support zone descriptor extensions.
+ *
+ *     max_active=<Maximum Active Resources (zones), default: 0 - no limit>
+ *
+ *     max_open=<Maximum Open Resources (zones), default: 0 - no limit>
+ *
+ *     reset_rcmnd_delay=<Reset Zone Recommended Delay in milliseconds>
+ *         The amount of time that passes between the moment when a zone
+ *         enters Full state and when Reset Zone Recommended attribute
+ *         is set for that zone.
+ *
+ *     reset_rcmnd_limit=<Reset Zone Recommended Limit in milliseconds>
+ *         If this value is zero (default), RZR attribute is not set for
+ *         any zones.
+ *
+ *     finish_rcmnd_delay=<Finish Zone Recommended Delay in milliseconds>
+ *         The amount of time that passes between the moment when a zone
+ *         enters an Open or Closed state and when Finish Zone Recommended
+ *         attribute is set for that zone.
+ *
+ *     finish_rcmnd_limit=<Finish Zone Recommended Limit in milliseconds>
+ *         If this value is zero (default), FZR attribute is not set for
+ *         any zones.
+ *
+ *     zone_append_max_size=<zone append maximum data size, in KiB,
+ *         default: 128>
+ *         The maximum I/O size that can be supported by the Zone Append
+ *         command. Since internally this value is maintained as
+ *         ZAMDS = log2(<maximum append size> / <page size>), some
+ *         values assigned to this property may be rounded down and
+ *         result in a lower maximum ZA data size being in effect.
+ *
+ *     zone_async_events=<send zone Async Events, default: true>
+ *         Enable sending Zone Descriptor Changed AENs to the host.
+ *
+ *     offline_zones=<the number of offline zones to inject, default: 0>
+ *
+ *     rdonly_zones=<the number of read-only zones to inject, default: 0>
+ *
+ *     cross_zone_read=<enables Read Across Zone Boundaries, default: true>
+ *
+ *     fill_pattern=<data fill pattern, default: 0x00>
+ *         Byte pattern to return for any portions of unwritten data
+ *         during read.
  */
 
 #include "qemu/osdep.h"
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* RE: [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
  2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
                   ` (17 preceding siblings ...)
  2020-06-17 21:34 ` [PATCH v2 18/18] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
@ 2020-06-29 20:26 ` Dmitry Fomichev
  18 siblings, 0 replies; 49+ messages in thread
From: Dmitry Fomichev @ 2020-06-29 20:26 UTC (permalink / raw)
  To: Kevin Wolf, Keith Busch, Philippe Mathieu-Daudé, Maxim Levitsky
  Cc: Niklas Cassel, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Matias Bjorling

Bump... Any feedback on this series?

> -----Original Message-----
> From: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> Sent: Wednesday, June 17, 2020 5:34 PM
> To: Kevin Wolf <kwolf@redhat.com>; Keith Busch <kbusch@kernel.org>;
> Philippe Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsky@redhat.com>
> Cc: qemu-block@nongnu.org; qemu-devel@nongnu.org; Matias Bjorling
> <Matias.Bjorling@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> Niklas Cassel <Niklas.Cassel@wdc.com>; Dmitry Fomichev
> <Dmitry.Fomichev@wdc.com>
> Subject: [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and
> Zoned Namespace Command Set
> 
> v2: rebased on top of block-next/block branch
> 
> Zoned Namespace (ZNS) Command Set is a newly introduced command set
> published by the NVM Express, Inc. organization as TP 4053. The main
> design goals of ZNS are to provide hardware designers the means to
> reduce NVMe controller complexity and to allow achieving a better I/O
> latency and throughput. SSDs that implement this interface are
> commonly known as ZNS SSDs.
> 
> This command set is implementing a zoned storage model, similarly to
> ZAC/ZBC. As such, there is already support in Linux, allowing one to
> perform the majority of tasks needed for managing ZNS SSDs.
> 
> The Zoned Namespace Command Set relies on another TP, known as
> Namespace Types (NVMe TP 4056), which introduces support for having
> multiple command sets per namespace.
> 
> Both ZNS and Namespace Types specifications can be downloaded by
> visiting the following link -
> 
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-
> TPs.zip
> 
> This patch series adds Namespace Types support and zoned namespace
> emulation capability to the existing NVMe PCI driver.
> 
> The patchset is organized as follows -
> 
> The first several patches are preparatory and are added to allow for
> an easier review of the subsequent commits. The group of patches that
> follows adds NS Types support with only NVM Command Set being
> available. Finally, the last group of commits makes definitions and
> adds new code to support Zoned Namespace Command Set.
> 
> Based-on: <20200609205944.3549240-1-eblake@redhat.com>
> 
> Ajay Joshi (1):
>   hw/block/nvme: Define 64 bit cqe.result
> 
> Dmitry Fomichev (15):
>   hw/block/nvme: Move NvmeRequest has_sg field to a bit flag
>   hw/block/nvme: Clean up unused AER definitions
>   hw/block/nvme: Add Commands Supported and Effects log
>   hw/block/nvme: Define trace events related to NS Types
>   hw/block/nvme: Make Zoned NS Command Set definitions
>   hw/block/nvme: Define Zoned NS Command Set trace events
>   hw/block/nvme: Support Zoned Namespace Command Set
>   hw/block/nvme: Introduce max active and open zone limits
>   hw/block/nvme: Simulate Zone Active excursions
>   hw/block/nvme: Set Finish/Reset Zone Recommended attributes
>   hw/block/nvme: Generate zone AENs
>   hw/block/nvme: Support Zone Descriptor Extensions
>   hw/block/nvme: Add injection of Offline/Read-Only zones
>   hw/block/nvme: Use zone metadata file for persistence
>   hw/block/nvme: Document zoned parameters in usage text
> 
> Niklas Cassel (2):
>   hw/block/nvme: Introduce the Namespace Types definitions
>   hw/block/nvme: Add support for Namespace Types
> 
>  block/nvme.c          |    2 +-
>  block/trace-events    |    2 +-
>  hw/block/nvme.c       | 2316
> ++++++++++++++++++++++++++++++++++++++++-
>  hw/block/nvme.h       |  228 +++-
>  hw/block/trace-events |   56 +
>  include/block/nvme.h  |  282 ++++-
>  6 files changed, 2820 insertions(+), 66 deletions(-)
> 
> --
> 2.21.0



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 01/18] hw/block/nvme: Move NvmeRequest has_sg field to a bit flag
  2020-06-17 21:33 ` [PATCH v2 01/18] hw/block/nvme: Move NvmeRequest has_sg field to a bit flag Dmitry Fomichev
@ 2020-06-30  0:56   ` Alistair Francis
  2020-06-30  4:09   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Alistair Francis @ 2020-06-30  0:56 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 2:43 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> In addition to the existing has_sg flag, a few more Boolean
> NvmeRequest flags are going to be introduced in subsequent patches.
> Convert "has_sg" variable to "flags" and define NvmeRequestFlags
> enum for individual flag values.
>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Alistair Francis <alistair.francis@wdc.com>

Alistair

> ---
>  hw/block/nvme.c | 8 +++-----
>  hw/block/nvme.h | 6 +++++-
>  2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 1aee042d4c..3ed9f3d321 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -350,7 +350,7 @@ static void nvme_rw_cb(void *opaque, int ret)
>          block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
>          req->status = NVME_INTERNAL_DEV_ERROR;
>      }
> -    if (req->has_sg) {
> +    if (req->flags & NVME_REQ_FLG_HAS_SG) {
>          qemu_sglist_destroy(&req->qsg);
>      }
>      nvme_enqueue_req_completion(cq, req);
> @@ -359,7 +359,6 @@ static void nvme_rw_cb(void *opaque, int ret)
>  static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>      NvmeRequest *req)
>  {
> -    req->has_sg = false;
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
>           BLOCK_ACCT_FLUSH);
>      req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
> @@ -383,7 +382,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>          return NVME_LBA_RANGE | NVME_DNR;
>      }
>
> -    req->has_sg = false;
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
>                       BLOCK_ACCT_WRITE);
>      req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
> @@ -422,14 +420,13 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>
>      dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
>      if (req->qsg.nsg > 0) {
> -        req->has_sg = true;
> +        req->flags |= NVME_REQ_FLG_HAS_SG;
>          req->aiocb = is_write ?
>              dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
>                            nvme_rw_cb, req) :
>              dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
>                           nvme_rw_cb, req);
>      } else {
> -        req->has_sg = false;
>          req->aiocb = is_write ?
>              blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
>                              req) :
> @@ -917,6 +914,7 @@ static void nvme_process_sq(void *opaque)
>          QTAILQ_REMOVE(&sq->req_list, req, entry);
>          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
>          memset(&req->cqe, 0, sizeof(req->cqe));
> +        req->flags = 0;
>          req->cqe.cid = cmd.cid;
>
>          status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 1d30c0bca2..0460cc0e62 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -16,11 +16,15 @@ typedef struct NvmeAsyncEvent {
>      NvmeAerResult result;
>  } NvmeAsyncEvent;
>
> +enum NvmeRequestFlags {
> +    NVME_REQ_FLG_HAS_SG   = 1 << 0,
> +};
> +
>  typedef struct NvmeRequest {
>      struct NvmeSQueue       *sq;
>      BlockAIOCB              *aiocb;
>      uint16_t                status;
> -    bool                    has_sg;
> +    uint16_t                flags;
>      NvmeCqe                 cqe;
>      BlockAcctCookie         acct;
>      QEMUSGList              qsg;
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 02/18] hw/block/nvme: Define 64 bit cqe.result
  2020-06-17 21:33 ` [PATCH v2 02/18] hw/block/nvme: Define 64 bit cqe.result Dmitry Fomichev
@ 2020-06-30  0:58   ` Alistair Francis
  2020-06-30  4:15   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Alistair Francis @ 2020-06-30  0:58 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 2:44 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> From: Ajay Joshi <ajay.joshi@wdc.com>
>
> A new write command, Zone Append, is added as a part of Zoned
> Namespace Command Set. Upon successful completion of this command,
> the controller returns the start LBA of the performed write operation
> in cqe.result field. Therefore, the maximum size of this variable
> needs to be changed from 32 to 64 bit, consuming the reserved 32 bit
> field that follows the result in CQE struct. Since the existing
> commands are expected to return a 32 bit LE value, two separate
> variables, result32 and result64, are now kept in a union.
>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Alistair Francis <alistair.francis@wdc.com>

Alistair

> ---
>  block/nvme.c         | 2 +-
>  block/trace-events   | 2 +-
>  hw/block/nvme.c      | 6 +++---
>  include/block/nvme.h | 6 ++++--
>  4 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/block/nvme.c b/block/nvme.c
> index eb2f54dd9d..ca245ec574 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -287,7 +287,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
>  {
>      uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
>      if (status) {
> -        trace_nvme_error(le32_to_cpu(c->result),
> +        trace_nvme_error(le64_to_cpu(c->result64),
>                           le16_to_cpu(c->sq_head),
>                           le16_to_cpu(c->sq_id),
>                           le16_to_cpu(c->cid),
> diff --git a/block/trace-events b/block/trace-events
> index 29dff8881c..05c1393943 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -156,7 +156,7 @@ vxhs_get_creds(const char *cacert, const char *client_key, const char *client_ce
>  # nvme.c
>  nvme_kick(void *s, int queue) "s %p queue %d"
>  nvme_dma_flush_queue_wait(void *s) "s %p"
> -nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
> +nvme_error(uint64_t cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %ld sq_head %d sqid %d cid %d status 0x%x"
>  nvme_process_completion(void *s, int index, int inflight) "s %p queue %d inflight %d"
>  nvme_process_completion_queue_busy(void *s, int index) "s %p queue %d"
>  nvme_complete_command(void *s, int index, int cid) "s %p queue %d cid %d"
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 3ed9f3d321..a1bbc9acde 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -823,7 +823,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>
> -    req->cqe.result = result;
> +    req->cqe.result32 = result;
>      return NVME_SUCCESS;
>  }
>
> @@ -859,8 +859,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>                                      ((dw11 >> 16) & 0xFFFF) + 1,
>                                      n->params.max_ioqpairs,
>                                      n->params.max_ioqpairs);
> -        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
> -                                      ((n->params.max_ioqpairs - 1) << 16));
> +        req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
> +                                        ((n->params.max_ioqpairs - 1) << 16));
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_set_feature_timestamp(n, cmd);
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 1720ee1d51..9c3a04dcd7 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -577,8 +577,10 @@ typedef struct NvmeAerResult {
>  } NvmeAerResult;
>
>  typedef struct NvmeCqe {
> -    uint32_t    result;
> -    uint32_t    rsvd;
> +    union {
> +        uint64_t     result64;
> +        uint32_t     result32;
> +    };
>      uint16_t    sq_head;
>      uint16_t    sq_id;
>      uint16_t    cid;
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 03/18] hw/block/nvme: Clean up unused AER definitions
  2020-06-17 21:34 ` [PATCH v2 03/18] hw/block/nvme: Clean up unused AER definitions Dmitry Fomichev
@ 2020-06-30  1:00   ` Alistair Francis
  2020-06-30  4:40   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Alistair Francis @ 2020-06-30  1:00 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 2:48 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> Removed unused struct NvmeAerResult and SMART-related async event
> codes. All other event codes are now categorized by their type.
> This avoids having to define the same values in a single enum,
> NvmeAsyncEventRequest, that is now removed.
>
> Later commits in this series will define additional values in some
> of these enums. No functional change.
>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Alistair Francis <alistair.francis@wdc.com>

Alistair

> ---
>  hw/block/nvme.h      |  1 -
>  include/block/nvme.h | 43 ++++++++++++++++++++++---------------------
>  2 files changed, 22 insertions(+), 22 deletions(-)
>
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 0460cc0e62..4f0dac39ae 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -13,7 +13,6 @@ typedef struct NvmeParams {
>
>  typedef struct NvmeAsyncEvent {
>      QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
> -    NvmeAerResult result;
>  } NvmeAsyncEvent;
>
>  enum NvmeRequestFlags {
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 9c3a04dcd7..3099df99eb 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -553,28 +553,30 @@ typedef struct NvmeDsmRange {
>      uint64_t    slba;
>  } NvmeDsmRange;
>
> -enum NvmeAsyncEventRequest {
> -    NVME_AER_TYPE_ERROR                     = 0,
> -    NVME_AER_TYPE_SMART                     = 1,
> -    NVME_AER_TYPE_IO_SPECIFIC               = 6,
> -    NVME_AER_TYPE_VENDOR_SPECIFIC           = 7,
> -    NVME_AER_INFO_ERR_INVALID_SQ            = 0,
> -    NVME_AER_INFO_ERR_INVALID_DB            = 1,
> -    NVME_AER_INFO_ERR_DIAG_FAIL             = 2,
> -    NVME_AER_INFO_ERR_PERS_INTERNAL_ERR     = 3,
> -    NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR    = 4,
> -    NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR       = 5,
> -    NVME_AER_INFO_SMART_RELIABILITY         = 0,
> -    NVME_AER_INFO_SMART_TEMP_THRESH         = 1,
> -    NVME_AER_INFO_SMART_SPARE_THRESH        = 2,
> +enum NvmeAsyncEventType {
> +    NVME_AER_TYPE_ERROR                     = 0x00,
> +    NVME_AER_TYPE_SMART                     = 0x01,
> +    NVME_AER_TYPE_NOTICE                    = 0x02,
> +    NVME_AER_TYPE_CMDSET_SPECIFIC           = 0x06,
> +    NVME_AER_TYPE_VENDOR_SPECIFIC           = 0x07,
>  };
>
> -typedef struct NvmeAerResult {
> -    uint8_t event_type;
> -    uint8_t event_info;
> -    uint8_t log_page;
> -    uint8_t resv;
> -} NvmeAerResult;
> +enum NvmeAsyncErrorInfo {
> +    NVME_AER_ERR_INVALID_SQ                 = 0x00,
> +    NVME_AER_ERR_INVALID_DB                 = 0x01,
> +    NVME_AER_ERR_DIAG_FAIL                  = 0x02,
> +    NVME_AER_ERR_PERS_INTERNAL_ERR          = 0x03,
> +    NVME_AER_ERR_TRANS_INTERNAL_ERR         = 0x04,
> +    NVME_AER_ERR_FW_IMG_LOAD_ERR            = 0x05,
> +};
> +
> +enum NvmeAsyncNoticeInfo {
> +    NVME_AER_NOTICE_NS_CHANGED              = 0x00,
> +};
> +
> +enum NvmeAsyncEventCfg {
> +    NVME_AEN_CFG_NS_ATTR                    = 1 << 8,
> +};
>
>  typedef struct NvmeCqe {
>      union {
> @@ -881,7 +883,6 @@ enum NvmeIdNsDps {
>
>  static inline void _nvme_check_size(void)
>  {
> -    QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64);
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 04/18] hw/block/nvme: Add Commands Supported and Effects log
  2020-06-17 21:34 ` [PATCH v2 04/18] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
@ 2020-06-30  1:35   ` Alistair Francis
  2020-06-30  4:46   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Alistair Francis @ 2020-06-30  1:35 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 3:05 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> This log page becomes necessary to implement to allow checking for
> Zone Append command support in Zoned Namespace Command Set.
>
> This commit adds the code to report this log page for NVM Command
> Set only. The parts that are specific to zoned operation will be
> added later in the series.
>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Acked-by: Alistair Francis <alistair.francis@wdc.com>

Alistair

> ---
>  hw/block/nvme.c       | 62 +++++++++++++++++++++++++++++++++++++++++++
>  hw/block/trace-events |  4 +++
>  include/block/nvme.h  | 18 +++++++++++++
>  3 files changed, 84 insertions(+)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index a1bbc9acde..03b8deee85 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -871,6 +871,66 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>
> +static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
> +    uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len)
> +{
> +    NvmeEffectsLog cmd_eff_log = {};
> +    uint32_t *iocs = cmd_eff_log.iocs;
> +
> +    trace_pci_nvme_cmd_supp_and_effects_log_read();
> +
> +    if (ofs != 0) {
> +        trace_pci_nvme_err_invalid_effects_log_offset(ofs);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +    if (len != sizeof(cmd_eff_log)) {
> +        trace_pci_nvme_err_invalid_effects_log_len(len);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    iocs[NVME_ADM_CMD_DELETE_SQ] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_CREATE_SQ] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_DELETE_CQ] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_CREATE_CQ] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_IDENTIFY] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
> +
> +    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +    iocs[NVME_CMD_WRITE_ZEROS] = NVME_CMD_EFFECTS_CSUPP |
> +                                 NVME_CMD_EFFECTS_LBCC;
> +    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +
> +    return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log, len, prp1, prp2);
> +}
> +
> +static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
> +{
> +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> +    uint64_t dw12 = le32_to_cpu(cmd->cdw12);
> +    uint64_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint64_t ofs = (dw13 << 32) | dw12;
> +    uint32_t numdl, numdu, len;
> +    uint16_t lid = dw10 & 0xff;
> +
> +    numdl = dw10 >> 16;
> +    numdu = dw11 & 0xffff;
> +    len = (((numdu << 16) | numdl) + 1) << 2;
> +
> +    switch (lid) {
> +    case NVME_LOG_CMD_EFFECTS:
> +        return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len);
> +    }
> +
> +    trace_pci_nvme_unsupported_log_page(lid);
> +    return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
>  static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>      switch (cmd->opcode) {
> @@ -888,6 +948,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_set_feature(n, cmd, req);
>      case NVME_ADM_CMD_GET_FEATURES:
>          return nvme_get_feature(n, cmd, req);
> +    case NVME_ADM_CMD_GET_LOG_PAGE:
> +        return nvme_get_log_page(n, cmd);
>      default:
>          trace_pci_nvme_err_invalid_admin_opc(cmd->opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 958fcc5508..423d491e27 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -58,6 +58,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
>  pci_nvme_mmio_stopped(void) "cleared controller enable bit"
>  pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
>  pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
> +pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
>
>  # nvme traces for error conditions
>  pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> @@ -69,6 +70,8 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not w
>  pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> +pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
> +pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and effects log size is 4096, got %"PRIu32""
>  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
> @@ -123,6 +126,7 @@ pci_nvme_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for
>  pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
>  pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
>  pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
> +pci_nvme_unsupported_log_page(uint16_t lid) "unsupported log page 0x%"PRIx16""
>
>  # xen-block.c
>  xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 3099df99eb..6a58bac0c2 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -691,10 +691,27 @@ enum NvmeSmartWarn {
>      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
>  };
>
> +typedef struct NvmeEffectsLog {
> +    uint32_t      acs[256];
> +    uint32_t      iocs[256];
> +    uint8_t       resv[2048];
> +} NvmeEffectsLog;
> +
> +enum {
> +    NVME_CMD_EFFECTS_CSUPP             = 1 << 0,
> +    NVME_CMD_EFFECTS_LBCC              = 1 << 1,
> +    NVME_CMD_EFFECTS_NCC               = 1 << 2,
> +    NVME_CMD_EFFECTS_NIC               = 1 << 3,
> +    NVME_CMD_EFFECTS_CCC               = 1 << 4,
> +    NVME_CMD_EFFECTS_CSE_MASK          = 3 << 16,
> +    NVME_CMD_EFFECTS_UUID_SEL          = 1 << 19,
> +};
> +
>  enum LogIdentifier {
>      NVME_LOG_ERROR_INFO     = 0x01,
>      NVME_LOG_SMART_INFO     = 0x02,
>      NVME_LOG_FW_SLOT_INFO   = 0x03,
> +    NVME_LOG_CMD_EFFECTS    = 0x05,
>  };
>
>  typedef struct NvmePSD {
> @@ -898,5 +915,6 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
>  }
>  #endif
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions
  2020-06-17 21:34 ` [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
@ 2020-06-30  2:12   ` Alistair Francis
  2020-06-30 10:02     ` Niklas Cassel
  2020-06-30  4:57   ` Klaus Jensen
  1 sibling, 1 reply; 49+ messages in thread
From: Alistair Francis @ 2020-06-30  2:12 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 2:47 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> From: Niklas Cassel <niklas.cassel@wdc.com>
>
> Define the structures and constants required to implement
> Namespace Types support.
>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.h      |  3 ++
>  include/block/nvme.h | 75 +++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 73 insertions(+), 5 deletions(-)
>
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 4f0dac39ae..4fd155c409 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -63,6 +63,9 @@ typedef struct NvmeCQueue {
>
>  typedef struct NvmeNamespace {
>      NvmeIdNs        id_ns;
> +    uint32_t        nsid;
> +    uint8_t         csi;
> +    QemuUUID        uuid;
>  } NvmeNamespace;
>
>  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 6a58bac0c2..5a1e5e137c 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -50,6 +50,11 @@ enum NvmeCapMask {
>      CAP_PMR_MASK       = 0x1,
>  };
>
> +enum NvmeCapCssBits {
> +    CAP_CSS_NVM        = 0x01,
> +    CAP_CSS_CSI_SUPP   = 0x40,
> +};
> +
>  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
>  #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)    & CAP_CQR_MASK)
>  #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)    & CAP_AMS_MASK)
> @@ -101,6 +106,12 @@ enum NvmeCcMask {
>      CC_IOCQES_MASK  = 0xf,
>  };
>
> +enum NvmeCcCss {
> +    CSS_NVM_ONLY        = 0,
> +    CSS_ALL_NSTYPES     = 6,
> +    CSS_ADMIN_ONLY      = 7,
> +};
> +
>  #define NVME_CC_EN(cc)     ((cc >> CC_EN_SHIFT)     & CC_EN_MASK)
>  #define NVME_CC_CSS(cc)    ((cc >> CC_CSS_SHIFT)    & CC_CSS_MASK)
>  #define NVME_CC_MPS(cc)    ((cc >> CC_MPS_SHIFT)    & CC_MPS_MASK)
> @@ -109,6 +120,21 @@ enum NvmeCcMask {
>  #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
>  #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
>
> +#define NVME_SET_CC_EN(cc, val)     \
> +    (cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
> +#define NVME_SET_CC_CSS(cc, val)    \
> +    (cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
> +#define NVME_SET_CC_MPS(cc, val)    \
> +    (cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
> +#define NVME_SET_CC_AMS(cc, val)    \
> +    (cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
> +#define NVME_SET_CC_SHN(cc, val)    \
> +    (cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
> +#define NVME_SET_CC_IOSQES(cc, val) \
> +    (cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
> +#define NVME_SET_CC_IOCQES(cc, val) \
> +    (cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
> +
>  enum NvmeCstsShift {
>      CSTS_RDY_SHIFT      = 0,
>      CSTS_CFS_SHIFT      = 1,
> @@ -482,10 +508,41 @@ typedef struct NvmeIdentify {
>      uint64_t    rsvd2[2];
>      uint64_t    prp1;
>      uint64_t    prp2;
> -    uint32_t    cns;
> -    uint32_t    rsvd11[5];
> +    uint8_t     cns;
> +    uint8_t     rsvd4;
> +    uint16_t    ctrlid;

Shouldn't this be CNTID?

Alistair

> +    uint16_t    nvmsetid;
> +    uint8_t     rsvd3;
> +    uint8_t     csi;
> +    uint32_t    rsvd12[4];
>  } NvmeIdentify;
>
> +typedef struct NvmeNsIdDesc {
> +    uint8_t     nidt;
> +    uint8_t     nidl;
> +    uint16_t    rsvd2;
> +} NvmeNsIdDesc;
> +
> +enum NvmeNidType {
> +    NVME_NIDT_EUI64             = 0x01,
> +    NVME_NIDT_NGUID             = 0x02,
> +    NVME_NIDT_UUID              = 0x03,
> +    NVME_NIDT_CSI               = 0x04,
> +};
> +
> +enum NvmeNidLength {
> +    NVME_NIDL_EUI64             = 8,
> +    NVME_NIDL_NGUID             = 16,
> +    NVME_NIDL_UUID              = 16,
> +    NVME_NIDL_CSI               = 1,
> +};
> +
> +enum NvmeCsi {
> +    NVME_CSI_NVM                = 0x00,
> +};
> +
> +#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
> +
>  typedef struct NvmeRwCmd {
>      uint8_t     opcode;
>      uint8_t     flags;
> @@ -603,6 +660,7 @@ enum NvmeStatusCodes {
>      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
>      NVME_INVALID_NSID           = 0x000b,
>      NVME_CMD_SEQ_ERROR          = 0x000c,
> +    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
>      NVME_LBA_RANGE              = 0x0080,
>      NVME_CAP_EXCEEDED           = 0x0081,
>      NVME_NS_NOT_READY           = 0x0082,
> @@ -729,9 +787,14 @@ typedef struct NvmePSD {
>  #define NVME_IDENTIFY_DATA_SIZE 4096
>
>  enum {
> -    NVME_ID_CNS_NS             = 0x0,
> -    NVME_ID_CNS_CTRL           = 0x1,
> -    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
> +    NVME_ID_CNS_NS                = 0x0,
> +    NVME_ID_CNS_CTRL              = 0x1,
> +    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x2,
> +    NVME_ID_CNS_NS_DESC_LIST      = 0x03,
> +    NVME_ID_CNS_CS_NS             = 0x05,
> +    NVME_ID_CNS_CS_CTRL           = 0x06,
> +    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
> +    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
>  };
>
>  typedef struct NvmeIdCtrl {
> @@ -825,6 +888,7 @@ enum NvmeFeatureIds {
>      NVME_WRITE_ATOMICITY            = 0xa,
>      NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
>      NVME_TIMESTAMP                  = 0xe,
> +    NVME_COMMAND_SET_PROFILE        = 0x19,
>      NVME_SOFTWARE_PROGRESS_MARKER   = 0x80
>  };
>
> @@ -914,6 +978,7 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeNsIdDesc) != 4);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
>  }
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 01/18] hw/block/nvme: Move NvmeRequest has_sg field to a bit flag
  2020-06-17 21:33 ` [PATCH v2 01/18] hw/block/nvme: Move NvmeRequest has_sg field to a bit flag Dmitry Fomichev
  2020-06-30  0:56   ` Alistair Francis
@ 2020-06-30  4:09   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30  4:09 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:33, Dmitry Fomichev wrote:
> In addition to the existing has_sg flag, a few more Boolean
> NvmeRequest flags are going to be introduced in subsequent patches.
> Convert "has_sg" variable to "flags" and define NvmeRequestFlags
> enum for individual flag values.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
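
A minimal sketch of the set/test/clear pattern that the flags field
enables, e.g. factoring the SG teardown into a helper (the helper name
here is hypothetical; the flag is the one this patch defines):

    static void nvme_req_clear_sg(NvmeRequest *req)
    {
        if (req->flags & NVME_REQ_FLG_HAS_SG) {
            qemu_sglist_destroy(&req->qsg);
            req->flags &= ~NVME_REQ_FLG_HAS_SG;
        }
    }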

> ---
>  hw/block/nvme.c | 8 +++-----
>  hw/block/nvme.h | 6 +++++-
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 1aee042d4c..3ed9f3d321 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -350,7 +350,7 @@ static void nvme_rw_cb(void *opaque, int ret)
>          block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
>          req->status = NVME_INTERNAL_DEV_ERROR;
>      }
> -    if (req->has_sg) {
> +    if (req->flags & NVME_REQ_FLG_HAS_SG) {
>          qemu_sglist_destroy(&req->qsg);
>      }
>      nvme_enqueue_req_completion(cq, req);
> @@ -359,7 +359,6 @@ static void nvme_rw_cb(void *opaque, int ret)
>  static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>      NvmeRequest *req)
>  {
> -    req->has_sg = false;
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
>           BLOCK_ACCT_FLUSH);
>      req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
> @@ -383,7 +382,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>          return NVME_LBA_RANGE | NVME_DNR;
>      }
>  
> -    req->has_sg = false;
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
>                       BLOCK_ACCT_WRITE);
>      req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
> @@ -422,14 +420,13 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>  
>      dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
>      if (req->qsg.nsg > 0) {
> -        req->has_sg = true;
> +        req->flags |= NVME_REQ_FLG_HAS_SG;
>          req->aiocb = is_write ?
>              dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
>                            nvme_rw_cb, req) :
>              dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
>                           nvme_rw_cb, req);
>      } else {
> -        req->has_sg = false;
>          req->aiocb = is_write ?
>              blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
>                              req) :
> @@ -917,6 +914,7 @@ static void nvme_process_sq(void *opaque)
>          QTAILQ_REMOVE(&sq->req_list, req, entry);
>          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
>          memset(&req->cqe, 0, sizeof(req->cqe));
> +        req->flags = 0;
>          req->cqe.cid = cmd.cid;
>  
>          status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 1d30c0bca2..0460cc0e62 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -16,11 +16,15 @@ typedef struct NvmeAsyncEvent {
>      NvmeAerResult result;
>  } NvmeAsyncEvent;
>  
> +enum NvmeRequestFlags {
> +    NVME_REQ_FLG_HAS_SG   = 1 << 0,
> +};
> +
>  typedef struct NvmeRequest {
>      struct NvmeSQueue       *sq;
>      BlockAIOCB              *aiocb;
>      uint16_t                status;
> -    bool                    has_sg;
> +    uint16_t                flags;
>      NvmeCqe                 cqe;
>      BlockAcctCookie         acct;
>      QEMUSGList              qsg;
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 02/18] hw/block/nvme: Define 64 bit cqe.result
  2020-06-17 21:33 ` [PATCH v2 02/18] hw/block/nvme: Define 64 bit cqe.result Dmitry Fomichev
  2020-06-30  0:58   ` Alistair Francis
@ 2020-06-30  4:15   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30  4:15 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:33, Dmitry Fomichev wrote:
> From: Ajay Joshi <ajay.joshi@wdc.com>
> 
> A new write command, Zone Append, is added as a part of the Zoned
> Namespace Command Set. Upon successful completion of this command,
> the controller returns the start LBA of the performed write operation
> in the cqe.result field. Therefore, the maximum size of this variable
> needs to be changed from 32 to 64 bits, consuming the reserved 32 bit
> field that follows the result in the CQE struct. Since the existing
> commands are expected to return a 32 bit LE value, two separate
> variables, result32 and result64, are now kept in a union.
> 
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
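
For context, a rough sketch of how the widened field gets used once
Zone Append lands later in the series (function and variable names
here are hypothetical):

    static void nvme_zone_append_done(NvmeRequest *req, uint64_t slba)
    {
        /* Zone Append returns the 64-bit start LBA of the write */
        req->cqe.result64 = cpu_to_le64(slba);
    }

    static void nvme_report_queues(NvmeRequest *req, uint32_t count)
    {
        /* existing commands keep using the 32-bit view of the union */
        req->cqe.result32 = cpu_to_le32(count);
    }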

> ---
>  block/nvme.c         | 2 +-
>  block/trace-events   | 2 +-
>  hw/block/nvme.c      | 6 +++---
>  include/block/nvme.h | 6 ++++--
>  4 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index eb2f54dd9d..ca245ec574 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -287,7 +287,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
>  {
>      uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
>      if (status) {
> -        trace_nvme_error(le32_to_cpu(c->result),
> +        trace_nvme_error(le64_to_cpu(c->result64),
>                           le16_to_cpu(c->sq_head),
>                           le16_to_cpu(c->sq_id),
>                           le16_to_cpu(c->cid),
> diff --git a/block/trace-events b/block/trace-events
> index 29dff8881c..05c1393943 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -156,7 +156,7 @@ vxhs_get_creds(const char *cacert, const char *client_key, const char *client_ce
>  # nvme.c
>  nvme_kick(void *s, int queue) "s %p queue %d"
>  nvme_dma_flush_queue_wait(void *s) "s %p"
> -nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %d sq_head %d sqid %d cid %d status 0x%x"
> +nvme_error(uint64_t cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific %"PRIu64" sq_head %d sqid %d cid %d status 0x%x"
>  nvme_process_completion(void *s, int index, int inflight) "s %p queue %d inflight %d"
>  nvme_process_completion_queue_busy(void *s, int index) "s %p queue %d"
>  nvme_complete_command(void *s, int index, int cid) "s %p queue %d cid %d"
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 3ed9f3d321..a1bbc9acde 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -823,7 +823,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    req->cqe.result = result;
> +    req->cqe.result32 = result;
>      return NVME_SUCCESS;
>  }
>  
> @@ -859,8 +859,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>                                      ((dw11 >> 16) & 0xFFFF) + 1,
>                                      n->params.max_ioqpairs,
>                                      n->params.max_ioqpairs);
> -        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
> -                                      ((n->params.max_ioqpairs - 1) << 16));
> +        req->cqe.result32 = cpu_to_le32((n->params.max_ioqpairs - 1) |
> +                                        ((n->params.max_ioqpairs - 1) << 16));
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_set_feature_timestamp(n, cmd);
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 1720ee1d51..9c3a04dcd7 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -577,8 +577,10 @@ typedef struct NvmeAerResult {
>  } NvmeAerResult;
>  
>  typedef struct NvmeCqe {
> -    uint32_t    result;
> -    uint32_t    rsvd;
> +    union {
> +        uint64_t     result64;
> +        uint32_t     result32;
> +    };
>      uint16_t    sq_head;
>      uint16_t    sq_id;
>      uint16_t    cid;
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 03/18] hw/block/nvme: Clean up unused AER definitions
  2020-06-17 21:34 ` [PATCH v2 03/18] hw/block/nvme: Clean up unused AER definitions Dmitry Fomichev
  2020-06-30  1:00   ` Alistair Francis
@ 2020-06-30  4:40   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30  4:40 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> Remove the unused struct NvmeAerResult and the SMART-related async
> event codes. All other event codes are now categorized by their type.
> This avoids having to define all of these values in a single enum,
> NvmeAsyncEventRequest, which is now removed.
> 
> Later commits in this series will define additional values in some
> of these enums. No functional change.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.h      |  1 -
>  include/block/nvme.h | 43 ++++++++++++++++++++++---------------------
>  2 files changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 0460cc0e62..4f0dac39ae 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -13,7 +13,6 @@ typedef struct NvmeParams {
>  
>  typedef struct NvmeAsyncEvent {
>      QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
> -    NvmeAerResult result;
>  } NvmeAsyncEvent;
>  
>  enum NvmeRequestFlags {
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 9c3a04dcd7..3099df99eb 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -553,28 +553,30 @@ typedef struct NvmeDsmRange {
>      uint64_t    slba;
>  } NvmeDsmRange;
>  
> -enum NvmeAsyncEventRequest {
> -    NVME_AER_TYPE_ERROR                     = 0,
> -    NVME_AER_TYPE_SMART                     = 1,
> -    NVME_AER_TYPE_IO_SPECIFIC               = 6,
> -    NVME_AER_TYPE_VENDOR_SPECIFIC           = 7,
> -    NVME_AER_INFO_ERR_INVALID_SQ            = 0,
> -    NVME_AER_INFO_ERR_INVALID_DB            = 1,
> -    NVME_AER_INFO_ERR_DIAG_FAIL             = 2,
> -    NVME_AER_INFO_ERR_PERS_INTERNAL_ERR     = 3,
> -    NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR    = 4,
> -    NVME_AER_INFO_ERR_FW_IMG_LOAD_ERR       = 5,
> -    NVME_AER_INFO_SMART_RELIABILITY         = 0,
> -    NVME_AER_INFO_SMART_TEMP_THRESH         = 1,
> -    NVME_AER_INFO_SMART_SPARE_THRESH        = 2,
> +enum NvmeAsyncEventType {
> +    NVME_AER_TYPE_ERROR                     = 0x00,
> +    NVME_AER_TYPE_SMART                     = 0x01,
> +    NVME_AER_TYPE_NOTICE                    = 0x02,
> +    NVME_AER_TYPE_CMDSET_SPECIFIC           = 0x06,
> +    NVME_AER_TYPE_VENDOR_SPECIFIC           = 0x07,
>  };
>  
> -typedef struct NvmeAerResult {
> -    uint8_t event_type;
> -    uint8_t event_info;
> -    uint8_t log_page;
> -    uint8_t resv;
> -} NvmeAerResult;
> +enum NvmeAsyncErrorInfo {
> +    NVME_AER_ERR_INVALID_SQ                 = 0x00,
> +    NVME_AER_ERR_INVALID_DB                 = 0x01,

Since we are moving this around, can we change it to
NVME_AER_INVALID_DB_REGISTER and NVME_AER_INVALID_DB_VALUE instead? I
believe those are the terms used in the spec.
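
I.e. something like this, assuming the NVMe 1.4 values for the error
event information field (0h is a write to an invalid doorbell register,
1h is an invalid doorbell write value):

    enum NvmeAsyncErrorInfo {
        NVME_AER_ERR_INVALID_DB_REGISTER        = 0x00,
        NVME_AER_ERR_INVALID_DB_VALUE           = 0x01,
        NVME_AER_ERR_DIAG_FAIL                  = 0x02,
        NVME_AER_ERR_PERS_INTERNAL_ERR          = 0x03,
        NVME_AER_ERR_TRANS_INTERNAL_ERR         = 0x04,
        NVME_AER_ERR_FW_IMG_LOAD_ERR            = 0x05,
    };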

Otherwise,

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> +    NVME_AER_ERR_DIAG_FAIL                  = 0x02,
> +    NVME_AER_ERR_PERS_INTERNAL_ERR          = 0x03,
> +    NVME_AER_ERR_TRANS_INTERNAL_ERR         = 0x04,
> +    NVME_AER_ERR_FW_IMG_LOAD_ERR            = 0x05,
> +};
> +
> +enum NvmeAsyncNoticeInfo {
> +    NVME_AER_NOTICE_NS_CHANGED              = 0x00,
> +};
> +
> +enum NvmeAsyncEventCfg {
> +    NVME_AEN_CFG_NS_ATTR                    = 1 << 8,
> +};
>  
>  typedef struct NvmeCqe {
>      union {
> @@ -881,7 +883,6 @@ enum NvmeIdNsDps {
>  
>  static inline void _nvme_check_size(void)
>  {
> -    QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64);
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 04/18] hw/block/nvme: Add Commands Supported and Effects log
  2020-06-17 21:34 ` [PATCH v2 04/18] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
  2020-06-30  1:35   ` Alistair Francis
@ 2020-06-30  4:46   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30  4:46 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> Implementing this log page becomes necessary to allow checking for
> Zone Append command support in the Zoned Namespace Command Set.
> 
> This commit adds the code to report this log page for NVM Command
> Set only. The parts that are specific to zoned operation will be
> added later in the series.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.c       | 62 +++++++++++++++++++++++++++++++++++++++++++
>  hw/block/trace-events |  4 +++
>  include/block/nvme.h  | 18 +++++++++++++
>  3 files changed, 84 insertions(+)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index a1bbc9acde..03b8deee85 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -871,6 +871,66 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
> +    uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len)
> +{
> +    NvmeEffectsLog cmd_eff_log = {};
> +    uint32_t *iocs = cmd_eff_log.iocs;
> +
> +    trace_pci_nvme_cmd_supp_and_effects_log_read();
> +
> +    if (ofs != 0) {
> +        trace_pci_nvme_err_invalid_effects_log_offset(ofs);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +    if (len != sizeof(cmd_eff_log)) {
> +        trace_pci_nvme_err_invalid_effects_log_len(len);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }

I don't see why you cannot request a subset of the page like any other
log page?
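
A sketch of what that could look like inside nvme_handle_cmd_effects():
clamp the transfer to the tail of the page instead of insisting on an
exact match (MIN() is the helper from qemu/osdep.h):

    if (ofs > sizeof(cmd_eff_log)) {
        return NVME_INVALID_FIELD | NVME_DNR;
    }
    len = MIN(len, sizeof(cmd_eff_log) - ofs);
    return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log + ofs, len,
                             prp1, prp2);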

> +
> +    iocs[NVME_ADM_CMD_DELETE_SQ] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_CREATE_SQ] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_DELETE_CQ] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_CREATE_CQ] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_IDENTIFY] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;

These are admin commands and should go to acs.
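
I.e. roughly (a sketch showing only a few entries; acs[] is indexed by
admin opcode, iocs[] by I/O opcode):

    uint32_t *acs = cmd_eff_log.acs;

    acs[NVME_ADM_CMD_IDENTIFY] = NVME_CMD_EFFECTS_CSUPP;
    acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;

    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;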

> +
> +    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +    iocs[NVME_CMD_WRITE_ZEROS] = NVME_CMD_EFFECTS_CSUPP |
> +                                 NVME_CMD_EFFECTS_LBCC;
> +    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +
> +    return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log, len, prp1, prp2);
> +}
> +
> +static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
> +{
> +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> +    uint64_t dw12 = le32_to_cpu(cmd->cdw12);
> +    uint64_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint64_t ofs = (dw13 << 32) | dw12;
> +    uint32_t numdl, numdu, len;
> +    uint16_t lid = dw10 & 0xff;
> +
> +    numdl = dw10 >> 16;
> +    numdu = dw11 & 0xffff;
> +    len = (((numdu << 16) | numdl) + 1) << 2;
> +
> +    switch (lid) {
> +    case NVME_LOG_CMD_EFFECTS:
> +        return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len);
> +    }
> +
> +    trace_pci_nvme_unsupported_log_page(lid);
> +    return NVME_INVALID_FIELD | NVME_DNR;
> +}

The controller should set bit 2 of the LPA field to indicate support for
extended data.
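
E.g. in nvme_init_ctrl(), assuming the lpa field of NvmeIdCtrl carries
that bit:

    id->lpa = 1 << 2;    /* extended data for Get Log Page */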

> +
>  static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>      switch (cmd->opcode) {
> @@ -888,6 +948,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_set_feature(n, cmd, req);
>      case NVME_ADM_CMD_GET_FEATURES:
>          return nvme_get_feature(n, cmd, req);
> +    case NVME_ADM_CMD_GET_LOG_PAGE:
> +        return nvme_get_log_page(n, cmd);
>      default:
>          trace_pci_nvme_err_invalid_admin_opc(cmd->opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 958fcc5508..423d491e27 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -58,6 +58,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
>  pci_nvme_mmio_stopped(void) "cleared controller enable bit"
>  pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
>  pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
> +pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
>  
>  # nvme traces for error conditions
>  pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> @@ -69,6 +70,8 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not w
>  pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> +pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
> +pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and effects log size is 4096, got %"PRIu32""
>  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
> @@ -123,6 +126,7 @@ pci_nvme_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for
>  pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
>  pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
>  pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
> +pci_nvme_unsupported_log_page(uint16_t lid) "unsupported log page 0x%"PRIx16""
>  
>  # xen-block.c
>  xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 3099df99eb..6a58bac0c2 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -691,10 +691,27 @@ enum NvmeSmartWarn {
>      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
>  };
>  
> +typedef struct NvmeEffectsLog {
> +    uint32_t      acs[256];
> +    uint32_t      iocs[256];
> +    uint8_t       resv[2048];
> +} NvmeEffectsLog;
> +
> +enum {
> +    NVME_CMD_EFFECTS_CSUPP             = 1 << 0,
> +    NVME_CMD_EFFECTS_LBCC              = 1 << 1,
> +    NVME_CMD_EFFECTS_NCC               = 1 << 2,
> +    NVME_CMD_EFFECTS_NIC               = 1 << 3,
> +    NVME_CMD_EFFECTS_CCC               = 1 << 4,
> +    NVME_CMD_EFFECTS_CSE_MASK          = 3 << 16,
> +    NVME_CMD_EFFECTS_UUID_SEL          = 1 << 19,
> +};
> +
>  enum LogIdentifier {
>      NVME_LOG_ERROR_INFO     = 0x01,
>      NVME_LOG_SMART_INFO     = 0x02,
>      NVME_LOG_FW_SLOT_INFO   = 0x03,
> +    NVME_LOG_CMD_EFFECTS    = 0x05,
>  };
>  
>  typedef struct NvmePSD {
> @@ -898,5 +915,6 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
>  }
>  #endif
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions
  2020-06-17 21:34 ` [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
  2020-06-30  2:12   ` Alistair Francis
@ 2020-06-30  4:57   ` Klaus Jensen
  2020-06-30 16:04     ` Niklas Cassel
  1 sibling, 1 reply; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30  4:57 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Define the structures and constants required to implement
> Namespace Types support.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.h      |  3 ++
>  include/block/nvme.h | 75 +++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 73 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 4f0dac39ae..4fd155c409 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -63,6 +63,9 @@ typedef struct NvmeCQueue {
>  
>  typedef struct NvmeNamespace {
>      NvmeIdNs        id_ns;
> +    uint32_t        nsid;
> +    uint8_t         csi;
> +    QemuUUID        uuid;
>  } NvmeNamespace;
>  
>  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 6a58bac0c2..5a1e5e137c 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -50,6 +50,11 @@ enum NvmeCapMask {
>      CAP_PMR_MASK       = 0x1,
>  };
>  
> +enum NvmeCapCssBits {
> +    CAP_CSS_NVM        = 0x01,
> +    CAP_CSS_CSI_SUPP   = 0x40,
> +};
> +
>  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
>  #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)    & CAP_CQR_MASK)
>  #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)    & CAP_AMS_MASK)
> @@ -101,6 +106,12 @@ enum NvmeCcMask {
>      CC_IOCQES_MASK  = 0xf,
>  };
>  
> +enum NvmeCcCss {
> +    CSS_NVM_ONLY        = 0,
> +    CSS_ALL_NSTYPES     = 6,

Maybe we could call this CSS_CSI, since it just specifies that one or
more command sets are supported, not that ALL namespace types are
supported.
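
I.e.:

    enum NvmeCcCss {
        CSS_NVM_ONLY        = 0,
        CSS_CSI             = 6, /* one or more command sets via CSI */
        CSS_ADMIN_ONLY      = 7,
    };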

Otherwise,
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> +    CSS_ADMIN_ONLY      = 7,
> +};
> +
>  #define NVME_CC_EN(cc)     ((cc >> CC_EN_SHIFT)     & CC_EN_MASK)
>  #define NVME_CC_CSS(cc)    ((cc >> CC_CSS_SHIFT)    & CC_CSS_MASK)
>  #define NVME_CC_MPS(cc)    ((cc >> CC_MPS_SHIFT)    & CC_MPS_MASK)
> @@ -109,6 +120,21 @@ enum NvmeCcMask {
>  #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
>  #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
>  
> +#define NVME_SET_CC_EN(cc, val)     \
> +    (cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
> +#define NVME_SET_CC_CSS(cc, val)    \
> +    (cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
> +#define NVME_SET_CC_MPS(cc, val)    \
> +    (cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
> +#define NVME_SET_CC_AMS(cc, val)    \
> +    (cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
> +#define NVME_SET_CC_SHN(cc, val)    \
> +    (cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
> +#define NVME_SET_CC_IOSQES(cc, val) \
> +    (cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
> +#define NVME_SET_CC_IOCQES(cc, val) \
> +    (cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
> +
>  enum NvmeCstsShift {
>      CSTS_RDY_SHIFT      = 0,
>      CSTS_CFS_SHIFT      = 1,
> @@ -482,10 +508,41 @@ typedef struct NvmeIdentify {
>      uint64_t    rsvd2[2];
>      uint64_t    prp1;
>      uint64_t    prp2;
> -    uint32_t    cns;
> -    uint32_t    rsvd11[5];
> +    uint8_t     cns;
> +    uint8_t     rsvd4;
> +    uint16_t    ctrlid;
> +    uint16_t    nvmsetid;
> +    uint8_t     rsvd3;
> +    uint8_t     csi;
> +    uint32_t    rsvd12[4];
>  } NvmeIdentify;
>  
> +typedef struct NvmeNsIdDesc {
> +    uint8_t     nidt;
> +    uint8_t     nidl;
> +    uint16_t    rsvd2;
> +} NvmeNsIdDesc;
> +
> +enum NvmeNidType {
> +    NVME_NIDT_EUI64             = 0x01,
> +    NVME_NIDT_NGUID             = 0x02,
> +    NVME_NIDT_UUID              = 0x03,
> +    NVME_NIDT_CSI               = 0x04,
> +};
> +
> +enum NvmeNidLength {
> +    NVME_NIDL_EUI64             = 8,
> +    NVME_NIDL_NGUID             = 16,
> +    NVME_NIDL_UUID              = 16,
> +    NVME_NIDL_CSI               = 1,
> +};
> +
> +enum NvmeCsi {
> +    NVME_CSI_NVM                = 0x00,
> +};
> +
> +#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
> +
>  typedef struct NvmeRwCmd {
>      uint8_t     opcode;
>      uint8_t     flags;
> @@ -603,6 +660,7 @@ enum NvmeStatusCodes {
>      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
>      NVME_INVALID_NSID           = 0x000b,
>      NVME_CMD_SEQ_ERROR          = 0x000c,
> +    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
>      NVME_LBA_RANGE              = 0x0080,
>      NVME_CAP_EXCEEDED           = 0x0081,
>      NVME_NS_NOT_READY           = 0x0082,
> @@ -729,9 +787,14 @@ typedef struct NvmePSD {
>  #define NVME_IDENTIFY_DATA_SIZE 4096
>  
>  enum {
> -    NVME_ID_CNS_NS             = 0x0,
> -    NVME_ID_CNS_CTRL           = 0x1,
> -    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
> +    NVME_ID_CNS_NS                = 0x0,
> +    NVME_ID_CNS_CTRL              = 0x1,
> +    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x2,
> +    NVME_ID_CNS_NS_DESC_LIST      = 0x03,
> +    NVME_ID_CNS_CS_NS             = 0x05,
> +    NVME_ID_CNS_CS_CTRL           = 0x06,
> +    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
> +    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
>  };
>  
>  typedef struct NvmeIdCtrl {
> @@ -825,6 +888,7 @@ enum NvmeFeatureIds {
>      NVME_WRITE_ATOMICITY            = 0xa,
>      NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
>      NVME_TIMESTAMP                  = 0xe,
> +    NVME_COMMAND_SET_PROFILE        = 0x19,
>      NVME_SOFTWARE_PROGRESS_MARKER   = 0x80
>  };
>  
> @@ -914,6 +978,7 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeNsIdDesc) != 4);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
>  }
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions
  2020-06-30  2:12   ` Alistair Francis
@ 2020-06-30 10:02     ` Niklas Cassel
  2020-06-30 17:02       ` Keith Busch
  0 siblings, 1 reply; 49+ messages in thread
From: Niklas Cassel @ 2020-06-30 10:02 UTC (permalink / raw)
  To: Alistair Francis
  Cc: Kevin Wolf, Damien Le Moal, Qemu-block, Dmitry Fomichev,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Mon, Jun 29, 2020 at 07:12:47PM -0700, Alistair Francis wrote:
> On Wed, Jun 17, 2020 at 2:47 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
> >
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> >
> > Define the structures and constants required to implement
> > Namespace Types support.
> >
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme.h      |  3 ++
> >  include/block/nvme.h | 75 +++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 73 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 4f0dac39ae..4fd155c409 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -63,6 +63,9 @@ typedef struct NvmeCQueue {
> >
> >  typedef struct NvmeNamespace {
> >      NvmeIdNs        id_ns;
> > +    uint32_t        nsid;
> > +    uint8_t         csi;
> > +    QemuUUID        uuid;
> >  } NvmeNamespace;
> >
> >  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 6a58bac0c2..5a1e5e137c 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -50,6 +50,11 @@ enum NvmeCapMask {
> >      CAP_PMR_MASK       = 0x1,
> >  };
> >
> > +enum NvmeCapCssBits {
> > +    CAP_CSS_NVM        = 0x01,
> > +    CAP_CSS_CSI_SUPP   = 0x40,
> > +};
> > +
> >  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> >  #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)    & CAP_CQR_MASK)
> >  #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)    & CAP_AMS_MASK)
> > @@ -101,6 +106,12 @@ enum NvmeCcMask {
> >      CC_IOCQES_MASK  = 0xf,
> >  };
> >
> > +enum NvmeCcCss {
> > +    CSS_NVM_ONLY        = 0,
> > +    CSS_ALL_NSTYPES     = 6,
> > +    CSS_ADMIN_ONLY      = 7,
> > +};
> > +
> >  #define NVME_CC_EN(cc)     ((cc >> CC_EN_SHIFT)     & CC_EN_MASK)
> >  #define NVME_CC_CSS(cc)    ((cc >> CC_CSS_SHIFT)    & CC_CSS_MASK)
> >  #define NVME_CC_MPS(cc)    ((cc >> CC_MPS_SHIFT)    & CC_MPS_MASK)
> > @@ -109,6 +120,21 @@ enum NvmeCcMask {
> >  #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
> >  #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
> >
> > +#define NVME_SET_CC_EN(cc, val)     \
> > +    (cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
> > +#define NVME_SET_CC_CSS(cc, val)    \
> > +    (cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
> > +#define NVME_SET_CC_MPS(cc, val)    \
> > +    (cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
> > +#define NVME_SET_CC_AMS(cc, val)    \
> > +    (cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
> > +#define NVME_SET_CC_SHN(cc, val)    \
> > +    (cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
> > +#define NVME_SET_CC_IOSQES(cc, val) \
> > +    (cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
> > +#define NVME_SET_CC_IOCQES(cc, val) \
> > +    (cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
> > +
> >  enum NvmeCstsShift {
> >      CSTS_RDY_SHIFT      = 0,
> >      CSTS_CFS_SHIFT      = 1,
> > @@ -482,10 +508,41 @@ typedef struct NvmeIdentify {
> >      uint64_t    rsvd2[2];
> >      uint64_t    prp1;
> >      uint64_t    prp2;
> > -    uint32_t    cns;
> > -    uint32_t    rsvd11[5];
> > +    uint8_t     cns;
> > +    uint8_t     rsvd4;
> > +    uint16_t    ctrlid;
> 
> Shouldn't this be CNTID?

From the NVMe spec:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

Figure 241:
Controller Identifier (CNTID)

So you are correct, this is the official abbreviation.

I guess that I wanted to keep it in sync with Linux:
https://github.com/torvalds/linux/blob/master/include/linux/nvme.h#L974

Which uses ctrlid.


Looking further at the NVMe spec:
In Figure 247 (Identify Controller Data Structure) they use different
names for these fields:

Controller ID (CNTLID)
Controller Attributes (CTRATT)

I can understand if they want to have unique names for fields, but it
seems like they have trouble deciding how to abbreviate controller :)

Personally, I think that ctrlid makes it more obvious that we are talking
about a controller and not a count. But I'm fine regardless.


Kind regards,
Niklas

> 
> Alistair
> 
> > +    uint16_t    nvmsetid;
> > +    uint8_t     rsvd3;
> > +    uint8_t     csi;
> > +    uint32_t    rsvd12[4];
> >  } NvmeIdentify;
> >
> > +typedef struct NvmeNsIdDesc {
> > +    uint8_t     nidt;
> > +    uint8_t     nidl;
> > +    uint16_t    rsvd2;
> > +} NvmeNsIdDesc;
> > +
> > +enum NvmeNidType {
> > +    NVME_NIDT_EUI64             = 0x01,
> > +    NVME_NIDT_NGUID             = 0x02,
> > +    NVME_NIDT_UUID              = 0x03,
> > +    NVME_NIDT_CSI               = 0x04,
> > +};
> > +
> > +enum NvmeNidLength {
> > +    NVME_NIDL_EUI64             = 8,
> > +    NVME_NIDL_NGUID             = 16,
> > +    NVME_NIDL_UUID              = 16,
> > +    NVME_NIDL_CSI               = 1,
> > +};
> > +
> > +enum NvmeCsi {
> > +    NVME_CSI_NVM                = 0x00,
> > +};
> > +
> > +#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
> > +
> >  typedef struct NvmeRwCmd {
> >      uint8_t     opcode;
> >      uint8_t     flags;
> > @@ -603,6 +660,7 @@ enum NvmeStatusCodes {
> >      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
> >      NVME_INVALID_NSID           = 0x000b,
> >      NVME_CMD_SEQ_ERROR          = 0x000c,
> > +    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
> >      NVME_LBA_RANGE              = 0x0080,
> >      NVME_CAP_EXCEEDED           = 0x0081,
> >      NVME_NS_NOT_READY           = 0x0082,
> > @@ -729,9 +787,14 @@ typedef struct NvmePSD {
> >  #define NVME_IDENTIFY_DATA_SIZE 4096
> >
> >  enum {
> > -    NVME_ID_CNS_NS             = 0x0,
> > -    NVME_ID_CNS_CTRL           = 0x1,
> > -    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
> > +    NVME_ID_CNS_NS                = 0x0,
> > +    NVME_ID_CNS_CTRL              = 0x1,
> > +    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x2,
> > +    NVME_ID_CNS_NS_DESC_LIST      = 0x03,
> > +    NVME_ID_CNS_CS_NS             = 0x05,
> > +    NVME_ID_CNS_CS_CTRL           = 0x06,
> > +    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
> > +    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
> >  };
> >
> >  typedef struct NvmeIdCtrl {
> > @@ -825,6 +888,7 @@ enum NvmeFeatureIds {
> >      NVME_WRITE_ATOMICITY            = 0xa,
> >      NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
> >      NVME_TIMESTAMP                  = 0xe,
> > +    NVME_COMMAND_SET_PROFILE        = 0x19,
> >      NVME_SOFTWARE_PROGRESS_MARKER   = 0x80
> >  };
> >
> > @@ -914,6 +978,7 @@ static inline void _nvme_check_size(void)
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
> > +    QEMU_BUILD_BUG_ON(sizeof(NvmeNsIdDesc) != 4);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
> >  }
> > --
> > 2.21.0
> >
> >

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 06/18] hw/block/nvme: Define trace events related to NS Types
  2020-06-17 21:34 ` [PATCH v2 06/18] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
@ 2020-06-30 10:20   ` Klaus Jensen
  2020-06-30 20:18   ` Alistair Francis
  1 sibling, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30 10:20 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> A few trace events are defined that are relevant to implementing
> Namespace Types (NVMe TP 4056).
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> ---
>  hw/block/trace-events | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 423d491e27..3f3323fe38 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -39,8 +39,13 @@ pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size,
>  pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
>  pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, cqid=%"PRIu16""
>  pci_nvme_identify_ctrl(void) "identify controller"
> +pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
>  pci_nvme_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
> +pci_nvme_identify_ns_csi(uint16_t ns, uint8_t csi) "identify namespace, nsid=%"PRIu16", csi=0x%"PRIx8""
>  pci_nvme_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
> +pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "identify namespace list, nsid=%"PRIu16", csi=0x%"PRIx8""
> +pci_nvme_list_ns_descriptors(void) "identify namespace descriptors"
> +pci_nvme_identify_cmd_set(void) "identify i/o command set"
>  pci_nvme_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
>  pci_nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
>  pci_nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> @@ -59,6 +64,8 @@ pci_nvme_mmio_stopped(void) "cleared controller enable bit"
>  pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
>  pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
>  pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
> +pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
> +pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
>  
>  # nvme traces for error conditions
>  pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> @@ -72,6 +79,9 @@ pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
>  pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
>  pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and effects log size is 4096, got %"PRIu32""
> +pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
> +pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
> +pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
>  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
> @@ -127,6 +137,7 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion qu
>  pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
>  pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
>  pci_nvme_unsupported_log_page(uint16_t lid) "unsupported log page 0x%"PRIx16""
> +pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
>  
>  # xen-block.c
>  xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 07/18] hw/block/nvme: Add support for Namespace Types
  2020-06-17 21:34 ` [PATCH v2 07/18] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
@ 2020-06-30 11:31   ` Klaus Jensen
  0 siblings, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30 11:31 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> From: Niklas Cassel <niklas.cassel@wdc.com>
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set, and enable
> detection for the NVM Command Set.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.c | 210 ++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/block/nvme.h |  11 +++
>  2 files changed, 216 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 03b8deee85..453f4747a5 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -686,6 +686,26 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
>          prp1, prp2);
>  }
>  
> +static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeIdentify *c)
> +{
> +    uint64_t prp1 = le64_to_cpu(c->prp1);
> +    uint64_t prp2 = le64_to_cpu(c->prp2);
> +    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> +    uint32_t *list;
> +    uint16_t ret;
> +
> +    trace_pci_nvme_identify_ctrl_csi(c->csi);
> +
> +    if (c->csi == NVME_CSI_NVM) {
> +        list = g_malloc0(data_len);
> +        ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
> +        g_free(list);
> +        return ret;
> +    } else {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +}
> +
>  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
>  {
>      NvmeNamespace *ns;
> @@ -701,11 +721,42 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
>      }
>  
>      ns = &n->namespaces[nsid - 1];
> +    assert(nsid == ns->nsid);
>  
>      return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
>          prp1, prp2);
>  }
>  
> +static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeIdentify *c)
> +{
> +    NvmeNamespace *ns;
> +    uint32_t nsid = le32_to_cpu(c->nsid);
> +    uint64_t prp1 = le64_to_cpu(c->prp1);
> +    uint64_t prp2 = le64_to_cpu(c->prp2);
> +    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> +    uint32_t *list;
> +    uint16_t ret;
> +
> +    trace_pci_nvme_identify_ns_csi(nsid, c->csi);
> +
> +    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> +        trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
> +        return NVME_INVALID_NSID | NVME_DNR;
> +    }
> +
> +    ns = &n->namespaces[nsid - 1];
> +    assert(nsid == ns->nsid);
> +
> +    if (c->csi == NVME_CSI_NVM) {
> +        list = g_malloc0(data_len);
> +        ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
> +        g_free(list);
> +        return ret;
> +    } else {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +}
> +
>  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
>  {
>      static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> @@ -733,6 +784,99 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
>      return ret;
>  }
>  
> +static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeIdentify *c)
> +{
> +    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> +    uint32_t min_nsid = le32_to_cpu(c->nsid);
> +    uint64_t prp1 = le64_to_cpu(c->prp1);
> +    uint64_t prp2 = le64_to_cpu(c->prp2);
> +    uint32_t *list;
> +    uint16_t ret;
> +    int i, j = 0;
> +
> +    trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
> +
> +    if (c->csi != NVME_CSI_NVM) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    list = g_malloc0(data_len);
> +    for (i = 0; i < n->num_namespaces; i++) {
> +        if (i < min_nsid) {
> +            continue;
> +        }
> +        list[j++] = cpu_to_le32(i + 1);
> +        if (j == data_len / sizeof(uint32_t)) {
> +            break;
> +        }
> +    }
> +    ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
> +    g_free(list);
> +    return ret;
> +}
> +
> +static uint16_t nvme_list_ns_descriptors(NvmeCtrl *n, NvmeIdentify *c)
> +{
> +    NvmeNamespace *ns;
> +    uint32_t nsid = le32_to_cpu(c->nsid);
> +    uint64_t prp1 = le64_to_cpu(c->prp1);
> +    uint64_t prp2 = le64_to_cpu(c->prp2);
> +    void *buf_ptr;
> +    NvmeNsIdDesc *desc;
> +    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> +    uint8_t *buf;
> +    uint16_t status;
> +
> +    trace_pci_nvme_list_ns_descriptors();
> +
> +    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> +        trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
> +        return NVME_INVALID_NSID | NVME_DNR;
> +    }
> +
> +    ns = &n->namespaces[nsid - 1];
> +    assert(nsid == ns->nsid);
> +
> +    buf = g_malloc0(data_len);
> +    buf_ptr = buf;
> +
> +    desc = buf_ptr;
> +    desc->nidt = NVME_NIDT_UUID;
> +    desc->nidl = NVME_NIDL_UUID;
> +    buf_ptr += sizeof(*desc);
> +    memcpy(buf_ptr, ns->uuid.data, NVME_NIDL_UUID);
> +    buf_ptr += NVME_NIDL_UUID;
> +
> +    desc = buf_ptr;
> +    desc->nidt = NVME_NIDT_CSI;
> +    desc->nidl = NVME_NIDL_CSI;
> +    buf_ptr += sizeof(*desc);
> +    *(uint8_t *)buf_ptr = NVME_CSI_NVM;
> +
> +    status = nvme_dma_read_prp(n, buf, data_len, prp1, prp2);
> +    g_free(buf);
> +    return status;
> +}
> +
> +static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeIdentify *c)
> +{
> +    uint64_t prp1 = le64_to_cpu(c->prp1);
> +    uint64_t prp2 = le64_to_cpu(c->prp2);
> +    static const int data_len = NVME_IDENTIFY_DATA_SIZE;
> +    uint32_t *list;
> +    uint8_t *ptr;
> +    uint16_t status;
> +
> +    trace_pci_nvme_identify_cmd_set();
> +
> +    list = g_malloc0(data_len);
> +    ptr = (uint8_t *)list;
> +    NVME_SET_CSI(*ptr, NVME_CSI_NVM);
> +    status = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
> +    g_free(list);
> +    return status;
> +}
> +
>  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)cmd;
> @@ -740,10 +884,20 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>      switch (le32_to_cpu(c->cns)) {
>      case NVME_ID_CNS_NS:
>          return nvme_identify_ns(n, c);
> +    case NVME_ID_CNS_CS_NS:
> +        return nvme_identify_ns_csi(n, c);
>      case NVME_ID_CNS_CTRL:
>          return nvme_identify_ctrl(n, c);
> +    case NVME_ID_CNS_CS_CTRL:
> +        return nvme_identify_ctrl_csi(n, c);
>      case NVME_ID_CNS_NS_ACTIVE_LIST:
>          return nvme_identify_nslist(n, c);
> +    case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
> +        return nvme_identify_nslist_csi(n, c);
> +    case NVME_ID_CNS_NS_DESC_LIST:
> +        return nvme_list_ns_descriptors(n, c);
> +    case NVME_ID_CNS_IO_COMMAND_SET:
> +        return nvme_identify_cmd_set(n, c);
>      default:
>          trace_pci_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -818,6 +972,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_get_feature_timestamp(n, cmd);
> +    case NVME_COMMAND_SET_PROFILE:
> +        result = 0;
> +        break;
>      default:
>          trace_pci_nvme_err_invalid_getfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -864,6 +1021,15 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_set_feature_timestamp(n, cmd);
> +        break;
> +
> +    case NVME_COMMAND_SET_PROFILE:
> +        if (dw11 & 0x1ff) {
> +            trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
> +            return NVME_CMD_SET_CMB_REJECTED | NVME_DNR;
> +        }
> +        break;
> +
>      default:
>          trace_pci_nvme_err_invalid_setfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1149,6 +1315,29 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>          break;
>      case 0x14:  /* CC */
>          trace_pci_nvme_mmio_cfg(data & 0xffffffff);
> +
> +        if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
> +            if (NVME_CC_EN(n->bar.cc)) {
> +                NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
> +                               "changing selected command set when enabled");
> +                break;
> +            }
> +            switch (NVME_CC_CSS(data)) {
> +            case CSS_NVM_ONLY:
> +                trace_pci_nvme_css_nvm_cset_selected_by_host(data & 0xffffffff);
> +                break;
> +            case CSS_ALL_NSTYPES:
> +                NVME_SET_CC_CSS(n->bar.cc, CSS_ALL_NSTYPES);
> +                trace_pci_nvme_css_all_csets_sel_by_host(data & 0xffffffff);
> +                break;
> +            case CSS_ADMIN_ONLY:
> +                break;
> +            default:
> +                NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
> +                               "unknown value in CC.CSS field");
> +            }
> +        }
> +
>          /* Windows first sends data, then sends enable bit */
>          if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
>              !NVME_CC_SHN(data) && !NVME_CC_SHN(n->bar.cc))
> @@ -1496,6 +1685,7 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>  {
>      int64_t bs_size;
>      NvmeIdNs *id_ns = &ns->id_ns;
> +    int lba_index;
>  
>      bs_size = blk_getlength(n->conf.blk);
>      if (bs_size < 0) {
> @@ -1505,7 +1695,10 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>  
>      n->ns_size = bs_size;
>  
> -    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> +    ns->csi = NVME_CSI_NVM;
> +    qemu_uuid_generate(&ns->uuid); /* TODO make UUIDs persistent */
> +    lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);

There is only one LBA format at this point anyway, so I don't think this
is needed.

> +    id_ns->lbaf[lba_index].ds = nvme_ilog2(n->conf.logical_block_size);

Would be nice to have this in a separate patch.
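
For context, LBAF.ds is the base-2 logarithm of the LBA data size, so
computing it as ilog2 of the logical block size is exact whenever the
block size is a power of two. A standalone sketch (illustrative only,
mirroring the nvme_ilog2 helper added further down in this patch):

#include <assert.h>
#include <stdint.h>

/* Base-2 logarithm by repeated shifting; returns -1 for input 0. */
static int ilog2_u64(uint64_t v)
{
    int log = -1;

    while (v) {
        v >>= 1;
        log++;
    }
    return log;
}

int main(void)
{
    assert(ilog2_u64(512) == 9);    /* 512 B blocks -> ds = 9  */
    assert(ilog2_u64(4096) == 12);  /* 4 KiB blocks -> ds = 12 */
    return 0;
}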

>      id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
>  
>      /* no thin provisioning */
> @@ -1616,7 +1809,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
>      id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
>      strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
> -    strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
> +    strpadcpy((char *)id->fr, sizeof(id->fr), "2.0", ' ');

Out of curiosity, any specific reason for bumping the firmware revision?

>      strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
>      id->rab = 6;
>      id->ieee[0] = 0x00;
> @@ -1640,7 +1833,11 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
>      NVME_CAP_SET_CQR(n->bar.cap, 1);
>      NVME_CAP_SET_TO(n->bar.cap, 0xf);
> -    NVME_CAP_SET_CSS(n->bar.cap, 1);
> +    /*
> +     * The driver now always supports NS Types, but all commands that

s/driver/device

> +     * support CSI field will only handle NVM Command Set.
> +     */
> +    NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
>      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
>  
>      n->bar.vs = 0x00010200;
> @@ -1650,6 +1847,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>  static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>  {
>      NvmeCtrl *n = NVME(pci_dev);
> +    NvmeNamespace *ns;
>      Error *local_err = NULL;
>  
>      int i;
> @@ -1675,8 +1873,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>  
>      nvme_init_ctrl(n, pci_dev);
>  
> -    for (i = 0; i < n->num_namespaces; i++) {
> -        nvme_init_namespace(n, &n->namespaces[i], &local_err);
> +    ns = n->namespaces;
> +    for (i = 0; i < n->num_namespaces; i++, ns++) {
> +        ns->nsid = i + 1;

n->num_namespaces is hardcoded to 1, so no real need for this change,
but cleanup is always nice I guess :)

> +        nvme_init_namespace(n, ns, &local_err);
>          if (local_err) {
>              error_propagate(errp, local_err);
>              return;
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 4fd155c409..0d29f75475 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -121,4 +121,15 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
>      return n->ns_size >> nvme_ns_lbads(ns);
>  }
>  
> +static inline int nvme_ilog2(uint64_t i)
> +{
> +    int log = -1;
> +
> +    while (i) {
> +        i >>= 1;
> +        log++;
> +    }
> +    return log;
> +}
> +
>  #endif /* HW_NVME_H */
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions
  2020-06-17 21:34 ` [PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
@ 2020-06-30 11:44   ` Klaus Jensen
  2020-06-30 12:08     ` Klaus Jensen
  2020-06-30 22:11   ` Alistair Francis
  1 sibling, 1 reply; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30 11:44 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Maxim Levitsky, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> Define values and structures that are needed to support Zoned
> Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
> 
> All new protocol definitions are located in include/block/nvme.h
> and everything added that is specific to this implementation is kept
> in hw/block/nvme.h.
> 
> In order to improve scalability, all open, closed and full zones
> are organized in separate linked lists. Consequently, almost all
> zone operations don't require scanning of the entire zone array
> (which potentially can be quite large) - it is only necessary to
> enumerate one or more zone lists. Zone lists are designed to be
> position-independent as they can be persisted to the backing file
> as a part of zone metadata. NvmeZoneList struct defined in this patch
> serves as a head of every zone list.
> 
> NvmeZone structure encapsulates NvmeZoneDescriptor defined in Zoned
> Command Set specification and adds a few more fields that are
> internal to this implementation.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.h      | 130 +++++++++++++++++++++++++++++++++++++++++++
>  include/block/nvme.h | 119 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 248 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 0d29f75475..2c932b5e29 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -3,12 +3,22 @@
>  
>  #include "block/nvme.h"
>  
> +#define NVME_DEFAULT_ZONE_SIZE   128 /* MiB */
> +#define NVME_DEFAULT_MAX_ZA_SIZE 128 /* KiB */
> +
>  typedef struct NvmeParams {
>      char     *serial;
>      uint32_t num_queues; /* deprecated since 5.1 */
>      uint32_t max_ioqpairs;
>      uint16_t msix_qsize;
>      uint32_t cmb_size_mb;
> +
> +    bool        zoned;
> +    bool        cross_zone_read;
> +    uint8_t     fill_pattern;
> +    uint32_t    zamds_bs;

Rename to zasl.

> +    uint64_t    zone_size;
> +    uint64_t    zone_capacity;
>  } NvmeParams;
>  
>  typedef struct NvmeAsyncEvent {
> @@ -17,6 +27,8 @@ typedef struct NvmeAsyncEvent {
>  
>  enum NvmeRequestFlags {
>      NVME_REQ_FLG_HAS_SG   = 1 << 0,
> +    NVME_REQ_FLG_FILL     = 1 << 1,
> +    NVME_REQ_FLG_APPEND   = 1 << 2,
>  };
>  
>  typedef struct NvmeRequest {
> @@ -24,6 +36,7 @@ typedef struct NvmeRequest {
>      BlockAIOCB              *aiocb;
>      uint16_t                status;
>      uint16_t                flags;
> +    uint64_t                fill_ofs;
>      NvmeCqe                 cqe;
>      BlockAcctCookie         acct;
>      QEMUSGList              qsg;
> @@ -61,11 +74,35 @@ typedef struct NvmeCQueue {
>      QTAILQ_HEAD(, NvmeRequest) req_list;
>  } NvmeCQueue;
>  
> +typedef struct NvmeZone {
> +    NvmeZoneDescr   d;
> +    uint64_t        tstamp;
> +    uint32_t        next;
> +    uint32_t        prev;
> +    uint8_t         rsvd80[8];
> +} NvmeZone;
> +
> +#define NVME_ZONE_LIST_NIL    UINT_MAX
> +
> +typedef struct NvmeZoneList {
> +    uint32_t        head;
> +    uint32_t        tail;
> +    uint32_t        size;
> +    uint8_t         rsvd12[4];
> +} NvmeZoneList;
> +
>  typedef struct NvmeNamespace {
>      NvmeIdNs        id_ns;
>      uint32_t        nsid;
>      uint8_t         csi;
>      QemuUUID        uuid;
> +
> +    NvmeIdNsZoned   *id_ns_zoned;
> +    NvmeZone        *zone_array;
> +    NvmeZoneList    *exp_open_zones;
> +    NvmeZoneList    *imp_open_zones;
> +    NvmeZoneList    *closed_zones;
> +    NvmeZoneList    *full_zones;
>  } NvmeNamespace;
>  
>  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> @@ -100,6 +137,7 @@ typedef struct NvmeCtrl {
>      uint32_t    num_namespaces;
>      uint32_t    max_q_ents;
>      uint64_t    ns_size;
> +
>      uint8_t     *cmbuf;
>      uint32_t    irq_status;
>      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
> @@ -107,6 +145,12 @@ typedef struct NvmeCtrl {
>  
>      HostMemoryBackend *pmrdev;
>  
> +    int             zone_file_fd;
> +    uint32_t        num_zones;
> +    uint64_t        zone_size_bs;
> +    uint64_t        zone_array_size;
> +    uint8_t         zamds;

Rename to zasl.

> +
>      NvmeNamespace   *namespaces;
>      NvmeSQueue      **sq;
>      NvmeCQueue      **cq;
> @@ -121,6 +165,86 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
>      return n->ns_size >> nvme_ns_lbads(ns);
>  }
>  
> +static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
> +{
> +    return zone->d.zs >> 4;
> +}
> +
> +static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState state)
> +{
> +    zone->d.zs = state << 4;
> +}
> +
> +static inline uint64_t nvme_zone_rd_boundary(NvmeCtrl *n, NvmeZone *zone)
> +{
> +    return zone->d.zslba + n->params.zone_size;
> +}
> +
> +static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
> +{
> +    return zone->d.zslba + zone->d.zcap;
> +}

Everything working on zone->d needs leXX_to_cpu() conversions.

> +
> +static inline bool nvme_wp_is_valid(NvmeZone *zone)
> +{
> +    uint8_t st = nvme_get_zone_state(zone);
> +
> +    return st != NVME_ZONE_STATE_FULL &&
> +           st != NVME_ZONE_STATE_READ_ONLY &&
> +           st != NVME_ZONE_STATE_OFFLINE;
> +}
> +
> +/*
> + * Initialize a zone list head.
> + */
> +static inline void nvme_init_zone_list(NvmeZoneList *zl)
> +{
> +    zl->head = NVME_ZONE_LIST_NIL;
> +    zl->tail = NVME_ZONE_LIST_NIL;
> +    zl->size = 0;
> +}
> +
> +/*
> + * Initialize the number of entries contained in a zone list.
> + */
> +static inline uint32_t nvme_zone_list_size(NvmeZoneList *zl)
> +{
> +    return zl->size;
> +}
> +
> +/*
> + * Check if the zone is not currently included into any zone list.
> + */
> +static inline bool nvme_zone_not_in_list(NvmeZone *zone)
> +{
> +    return (bool)(zone->prev == 0 && zone->next == 0);
> +}
> +
> +/*
> + * Return the zone at the head of zone list or NULL if the list is empty.
> + */
> +static inline NvmeZone *nvme_peek_zone_head(NvmeNamespace *ns, NvmeZoneList *zl)
> +{
> +    if (zl->head == NVME_ZONE_LIST_NIL) {
> +        return NULL;
> +    }
> +    return &ns->zone_array[zl->head];
> +}
> +
> +/*
> + * Return the next zone in the list.
> + */
> +static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
> +    NvmeZoneList *zl)
> +{
> +    assert(!nvme_zone_not_in_list(z));
> +
> +    if (z->next == NVME_ZONE_LIST_NIL) {
> +        return NULL;
> +    }
> +    return &ns->zone_array[z->next];
> +}
> +
>  static inline int nvme_ilog2(uint64_t i)
>  {
>      int log = -1;
> @@ -132,4 +256,10 @@ static inline int nvme_ilog2(uint64_t i)
>      return log;
>  }
>  
> +static inline void _hw_nvme_check_size(void)
> +{
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneList) != 16);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeZone) != 88);
> +}
> +
>  #endif /* HW_NVME_H */
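
For reference, a standalone sketch (illustrative, not the patch code) of
traversing an index-linked list like NvmeZoneList. Because the links are
array indices rather than pointers, the lists stay valid when the zone
array is persisted to a backing file and mapped at a different address,
which is the position-independence the commit message describes:

#include <inttypes.h>
#include <limits.h>
#include <stdio.h>

#define LIST_NIL UINT_MAX

struct zone {
    uint64_t zslba;
    uint32_t next;
    uint32_t prev;
};

/* Walk the list from its head index until the NIL sentinel. */
static void walk(const struct zone *array, uint32_t head)
{
    for (uint32_t idx = head; idx != LIST_NIL; idx = array[idx].next) {
        printf("zone at slba %" PRIu64 "\n", array[idx].zslba);
    }
}

int main(void)
{
    struct zone zones[3] = {
        { .zslba = 0,   .next = 1,        .prev = LIST_NIL },
        { .zslba = 256, .next = 2,        .prev = 0 },
        { .zslba = 512, .next = LIST_NIL, .prev = 1 },
    };

    walk(zones, 0);
    return 0;
}
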
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 5a1e5e137c..596c39162b 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -446,6 +446,9 @@ enum NvmeIoCommands {
>      NVME_CMD_COMPARE            = 0x05,
>      NVME_CMD_WRITE_ZEROS        = 0x08,
>      NVME_CMD_DSM                = 0x09,
> +    NVME_CMD_ZONE_MGMT_SEND     = 0x79,
> +    NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
> +    NVME_CMD_ZONE_APND          = 0x7d,
>  };
>  
>  typedef struct NvmeDeleteQ {
> @@ -539,6 +542,7 @@ enum NvmeNidLength {
>  
>  enum NvmeCsi {
>      NVME_CSI_NVM                = 0x00,
> +    NVME_CSI_ZONED              = 0x02,
>  };
>  
>  #define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
> @@ -661,6 +665,7 @@ enum NvmeStatusCodes {
>      NVME_INVALID_NSID           = 0x000b,
>      NVME_CMD_SEQ_ERROR          = 0x000c,
>      NVME_CMD_SET_CMB_REJECTED   = 0x002b,
> +    NVME_INVALID_CMD_SET        = 0x002c,
>      NVME_LBA_RANGE              = 0x0080,
>      NVME_CAP_EXCEEDED           = 0x0081,
>      NVME_NS_NOT_READY           = 0x0082,
> @@ -684,6 +689,14 @@ enum NvmeStatusCodes {
>      NVME_CONFLICTING_ATTRS      = 0x0180,
>      NVME_INVALID_PROT_INFO      = 0x0181,
>      NVME_WRITE_TO_RO            = 0x0182,
> +    NVME_ZONE_BOUNDARY_ERROR    = 0x01b8,
> +    NVME_ZONE_FULL              = 0x01b9,
> +    NVME_ZONE_READ_ONLY         = 0x01ba,
> +    NVME_ZONE_OFFLINE           = 0x01bb,
> +    NVME_ZONE_INVALID_WRITE     = 0x01bc,
> +    NVME_ZONE_TOO_MANY_ACTIVE   = 0x01bd,
> +    NVME_ZONE_TOO_MANY_OPEN     = 0x01be,
> +    NVME_ZONE_INVAL_TRANSITION  = 0x01bf,
>      NVME_WRITE_FAULT            = 0x0280,
>      NVME_UNRECOVERED_READ       = 0x0281,
>      NVME_E2E_GUARD_ERROR        = 0x0282,
> @@ -807,7 +820,17 @@ typedef struct NvmeIdCtrl {
>      uint8_t     ieee[3];
>      uint8_t     cmic;
>      uint8_t     mdts;
> -    uint8_t     rsvd255[178];
> +    uint16_t    cntlid;
> +    uint32_t    ver;
> +    uint32_t    rtd3r;
> +    uint32_t    rtd3e;
> +    uint32_t    oaes;
> +    uint32_t    ctratt;
> +    uint8_t     rsvd100[28];
> +    uint16_t    crdt1;
> +    uint16_t    crdt2;
> +    uint16_t    crdt3;
> +    uint8_t     rsvd134[122];

Would be nice in a separate patch, see my "bump to ..." patches.

>      uint16_t    oacs;
>      uint8_t     acl;
>      uint8_t     aerl;
> @@ -832,6 +855,11 @@ typedef struct NvmeIdCtrl {
>      uint8_t     vs[1024];
>  } NvmeIdCtrl;
>  
> +typedef struct NvmeIdCtrlZoned {
> +    uint8_t     zamds;

zasl.

> +    uint8_t     rsvd1[4095];
> +} NvmeIdCtrlZoned;
> +
>  enum NvmeIdCtrlOacs {
>      NVME_OACS_SECURITY  = 1 << 0,
>      NVME_OACS_FORMAT    = 1 << 1,
> @@ -908,6 +936,12 @@ typedef struct NvmeLBAF {
>      uint8_t     rp;
>  } NvmeLBAF;
>  
> +typedef struct NvmeLBAFE {
> +    uint64_t    zsze;
> +    uint8_t     zdes;
> +    uint8_t     rsvd9[7];
> +} NvmeLBAFE;
> +
>  typedef struct NvmeIdNs {
>      uint64_t    nsze;
>      uint64_t    ncap;
> @@ -930,6 +964,19 @@ typedef struct NvmeIdNs {
>      uint8_t     vs[3712];
>  } NvmeIdNs;
>  
> +typedef struct NvmeIdNsZoned {
> +    uint16_t    zoc;
> +    uint16_t    ozcs;
> +    uint32_t    mar;
> +    uint32_t    mor;
> +    uint32_t    rrl;
> +    uint32_t    frl;
> +    uint8_t     rsvd20[2796];
> +    NvmeLBAFE   lbafe[16];
> +    uint8_t     rsvd3072[768];
> +    uint8_t     vs[256];
> +} NvmeIdNsZoned;
> +
>  
>  /*Deallocate Logical Block Features*/
>  #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
> @@ -962,6 +1009,71 @@ enum NvmeIdNsDps {
>      DPS_FIRST_EIGHT = 8,
>  };
>  
> +enum NvmeZoneAttr {
> +    NVME_ZA_FINISHED_BY_CTLR         = 1 << 0,
> +    NVME_ZA_FINISH_RECOMMENDED       = 1 << 1,
> +    NVME_ZA_RESET_RECOMMENDED        = 1 << 2,
> +    NVME_ZA_ZD_EXT_VALID             = 1 << 7,
> +};
> +
> +typedef struct NvmeZoneReportHeader {
> +    uint64_t    nr_zones;
> +    uint8_t     rsvd[56];
> +} NvmeZoneReportHeader;
> +
> +enum NvmeZoneReceiveAction {
> +    NVME_ZONE_REPORT                 = 0,
> +    NVME_ZONE_REPORT_EXTENDED        = 1,
> +};
> +
> +enum NvmeZoneReportType {
> +    NVME_ZONE_REPORT_ALL             = 0,
> +    NVME_ZONE_REPORT_EMPTY           = 1,
> +    NVME_ZONE_REPORT_IMPLICITLY_OPEN = 2,
> +    NVME_ZONE_REPORT_EXPLICITLY_OPEN = 3,
> +    NVME_ZONE_REPORT_CLOSED          = 4,
> +    NVME_ZONE_REPORT_FULL            = 5,
> +    NVME_ZONE_REPORT_READ_ONLY       = 6,
> +    NVME_ZONE_REPORT_OFFLINE         = 7,
> +};
> +
> +typedef struct NvmeZoneDescr {
> +    uint8_t     zt;
> +    uint8_t     zs;
> +    uint8_t     za;
> +    uint8_t     rsvd3[5];
> +    uint64_t    zcap;
> +    uint64_t    zslba;
> +    uint64_t    wp;
> +    uint8_t     rsvd32[32];
> +} NvmeZoneDescr;
> +
> +enum NvmeZoneState {
> +    NVME_ZONE_STATE_RESERVED         = 0x00,
> +    NVME_ZONE_STATE_EMPTY            = 0x01,
> +    NVME_ZONE_STATE_IMPLICITLY_OPEN  = 0x02,
> +    NVME_ZONE_STATE_EXPLICITLY_OPEN  = 0x03,
> +    NVME_ZONE_STATE_CLOSED           = 0x04,
> +    NVME_ZONE_STATE_READ_ONLY        = 0x0D,
> +    NVME_ZONE_STATE_FULL             = 0x0E,
> +    NVME_ZONE_STATE_OFFLINE          = 0x0F,
> +};
> +
> +enum NvmeZoneType {
> +    NVME_ZONE_TYPE_RESERVED          = 0x00,
> +    NVME_ZONE_TYPE_SEQ_WRITE         = 0x02,
> +};
> +
> +enum NvmeZoneSendAction {
> +    NVME_ZONE_ACTION_RSD             = 0x00,
> +    NVME_ZONE_ACTION_CLOSE           = 0x01,
> +    NVME_ZONE_ACTION_FINISH          = 0x02,
> +    NVME_ZONE_ACTION_OPEN            = 0x03,
> +    NVME_ZONE_ACTION_RESET           = 0x04,
> +    NVME_ZONE_ACTION_OFFLINE         = 0x05,
> +    NVME_ZONE_ACTION_SET_ZD_EXT      = 0x10,
> +};
> +
>  static inline void _nvme_check_size(void)
>  {
>      QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
> @@ -978,8 +1090,13 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrlZoned) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeNsIdDesc) != 4);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAF) != 4);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAFE) != 16);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZoned) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
>  }
>  #endif
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions
  2020-06-30 11:44   ` Klaus Jensen
@ 2020-06-30 12:08     ` Klaus Jensen
  0 siblings, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30 12:08 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Matias Bjorling

On Jun 30 13:44, Klaus Jensen wrote:
> On Jun 18 06:34, Dmitry Fomichev wrote:
> > Define values and structures that are needed to support Zoned
> > Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
> > 
> > All new protocol definitions are located in include/block/nvme.h
> > and everything added that is specific to this implementation is kept
> > in hw/block/nvme.h.
> > 
> > In order to improve scalability, all open, closed and full zones
> > are organized in separate linked lists. Consequently, almost all
> > zone operations don't require scanning of the entire zone array
> > (which potentially can be quite large) - it is only necessary to
> > enumerate one or more zone lists. Zone lists are designed to be
> > position-independent as they can be persisted to the backing file
> > as a part of zone metadata. NvmeZoneList struct defined in this patch
> > serves as a head of every zone list.
> > 
> > NvmeZone structure encapsulates NvmeZoneDescriptor defined in Zoned
> > Command Set specification and adds a few more fields that are
> > internal to this implementation.
> > 
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> > Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> > Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme.h      | 130 +++++++++++++++++++++++++++++++++++++++++++
> >  include/block/nvme.h | 119 ++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 248 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 0d29f75475..2c932b5e29 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -121,6 +165,86 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> >      return n->ns_size >> nvme_ns_lbads(ns);
> >  }
> >  
> > +static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
> > +{
> > +    return zone->d.zs >> 4;
> > +}
> > +
> > +static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState state)
> > +{
> > +    zone->d.zs = state << 4;
> > +}
> > +
> > +static inline uint64_t nvme_zone_rd_boundary(NvmeCtrl *n, NvmeZone *zone)
> > +{
> > +    return zone->d.zslba + n->params.zone_size;
> > +}
> > +
> > +static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
> > +{
> > +    return zone->d.zslba + zone->d.zcap;
> > +}
> 
> Everything working on zone->d needs leXX_to_cpu() conversions.

Disregard this. I see from the following patches that you keep zone->d
in cpu endianness and convert on zone management receive.

Sorry!
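
To illustrate the convention being acknowledged here: descriptor fields
stay in CPU endianness inside the emulator and are converted only when
an entry is copied into the report buffer returned to the host. A
standalone sketch (Linux's htole64() from <endian.h> stands in for
QEMU's cpu_to_le64()):

#include <endian.h>
#include <stdint.h>

struct zone_descr {          /* in-memory: CPU endianness */
    uint64_t zslba;
    uint64_t wp;
};

struct zone_descr_wire {     /* reported to host: little-endian */
    uint64_t zslba;
    uint64_t wp;
};

/* Conversion happens at the reporting boundary, nowhere else. */
static void fill_report_entry(const struct zone_descr *z,
                              struct zone_descr_wire *out)
{
    out->zslba = htole64(z->zslba);
    out->wp = htole64(z->wp);
}

int main(void)
{
    struct zone_descr z = { .zslba = 0, .wp = 42 };
    struct zone_descr_wire w;

    fill_report_entry(&z, &w);
    return w.wp == htole64(42) ? 0 : 1;
}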


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 09/18] hw/block/nvme: Define Zoned NS Command Set trace events
  2020-06-17 21:34 ` [PATCH v2 09/18] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
@ 2020-06-30 12:14   ` Klaus Jensen
  0 siblings, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30 12:14 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Maxim Levitsky, Keith Busch,
	Philippe Mathieu-Daudé,
	Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> The Zoned Namespace Command Set / Namespace Types implementation that
> is being introduced in this series adds a good number of trace events.
> Combine all tracepoint definitions into a separate patch to make
> reviewing more convenient.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

I would prefer that LBAs were reported in hex, but it's just personal
preference.

> ---
>  hw/block/trace-events | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 3f3323fe38..984db8a20c 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -66,6 +66,31 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
>  pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
>  pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
>  pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
> +pci_nvme_open_zone(uint64_t slba, uint32_t zone_idx, int all) "open zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
> +pci_nvme_close_zone(uint64_t slba, uint32_t zone_idx, int all) "close zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
> +pci_nvme_finish_zone(uint64_t slba, uint32_t zone_idx, int all) "finish zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
> +pci_nvme_reset_zone(uint64_t slba, uint32_t zone_idx, int all) "reset zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
> +pci_nvme_offline_zone(uint64_t slba, uint32_t zone_idx, int all) "offline zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
> +pci_nvme_set_descriptor_extension(uint64_t slba, uint32_t zone_idx) "set zone descriptor extension, slba=%"PRIu64", idx=%"PRIu32""
> +pci_nvme_zone_reset_recommended(uint64_t slba) "slba=%"PRIu64""
> +pci_nvme_zone_reset_internal_op(uint64_t slba) "slba=%"PRIu64""
> +pci_nvme_zone_finish_recommended(uint64_t slba) "slba=%"PRIu64""
> +pci_nvme_zone_finish_internal_op(uint64_t slba) "slba=%"PRIu64""
> +pci_nvme_zone_finished_by_controller(uint64_t slba) "slba=%"PRIu64""
> +pci_nvme_zd_extension_set(uint32_t zone_idx) "set descriptor extension for zone_idx=%"PRIu32""
> +pci_nvme_power_on_close(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Closed state"
> +pci_nvme_power_on_reset(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Empty state"
> +pci_nvme_power_on_full(uint32_t state, uint64_t slba) "zone state=%"PRIu32", slba=%"PRIu64" transitioned to Full state"
> +pci_nvme_zone_ae_not_enabled(int info, int log_page, int nsid) "zone async event not enabled, info=0x%"PRIx32", lp=0x%"PRIx32", nsid=%"PRIu32""
> +pci_nvme_zone_ae_not_cleared(int info, int log_page, int nsid) "zoned async event not cleared, info=0x%"PRIx32", lp=0x%"PRIx32", nsid=%"PRIu32""

Can we use uintxx_t's here?

> +pci_nvme_zone_aen_not_requested(uint32_t oaes) "zone descriptor AEN are not requested by host, oaes=0x%"PRIx32""
> +pci_nvme_getfeat_aen_cfg(uint64_t res) "reporting async event config res=%"PRIu64""
> +pci_nvme_setfeat_zone_info_aer_on(void) "zone info change notices enabled"
> +pci_nvme_setfeat_zone_info_aer_off(void) "zone info change notices disabled"
> +pci_nvme_changed_zone_log_read(uint16_t nsid) "changed zone list log of ns %"PRIu16""

nsid should be uint32_t.

> +pci_nvme_reporting_changed_zone(uint64_t zslba, uint8_t za) "zslba=%"PRIu64", attr=0x%"PRIx8""
> +pci_nvme_empty_changed_zone_list(void) "no changes zones to report"

s/changes/changed

> +pci_nvme_mapped_zone_file(char *zfile_name, int ret) "mapped zone file %s, error %d"
>  
>  # nvme traces for error conditions
>  pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> @@ -77,10 +102,25 @@ pci_nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not w
>  pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> +pci_nvme_err_capacity_exceeded(uint64_t zone_id, uint64_t nr_zones) "zone capacity exceeded, zone_id=%"PRIu64", nr_zones=%"PRIu64""

Change the name to pci_nvme_err_ZONE_capacity_exceeded maybe?

> +pci_nvme_err_unaligned_zone_cmd(uint8_t action, uint64_t slba, uint64_t zslba) "unaligned zone op 0x%"PRIx32", got slba=%"PRIu64", zslba=%"PRIu64""
> +pci_nvme_err_invalid_zone_state_transition(uint8_t state, uint8_t action, uint64_t slba, uint8_t attrs) "0x%"PRIx32"->0x%"PRIx32", slba=%"PRIu64", attrs=0x%"PRIx32""
> +pci_nvme_err_write_not_at_wp(uint64_t slba, uint64_t zone, uint64_t wp) "writing at slba=%"PRIu64", zone=%"PRIu64", but wp=%"PRIu64""
> +pci_nvme_err_append_not_at_start(uint64_t slba, uint64_t zone) "appending at slba=%"PRIu64", but zone=%"PRIu64""
> +pci_nvme_err_zone_write_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx16""
> +pci_nvme_err_zone_read_not_ok(uint64_t slba, uint32_t nlb, uint32_t status) "slba=%"PRIu64", nlb=%"PRIu32", status=0x%"PRIx16""
> +pci_nvme_err_append_too_large(uint64_t slba, uint32_t nlb, uint8_t zamds) "slba=%"PRIu64", nlb=%"PRIu32", zamds=%"PRIu8""
> +pci_nvme_err_insuff_active_res(uint32_t max_active) "max_active=%"PRIu32" zone limit exceeded"
> +pci_nvme_err_insuff_open_res(uint32_t max_open) "max_open=%"PRIu32" zone limit exceeded"
> +pci_nvme_err_zone_file_invalid(int error) "validation error=%"PRIi32""
> +pci_nvme_err_zd_extension_map_error(uint32_t zone_idx) "can't map descriptor extension for zone_idx=%"PRIu32""
> +pci_nvme_err_invalid_changed_zone_list_offset(uint64_t ofs) "changed zone list log offset must be 0, got %"PRIu64""
> +pci_nvme_err_invalid_changed_zone_list_len(uint32_t len) "changed zone list log size is 4096, got %"PRIu32""
>  pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
>  pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and effects log size is 4096, got %"PRIu32""
>  pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
>  pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
> +pci_nvme_err_only_zoned_cmd_set_avail(void) "setting 001b CC.CSS, but only ZONED+NVM command set is enabled"
>  pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
>  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
> @@ -113,6 +153,7 @@ pci_nvme_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_
>  pci_nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
>  pci_nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
>  pci_nvme_err_startfail(void) "setting controller enable bit failed"
> +pci_nvme_err_invalid_mgmt_action(int action) "action=0x%"PRIx32""

uint8_t for action here.
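
Taken together, the suggested type fixes would make these declarations
read roughly as follows (a sketch; the 8-bit widths for info and
log_page are an assumption, picked to match the fields they carry):

pci_nvme_zone_ae_not_enabled(uint8_t info, uint8_t log_page, uint32_t nsid) "zone async event not enabled, info=0x%"PRIx8", lp=0x%"PRIx8", nsid=%"PRIu32""
pci_nvme_zone_ae_not_cleared(uint8_t info, uint8_t log_page, uint32_t nsid) "zoned async event not cleared, info=0x%"PRIx8", lp=0x%"PRIx8", nsid=%"PRIu32""
pci_nvme_changed_zone_log_read(uint32_t nsid) "changed zone list log of ns %"PRIu32""
pci_nvme_empty_changed_zone_list(void) "no changed zones to report"
pci_nvme_err_invalid_mgmt_action(uint8_t action) "action=0x%"PRIx8""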

>  
>  # Traces for undefined behavior
>  pci_nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 10/18] hw/block/nvme: Support Zoned Namespace Command Set
  2020-06-17 21:34 ` [PATCH v2 10/18] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
@ 2020-06-30 13:31   ` Klaus Jensen
  0 siblings, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-06-30 13:31 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> The driver has been changed to advertise NVM Command Set when "zoned"
> driver property is not set (default) and Zoned Namespace Command Set
> otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Driver initialization code has been extended to create a proper
> configuration for zoned operation using driver properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Read Zeroes

s/Read Zeroes/Write Zeroes

> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support for Zone Descriptor Extensions is not included in
> this commit and the driver always reports ZDES 0. A later commit in
> this series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 

And s/driver/device ;)
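
The Zone Append semantics summarized above (the write implicitly lands
at the current write pointer and the pre-write pointer comes back in the
completion) condense into a standalone sketch (illustrative, names are
not from the patch):

#include <assert.h>
#include <stdint.h>

struct zone {
    uint64_t zslba;
    uint64_t wp;
    uint64_t zcap;
};

/* Returns the LBA the data was written at (what lands in the 64-bit
 * completion result), then advances the write pointer. */
static uint64_t zone_append(struct zone *z, uint32_t nlb)
{
    uint64_t result = z->wp;

    assert(result + nlb <= z->zslba + z->zcap); /* stay in the zone */
    z->wp += nlb;
    return result;
}

int main(void)
{
    struct zone z = { .zslba = 0, .wp = 0, .zcap = 256 };

    assert(zone_append(&z, 8) == 0);  /* first append starts at 0 */
    assert(zone_append(&z, 8) == 8);  /* next one at the new WP   */
    return 0;
}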

> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.c | 963 ++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 933 insertions(+), 30 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 453f4747a5..2e03b0b6ed 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -37,6 +37,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
>  #include "qemu/error-report.h"
> +#include "crypto/random.h"
>  #include "hw/block/block.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pci.h"
> @@ -69,6 +70,98 @@
>  
>  static void nvme_process_sq(void *opaque);
>  
> +/*
> + * Add a zone to the tail of a zone list.
> + */
> +static void nvme_add_zone_tail(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
> +    NvmeZone *zone)
> +{
> +    uint32_t idx = (uint32_t)(zone - ns->zone_array);
> +
> +    assert(nvme_zone_not_in_list(zone));
> +
> +    if (!zl->size) {
> +        zl->head = zl->tail = idx;
> +        zone->next = zone->prev = NVME_ZONE_LIST_NIL;
> +    } else {
> +        ns->zone_array[zl->tail].next = idx;
> +        zone->prev = zl->tail;
> +        zone->next = NVME_ZONE_LIST_NIL;
> +        zl->tail = idx;
> +    }
> +    zl->size++;
> +}
> +
> +/*
> + * Remove a zone from a zone list. The zone must be linked in the list.
> + */
> +static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
> +    NvmeZone *zone)
> +{
> +    uint32_t idx = (uint32_t)(zone - ns->zone_array);
> +
> +    assert(!nvme_zone_not_in_list(zone));
> +
> +    --zl->size;
> +    if (zl->size == 0) {
> +        zl->head = NVME_ZONE_LIST_NIL;
> +        zl->tail = NVME_ZONE_LIST_NIL;
> +    } else if (idx == zl->head) {
> +        zl->head = zone->next;
> +        ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
> +    } else if (idx == zl->tail) {
> +        zl->tail = zone->prev;
> +        ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
> +    } else {
> +        ns->zone_array[zone->next].prev = zone->prev;
> +        ns->zone_array[zone->prev].next = zone->next;
> +    }
> +
> +    zone->prev = zone->next = 0;
> +}
> +
> +static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, uint8_t state)
> +{
> +    if (!nvme_zone_not_in_list(zone)) {
> +        switch (nvme_get_zone_state(zone)) {
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +            nvme_remove_zone(n, ns, ns->exp_open_zones, zone);
> +            break;
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +            nvme_remove_zone(n, ns, ns->imp_open_zones, zone);
> +            break;
> +        case NVME_ZONE_STATE_CLOSED:
> +            nvme_remove_zone(n, ns, ns->closed_zones, zone);
> +            break;
> +        case NVME_ZONE_STATE_FULL:
> +            nvme_remove_zone(n, ns, ns->full_zones, zone);
> +        }
> +   }
> +
> +    nvme_set_zone_state(zone, state);
> +
> +    switch (state) {
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        nvme_add_zone_tail(n, ns, ns->exp_open_zones, zone);
> +        break;
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_add_zone_tail(n, ns, ns->imp_open_zones, zone);
> +        break;
> +    case NVME_ZONE_STATE_CLOSED:
> +        nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
> +        break;
> +    case NVME_ZONE_STATE_FULL:
> +        nvme_add_zone_tail(n, ns, ns->full_zones, zone);
> +        break;
> +    default:
> +        zone->d.za = 0;
> +        /* fall through */
> +    case NVME_ZONE_STATE_READ_ONLY:
> +        zone->tstamp = 0;
> +    }
> +}
> +
>  static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
>  {
>      hwaddr low = n->ctrl_mem.addr;
> @@ -314,6 +407,7 @@ static void nvme_post_cqes(void *opaque)
>  
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
>          sq = req->sq;
> +

Spurious newline.

>          req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
>          req->cqe.sq_id = cpu_to_le16(sq->sqid);
>          req->cqe.sq_head = cpu_to_le16(sq->head);
> @@ -328,6 +422,30 @@ static void nvme_post_cqes(void *opaque)
>      }
>  }
>  
> +static void nvme_fill_data(QEMUSGList *qsg, QEMUIOVector *iov,
> +    uint64_t offset, uint8_t pattern)
> +{
> +    ScatterGatherEntry *entry;
> +    uint32_t len, ent_len;
> +
> +    if (qsg->nsg > 0) {
> +        entry = qsg->sg;
> +        for (len = qsg->size; len > 0; len -= ent_len) {
> +            ent_len = MIN(len, entry->len);
> +            if (offset > ent_len) {
> +                offset -= ent_len;
> +            } else if (offset != 0) {
> +                dma_memory_set(qsg->as, entry->base + offset,
> +                               pattern, ent_len - offset);
> +                offset = 0;
> +            } else {
> +                dma_memory_set(qsg->as, entry->base, pattern, ent_len);
> +            }
> +            entry++;
> +        }
> +    }
> +}
> +
>  static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>  {
>      assert(cq->cqid == req->sq->cqid);
> @@ -336,6 +454,114 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>      timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>  }
>  
> +static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
> +    uint32_t nlb)
> +{
> +    uint16_t status;
> +
> +    if (unlikely((slba + nlb) > nvme_zone_wr_boundary(zone))) {
> +        return NVME_ZONE_BOUNDARY_ERROR;
> +    }
> +
> +    switch (nvme_get_zone_state(zone)) {
> +    case NVME_ZONE_STATE_EMPTY:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_CLOSED:
> +        status = NVME_SUCCESS;
> +        break;
> +    case NVME_ZONE_STATE_FULL:
> +        status = NVME_ZONE_FULL;
> +        break;
> +    case NVME_ZONE_STATE_OFFLINE:
> +        status = NVME_ZONE_OFFLINE;
> +        break;
> +    case NVME_ZONE_STATE_READ_ONLY:
> +        status = NVME_ZONE_READ_ONLY;
> +        break;
> +    default:
> +        assert(false);
> +    }
> +    return status;
> +}
> +
> +static uint16_t nvme_check_zone_read(NvmeCtrl *n, NvmeZone *zone, uint64_t slba,
> +    uint32_t nlb, bool zone_x_ok)
> +{
> +    uint64_t lba = slba, count;
> +    uint16_t status;
> +    uint8_t zs;
> +
> +    do {
> +        if (!zone_x_ok && (lba + nlb > nvme_zone_rd_boundary(n, zone))) {
> +            return NVME_ZONE_BOUNDARY_ERROR | NVME_DNR;
> +        }
> +
> +        zs = nvme_get_zone_state(zone);
> +        switch (zs) {
> +        case NVME_ZONE_STATE_EMPTY:
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_FULL:
> +        case NVME_ZONE_STATE_CLOSED:
> +        case NVME_ZONE_STATE_READ_ONLY:
> +            status = NVME_SUCCESS;
> +            break;
> +        case NVME_ZONE_STATE_OFFLINE:
> +            status = NVME_ZONE_OFFLINE | NVME_DNR;
> +            break;
> +        default:
> +            assert(false);
> +        }
> +        if (status != NVME_SUCCESS) {
> +            break;
> +        }
> +
> +        if (lba + nlb > nvme_zone_rd_boundary(n, zone)) {
> +            count = nvme_zone_rd_boundary(n, zone) - lba;
> +        } else {
> +            count = nlb;
> +        }
> +
> +        lba += count;
> +        nlb -= count;
> +        zone++;
> +    } while (nlb);
> +
> +    return status;
> +}
> +
> +static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, uint32_t nlb)
> +{
> +    uint64_t result = cpu_to_le64(zone->d.wp);
> +    uint8_t zs = nvme_get_zone_state(zone);
> +
> +    zone->d.wp += nlb;
> +
> +    if (zone->d.wp == nvme_zone_wr_boundary(zone)) {
> +        switch (zs) {
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_CLOSED:
> +        case NVME_ZONE_STATE_EMPTY:
> +            break;
> +        default:
> +            assert(false);
> +        }
> +        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
> +    } else {
> +        switch (zs) {
> +        case NVME_ZONE_STATE_EMPTY:
> +        case NVME_ZONE_STATE_CLOSED:
> +            nvme_assign_zone_state(n, ns, zone,
> +                                   NVME_ZONE_STATE_IMPLICITLY_OPEN);
> +        }
> +    }
> +
> +    return result;
> +}
> +
>  static void nvme_rw_cb(void *opaque, int ret)
>  {
>      NvmeRequest *req = opaque;
> @@ -344,6 +570,10 @@ static void nvme_rw_cb(void *opaque, int ret)
>      NvmeCQueue *cq = n->cq[sq->cqid];
>  
>      if (!ret) {
> +        if (req->flags & NVME_REQ_FLG_FILL) {
> +            nvme_fill_data(&req->qsg, &req->iov, req->fill_ofs,
> +                           n->params.fill_pattern);
> +        }
>          block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
>          req->status = NVME_SUCCESS;
>      } else {
> @@ -370,22 +600,53 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>      NvmeRequest *req)
>  {
>      NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
> +    NvmeZone *zone = NULL;
>      const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
>      const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
>      uint64_t slba = le64_to_cpu(rw->slba);
>      uint32_t nlb  = le16_to_cpu(rw->nlb) + 1;
> +    uint64_t zone_idx;
>      uint64_t offset = slba << data_shift;
>      uint32_t count = nlb << data_shift;
> +    uint16_t status;
>  
>      if (unlikely(slba + nlb > ns->id_ns.nsze)) {
>          trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
>          return NVME_LBA_RANGE | NVME_DNR;
>      }
>  
> +    if (n->params.zoned) {
> +        zone_idx = slba / n->params.zone_size;
> +        if (unlikely(zone_idx >= n->num_zones)) {
> +            trace_pci_nvme_err_capacity_exceeded(zone_idx, n->num_zones);
> +            return NVME_CAP_EXCEEDED | NVME_DNR;
> +        }

Capacity Exceeded happens when the NUSE exceeds NCAP; shouldn't this be
LBA Out of Range? Then again, this shouldn't happen in practice, since
such an slba would already be caught by the regular bounds check
(exceeding NSZE).
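
A quick standalone check of that reasoning (assuming, as this series
does, that the namespace is sized as num_zones * zone_size): any slba
surviving the "slba + nlb > nsze" rejection with nlb >= 1 necessarily
has slba < nsze, so slba / zone_size < num_zones.

#include <assert.h>
#include <stdint.h>

int main(void)
{
    const uint64_t zone_size = 256, num_zones = 4;
    const uint64_t nsze = num_zones * zone_size;

    /* Every slba passing the NSZE bounds check maps to a valid
     * zone index, so the zone_idx check can never fire. */
    for (uint64_t slba = 0; slba + 1 <= nsze; slba++) {
        assert(slba / zone_size < num_zones);
    }
    return 0;
}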

> +
> +        zone = &ns->zone_array[zone_idx];
> +
> +        status = nvme_check_zone_write(zone, slba, nlb);
> +        if (status != NVME_SUCCESS) {
> +            trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
> +            return status | NVME_DNR;
> +        }
> +
> +        assert(nvme_wp_is_valid(zone));
> +        if (unlikely(slba != zone->d.wp)) {
> +            trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
> +                                               zone->d.wp);
> +            return NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +        }
> +    }
> +
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
>                       BLOCK_ACCT_WRITE);
>      req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
>                                          BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
> +
> +    if (n->params.zoned) {
> +        req->cqe.result64 = nvme_finalize_zone_write(n, ns, zone, nlb);
> +    }
> +
>      return NVME_NO_COMPLETE;
>  }
>  
> @@ -393,16 +654,19 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>      NvmeRequest *req)
>  {
>      NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
> +    NvmeZone *zone = NULL;
>      uint32_t nlb  = le32_to_cpu(rw->nlb) + 1;
>      uint64_t slba = le64_to_cpu(rw->slba);
>      uint64_t prp1 = le64_to_cpu(rw->prp1);
>      uint64_t prp2 = le64_to_cpu(rw->prp2);
> -
> +    uint64_t zone_idx = 0;
> +    uint16_t status;
>      uint8_t lba_index  = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
>      uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
>      uint64_t data_size = (uint64_t)nlb << data_shift;
> -    uint64_t data_offset = slba << data_shift;
> -    int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
> +    uint64_t data_offset;
> +    bool is_write = rw->opcode == NVME_CMD_WRITE ||
> +                    (req->flags & NVME_REQ_FLG_APPEND);
>      enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
>  
>      trace_pci_nvme_rw(is_write ? "write" : "read", nlb, data_size, slba);
> @@ -413,11 +677,79 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>          return NVME_LBA_RANGE | NVME_DNR;
>      }
>  
> +    if (n->params.zoned) {
> +        zone_idx = slba / n->params.zone_size;
> +        if (unlikely(zone_idx >= n->num_zones)) {
> +            trace_pci_nvme_err_capacity_exceeded(zone_idx, n->num_zones);
> +            return NVME_CAP_EXCEEDED | NVME_DNR;
> +        }
> +
> +        zone = &ns->zone_array[zone_idx];
> +
> +        if (is_write) {
> +            status = nvme_check_zone_write(zone, slba, nlb);
> +            if (status != NVME_SUCCESS) {
> +                trace_pci_nvme_err_zone_write_not_ok(slba, nlb, status);
> +                return status | NVME_DNR;
> +            }
> +
> +            assert(nvme_wp_is_valid(zone));
> +            if (req->flags & NVME_REQ_FLG_APPEND) {
> +                if (unlikely(slba != zone->d.zslba)) {
> +                    trace_pci_nvme_err_append_not_at_start(slba, zone->d.zslba);
> +                    return NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +                }
> +                if (data_size > (n->page_size << n->zamds)) {
> +                    trace_pci_nvme_err_append_too_large(slba, nlb, n->zamds);
> +                    return NVME_INVALID_FIELD | NVME_DNR;
> +                }
> +                slba = zone->d.wp;
> +            } else if (unlikely(slba != zone->d.wp)) {
> +                trace_pci_nvme_err_write_not_at_wp(slba, zone->d.zslba,
> +                                                   zone->d.wp);
> +                return NVME_ZONE_INVALID_WRITE | NVME_DNR;
> +            }
> +        } else {
> +            status = nvme_check_zone_read(n, zone, slba, nlb,
> +                                          n->params.cross_zone_read);
> +            if (status != NVME_SUCCESS) {
> +                trace_pci_nvme_err_zone_read_not_ok(slba, nlb, status);
> +                return status | NVME_DNR;
> +            }
> +
> +            if (slba + nlb > zone->d.wp) {
> +                /*
> +                 * All or some data is read above the WP. Need to
> +                 * fill out the buffer area that has no backing data
> +                 * with a predefined data pattern (zeros by default)
> +                 */
> +                req->flags |= NVME_REQ_FLG_FILL;
> +                if (slba >= zone->d.wp) {
> +                    req->fill_ofs = 0;
> +                } else {
> +                    req->fill_ofs = ((zone->d.wp - slba) << data_shift);
> +                }
> +            }
> +        }
> +    } else if (req->flags & NVME_REQ_FLG_APPEND) {
> +        trace_pci_nvme_err_invalid_opc(cmd->opcode);
> +        return NVME_INVALID_OPCODE | NVME_DNR;
> +    }
> +
>      if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
>          block_acct_invalid(blk_get_stats(n->conf.blk), acct);
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> +    if (unlikely(!is_write && (req->flags & NVME_REQ_FLG_FILL) &&
> +                 (req->fill_ofs == 0))) {
> +        /* No backend I/O necessary, only need to fill the buffer */
> +        nvme_fill_data(&req->qsg, &req->iov, 0, n->params.fill_pattern);
> +        req->status = NVME_SUCCESS;
> +        return NVME_SUCCESS;
> +    }
> +
> +    data_offset = slba << data_shift;
>      dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
>      if (req->qsg.nsg > 0) {
>          req->flags |= NVME_REQ_FLG_HAS_SG;
> @@ -434,9 +766,383 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>                             req);
>      }
>  
> +    if (is_write && n->params.zoned) {
> +        req->cqe.result64 = nvme_finalize_zone_write(n, ns, zone, nlb);
> +    }
> +
>      return NVME_NO_COMPLETE;
>  }
>  
> +static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeCmd *c, uint64_t *slba, uint64_t *zone_idx)
> +{
> +    uint32_t dw10 = le32_to_cpu(c->cdw10);
> +    uint32_t dw11 = le32_to_cpu(c->cdw11);
> +
> +    if (!n->params.zoned) {
> +        trace_pci_nvme_err_invalid_opc(c->opcode);
> +        return NVME_INVALID_OPCODE | NVME_DNR;
> +    }
> +
> +    *slba = ((uint64_t)dw11) << 32 | dw10;
> +    if (unlikely(*slba >= ns->id_ns.nsze)) {
> +        trace_pci_nvme_err_invalid_lba_range(*slba, 0, ns->id_ns.nsze);
> +        *slba = 0;
> +        return NVME_LBA_RANGE | NVME_DNR;
> +    }
> +
> +    *zone_idx = *slba / n->params.zone_size;
> +    if (unlikely(*zone_idx >= n->num_zones)) {
> +        trace_pci_nvme_err_capacity_exceeded(*zone_idx, n->num_zones);
> +        *zone_idx = 0;
> +        return NVME_CAP_EXCEEDED | NVME_DNR;
> +    }
> +
> +    return NVME_SUCCESS;
> +}
> +
> +static uint16_t nvme_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_EMPTY:
> +    case NVME_ZONE_STATE_CLOSED:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
> +        /* fall through */
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_open_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_CLOSED;
> +}
> +
> +static uint16_t nvme_close_zone(NvmeCtrl *n,  NvmeNamespace *ns,
> +    NvmeZone *zone, uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
> +        /* fall through */
> +    case NVME_ZONE_STATE_CLOSED:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_close_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_EXPLICITLY_OPEN;
> +}
> +
> +static uint16_t nvme_finish_zone(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_CLOSED:
> +    case NVME_ZONE_STATE_EMPTY:
> +        zone->d.wp = nvme_zone_wr_boundary(zone);
> +        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
> +        /* fall through */
> +    case NVME_ZONE_STATE_FULL:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_finish_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_CLOSED;
> +}
> +
> +static uint16_t nvme_reset_zone(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +    case NVME_ZONE_STATE_CLOSED:
> +    case NVME_ZONE_STATE_FULL:
> +        zone->d.wp = zone->d.zslba;
> +        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EMPTY);
> +        /* fall through */
> +    case NVME_ZONE_STATE_EMPTY:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_reset_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_IMPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_EXPLICITLY_OPEN ||
> +           state == NVME_ZONE_STATE_CLOSED ||
> +           state == NVME_ZONE_STATE_FULL;
> +}
> +
> +static uint16_t nvme_offline_zone(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, uint8_t state)
> +{
> +    switch (state) {
> +    case NVME_ZONE_STATE_READ_ONLY:
> +        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_OFFLINE);
> +        /* fall through */
> +    case NVME_ZONE_STATE_OFFLINE:
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
> +static bool nvme_cond_offline_all(uint8_t state)
> +{
> +    return state == NVME_ZONE_STATE_READ_ONLY;
> +}
> +
> +static uint16_t name_do_zone_op(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, uint8_t state, bool all,
> +    uint16_t (*op_hndlr)(NvmeCtrl *, NvmeNamespace *, NvmeZone *,
> +                         uint8_t), bool (*proc_zone)(uint8_t))
> +{
> +    int i;
> +    uint16_t status = 0;
> +
> +    if (!all) {
> +        status = op_hndlr(n, ns, zone, state);
> +    } else {
> +        for (i = 0; i < n->num_zones; i++, zone++) {
> +            state = nvme_get_zone_state(zone);
> +            if (proc_zone(state)) {
> +                status = op_hndlr(n, ns, zone, state);
> +                if (status != NVME_SUCCESS) {
> +                    break;
> +                }
> +            }
> +        }
> +    }
> +
> +    return status;
> +}

This is actually pretty neat :)
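
What makes it neat is that each zone action reduces to a per-zone
handler plus a predicate saying which zones the "all" variant should
touch. A standalone sketch of the same dispatch pattern (generic types,
not the patch's):

#include <stdbool.h>
#include <stdint.h>

enum { ZS_EMPTY, ZS_OPEN, ZS_CLOSED };

struct zone {
    int state;
};

/* Apply op to one zone, or to every zone the predicate selects. */
static uint16_t do_zone_op(struct zone *zones, int nr, bool all,
                           uint16_t (*op)(struct zone *),
                           bool (*match)(int state))
{
    if (!all) {
        return op(zones);
    }
    for (int i = 0; i < nr; i++) {
        if (match(zones[i].state)) {
            uint16_t status = op(&zones[i]);
            if (status) {
                return status;
            }
        }
    }
    return 0;
}

static uint16_t close_zone(struct zone *z)
{
    z->state = ZS_CLOSED;
    return 0;
}

static bool cond_close(int state)
{
    return state == ZS_OPEN;
}

int main(void)
{
    struct zone zones[2] = { { ZS_OPEN }, { ZS_EMPTY } };

    return do_zone_op(zones, 2, true, close_zone, cond_close);
}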

> +
> +static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint64_t slba = 0;
> +    uint64_t zone_idx = 0;
> +    uint16_t status;
> +    uint8_t action, state;
> +    bool all;
> +    NvmeZone *zone;
> +
> +    action = dw13 & 0xff;
> +    all = dw13 & 0x100;
> +
> +    req->status = NVME_SUCCESS;
> +
> +    if (!all) {
> +        status = nvme_get_mgmt_zone_slba_idx(n, ns, cmd, &slba, &zone_idx);
> +        if (status) {
> +            return status;
> +        }
> +    }
> +
> +    zone = &ns->zone_array[zone_idx];
> +    if (slba != zone->d.zslba) {
> +        trace_pci_nvme_err_unaligned_zone_cmd(action, slba, zone->d.zslba);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +    state = nvme_get_zone_state(zone);
> +
> +    switch (action) {
> +
> +    case NVME_ZONE_ACTION_OPEN:
> +        trace_pci_nvme_open_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(n, ns, zone, state, all,
> +                                 nvme_open_zone, nvme_cond_open_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_CLOSE:
> +        trace_pci_nvme_close_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(n, ns, zone, state, all,
> +                                 nvme_close_zone, nvme_cond_close_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_FINISH:
> +        trace_pci_nvme_finish_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(n, ns, zone, state, all,
> +                                 nvme_finish_zone, nvme_cond_finish_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_RESET:
> +        trace_pci_nvme_reset_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(n, ns, zone, state, all,
> +                                 nvme_reset_zone, nvme_cond_reset_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_OFFLINE:
> +        trace_pci_nvme_offline_zone(slba, zone_idx, all);
> +        status = name_do_zone_op(n, ns, zone, state, all,
> +                                 nvme_offline_zone, nvme_cond_offline_all);
> +        break;
> +
> +    case NVME_ZONE_ACTION_SET_ZD_EXT:
> +        trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +        break;
> +
> +    default:
> +        trace_pci_nvme_err_invalid_mgmt_action(action);
> +        status = NVME_INVALID_FIELD;
> +    }
> +
> +    if (status == NVME_ZONE_INVAL_TRANSITION) {
> +        trace_pci_nvme_err_invalid_zone_state_transition(state, action, slba,
> +                                                         zone->d.za);
> +    }
> +    if (status) {
> +        status |= NVME_DNR;
> +    }
> +
> +    return status;
> +}
> +
> +static bool nvme_zone_matches_filter(uint32_t zafs, NvmeZone *zl)
> +{
> +    int zs = nvme_get_zone_state(zl);
> +
> +    switch (zafs) {
> +    case NVME_ZONE_REPORT_ALL:
> +        return true;
> +    case NVME_ZONE_REPORT_EMPTY:
> +        return (zs == NVME_ZONE_STATE_EMPTY);
> +    case NVME_ZONE_REPORT_IMPLICITLY_OPEN:
> +        return (zs == NVME_ZONE_STATE_IMPLICITLY_OPEN);
> +    case NVME_ZONE_REPORT_EXPLICITLY_OPEN:
> +        return (zs == NVME_ZONE_STATE_EXPLICITLY_OPEN);
> +    case NVME_ZONE_REPORT_CLOSED:
> +        return (zs == NVME_ZONE_STATE_CLOSED);
> +    case NVME_ZONE_REPORT_FULL:
> +        return (zs == NVME_ZONE_STATE_FULL);
> +    case NVME_ZONE_REPORT_READ_ONLY:
> +        return (zs == NVME_ZONE_STATE_READ_ONLY);
> +    case NVME_ZONE_REPORT_OFFLINE:
> +        return (zs == NVME_ZONE_STATE_OFFLINE);
> +    default:
> +        return false;
> +    }
> +}
> +
> +static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +    /* cdw12 is zero-based number of dwords to return. Convert to bytes */
> +    uint32_t len = (le32_to_cpu(cmd->cdw12) + 1) << 2;
> +    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint32_t zra, zrasf, partial;
> +    uint64_t max_zones, zone_index, nr_zones = 0;
> +    uint16_t ret;
> +    uint64_t slba;
> +    NvmeZoneDescr *z;
> +    NvmeZone *zs;
> +    NvmeZoneReportHeader *header;
> +    void *buf, *buf_p;
> +    size_t zone_entry_sz;
> +
> +    req->status = NVME_SUCCESS;
> +
> +    ret = nvme_get_mgmt_zone_slba_idx(n, ns, cmd, &slba, &zone_index);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    if (len < sizeof(NvmeZoneReportHeader)) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    zra = dw13 & 0xff;
> +    if (!(zra == NVME_ZONE_REPORT || zra == NVME_ZONE_REPORT_EXTENDED)) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    if (zra == NVME_ZONE_REPORT_EXTENDED) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    zrasf = (dw13 >> 8) & 0xff;
> +    if (zrasf > NVME_ZONE_REPORT_OFFLINE) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    partial = (dw13 >> 16) & 0x01;
> +
> +    zone_entry_sz = sizeof(NvmeZoneDescr);
> +
> +    max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
> +    buf = g_malloc0(len);
> +
> +    header = (NvmeZoneReportHeader *)buf;
> +    buf_p = buf + sizeof(NvmeZoneReportHeader);
> +
> +    while (zone_index < n->num_zones && nr_zones < max_zones) {
> +        zs = &ns->zone_array[zone_index];
> +
> +        if (!nvme_zone_matches_filter(zrasf, zs)) {
> +            zone_index++;
> +            continue;
> +        }
> +
> +        z = (NvmeZoneDescr *)buf_p;
> +        buf_p += sizeof(NvmeZoneDescr);
> +        nr_zones++;
> +
> +        z->zt = zs->d.zt;
> +        z->zs = zs->d.zs;
> +        z->zcap = cpu_to_le64(zs->d.zcap);
> +        z->zslba = cpu_to_le64(zs->d.zslba);
> +        z->za = zs->d.za;
> +
> +        if (nvme_wp_is_valid(zs)) {
> +            z->wp = cpu_to_le64(zs->d.wp);
> +        } else {
> +            z->wp = cpu_to_le64(~0ULL);
> +        }
> +
> +        zone_index++;
> +    }
> +
> +    if (!partial) {
> +        for (; zone_index < n->num_zones; zone_index++) {
> +            zs = &ns->zone_array[zone_index];
> +            if (nvme_zone_matches_filter(zrasf, zs)) {
> +                nr_zones++;
> +            }
> +        }
> +    }
> +    header->nr_zones = cpu_to_le64(nr_zones);
> +
> +    ret = nvme_dma_read_prp(n, (uint8_t *)buf, len, prp1, prp2);
> +    g_free(buf);
> +
> +    return ret;
> +}
> +
>  static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>      NvmeNamespace *ns;
> @@ -453,9 +1159,16 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_flush(n, ns, cmd, req);
>      case NVME_CMD_WRITE_ZEROS:
>          return nvme_write_zeros(n, ns, cmd, req);
> +    case NVME_CMD_ZONE_APND:
> +        req->flags |= NVME_REQ_FLG_APPEND;
> +        /* fall through */
>      case NVME_CMD_WRITE:
>      case NVME_CMD_READ:
>          return nvme_rw(n, ns, cmd, req);
> +    case NVME_CMD_ZONE_MGMT_SEND:
> +        return nvme_zone_mgmt_send(n, ns, cmd, req);
> +    case NVME_CMD_ZONE_MGMT_RECV:
> +        return nvme_zone_mgmt_recv(n, ns, cmd, req);
>      default:
>          trace_pci_nvme_err_invalid_opc(cmd->opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
> @@ -675,6 +1388,16 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
>      return NVME_SUCCESS;
>  }
>  
> +static inline bool nvme_csi_has_nvm_support(NvmeNamespace *ns)
> +{
> +    switch (ns->csi) {
> +    case NVME_CSI_NVM:
> +    case NVME_CSI_ZONED:
> +        return true;
> +    }
> +    return false;
> +}
> +
>  static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
>  {
>      uint64_t prp1 = le64_to_cpu(c->prp1);
> @@ -701,6 +1424,12 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeIdentify *c)
>          ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
>          g_free(list);
>          return ret;
> +    } else if (c->csi == NVME_CSI_ZONED && n->params.zoned) {
> +        NvmeIdCtrlZoned *id = g_malloc0(sizeof(*id));
> +        id->zamds = n->zamds;
> +        ret = nvme_dma_read_prp(n, (uint8_t *)id, sizeof(*id), prp1, prp2);
> +        g_free(id);
> +        return ret;
>      } else {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
> @@ -723,8 +1452,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
>      ns = &n->namespaces[nsid - 1];
>      assert(nsid == ns->nsid);
>  
> -    return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> -        prp1, prp2);
> +    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
> +        return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> +            prp1, prp2);
> +    }
> +
> +    return NVME_INVALID_CMD_SET | NVME_DNR;
>  }
>  
>  static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeIdentify *c)
> @@ -747,14 +1480,17 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeIdentify *c)
>      ns = &n->namespaces[nsid - 1];
>      assert(nsid == ns->nsid);
>  
> -    if (c->csi == NVME_CSI_NVM) {
> +    if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
>          list = g_malloc0(data_len);
>          ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
>          g_free(list);
>          return ret;
> -    } else {
> -        return NVME_INVALID_FIELD | NVME_DNR;
> +    } else if (c->csi == NVME_CSI_ZONED && ns->csi == NVME_CSI_ZONED) {
> +        return nvme_dma_read_prp(n, (uint8_t *)ns->id_ns_zoned,
> +                                 sizeof(*ns->id_ns_zoned), prp1, prp2);
>      }
> +
> +    return NVME_INVALID_FIELD | NVME_DNR;
>  }
>  
>  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
> @@ -796,13 +1532,13 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeIdentify *c)
>  
>      trace_pci_nvme_identify_nslist_csi(min_nsid, c->csi);
>  
> -    if (c->csi != NVME_CSI_NVM) {
> +    if (c->csi != NVME_CSI_NVM && c->csi != NVME_CSI_ZONED) {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
>      list = g_malloc0(data_len);
>      for (i = 0; i < n->num_namespaces; i++) {
> -        if (i < min_nsid) {
> +        if (i < min_nsid || c->csi != n->namespaces[i].csi) {
>              continue;
>          }
>          list[j++] = cpu_to_le32(i + 1);
> @@ -851,7 +1587,7 @@ static uint16_t nvme_list_ns_descriptors(NvmeCtrl *n, NvmeIdentify *c)
>      desc->nidt = NVME_NIDT_CSI;
>      desc->nidl = NVME_NIDL_CSI;
>      buf_ptr += sizeof(*desc);
> -    *(uint8_t *)buf_ptr = NVME_CSI_NVM;
> +    *(uint8_t *)buf_ptr = ns->csi;
>  
>      status = nvme_dma_read_prp(n, buf, data_len, prp1, prp2);
>      g_free(buf);
> @@ -872,6 +1608,9 @@ static uint16_t nvme_identify_cmd_set(NvmeCtrl *n, NvmeIdentify *c)
>      list = g_malloc0(data_len);
>      ptr = (uint8_t *)list;
>      NVME_SET_CSI(*ptr, NVME_CSI_NVM);
> +    if (n->params.zoned) {
> +        NVME_SET_CSI(*ptr, NVME_CSI_ZONED);
> +    }
>      status = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
>      g_free(list);
>      return status;
> @@ -1038,7 +1777,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  }
>  
>  static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
> -    uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len)
> +    uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len, uint8_t csi)
>  {
>     NvmeEffectsLog cmd_eff_log = {};
>     uint32_t *iocs = cmd_eff_log.iocs;
> @@ -1063,11 +1802,19 @@ static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
>      iocs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
>      iocs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
>  
> -    iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_WRITE_ZEROS] = NVME_CMD_EFFECTS_CSUPP |
> -                                 NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> -    iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +    if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
> +        iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_WRITE_ZEROS] = NVME_CMD_EFFECTS_CSUPP |
> +                                     NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +    }
> +    if (csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_ALL_NSTYPES) {
> +        iocs[NVME_CMD_ZONE_APND] = NVME_CMD_EFFECTS_CSUPP |
> +                                   NVME_CMD_EFFECTS_LBCC;
> +        iocs[NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFFECTS_CSUPP;
> +        iocs[NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFFECTS_CSUPP;
> +    }
>  
>      return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log, len, prp1, prp2);
>  }
> @@ -1083,6 +1830,7 @@ static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
>      uint64_t ofs = (dw13 << 32) | dw12;
>      uint32_t numdl, numdu, len;
>      uint16_t lid = dw10 & 0xff;
> +    uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
>  
>      numdl = dw10 >> 16;
>      numdu = dw11 & 0xffff;
> @@ -1090,8 +1838,8 @@ static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
>  
>      switch (lid) {
>      case NVME_LOG_CMD_EFFECTS:
> -        return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len);
> -    }
> +        return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len, csi);
> +     }
>  
>      trace_pci_nvme_unsupported_log_page(lid);
>      return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1255,6 +2003,14 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>          return -1;
>      }
>  
> +    if (n->params.zoned) {
> +        if (!n->params.zamds_bs) {
> +            n->params.zamds_bs = NVME_DEFAULT_MAX_ZA_SIZE;
> +        }

0 is a valid value for zasl (it means the same as mdts).
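
Something like this would preserve a user-specified 0 (untested sketch,
keeping the zamds naming and not modifying the parameter in place):

    if (n->params.zoned) {
        if (n->params.zamds_bs) {
            n->zamds = nvme_ilog2(n->params.zamds_bs * KiB / page_size);
        } else {
            n->zamds = 0; /* same as mdts */
        }
    }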

> +        n->params.zamds_bs *= KiB;
> +        n->zamds = nvme_ilog2(n->params.zamds_bs / page_size);
> +    }
> +
>      n->page_bits = page_bits;
>      n->page_size = page_size;
>      n->max_prp_ents = n->page_size / sizeof(uint64_t);
> @@ -1324,6 +2080,11 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>              }
>              switch (NVME_CC_CSS(data)) {
>              case CSS_NVM_ONLY:
> +                if (n->params.zoned) {
> +                    NVME_GUEST_ERR(pci_nvme_err_only_zoned_cmd_set_avail,
> +                                   "only NVM+ZONED command set can be selected");
> +                    break;
> +                }
>                  trace_pci_nvme_css_nvm_cset_selected_by_host(data & 0xffffffff);
>                  break;
>              case CSS_ALL_NSTYPES:

Actually, I think this whole switch is wrong; I don't see why the
controller has to validate anything here. It should be removed from the
NST patch as well.

The TP is a little unclear on the behavior, but I think the only
reasonable behavior would be similar to what applies for the "Admin
Command Set only" (111b), that is - for any I/O command submitted, just
return Invalid Command Opcode.

For the zoned namespace there is nothing that prevents a host from
interacting with it "blindly" through the NVM command set (the zoned
command set includes it). Even though the host has no means of getting a
zone report etc., it can still read and write.
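
A minimal sketch of that, as a check at the top of nvme_io_cmd() (using
only definitions that are already in the patch):

    if (NVME_CC_CSS(n->bar.cc) == CSS_ADMIN_ONLY) {
        return NVME_INVALID_OPCODE | NVME_DNR;
    }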

> @@ -1609,6 +2370,120 @@ static const MemoryRegionOps nvme_cmb_ops = {
>      },
>  };
>  
> +static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
> +    uint64_t capacity)
> +{
> +    NvmeZone *zone;
> +    uint64_t start = 0, zone_size = n->params.zone_size;
> +    int i;
> +
> +    ns->zone_array = g_malloc0(n->zone_array_size);
> +    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
> +    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
> +    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
> +    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
> +    zone = ns->zone_array;
> +
> +    nvme_init_zone_list(ns->exp_open_zones);
> +    nvme_init_zone_list(ns->imp_open_zones);
> +    nvme_init_zone_list(ns->closed_zones);
> +    nvme_init_zone_list(ns->full_zones);
> +
> +    for (i = 0; i < n->num_zones; i++, zone++) {
> +        if (start + zone_size > capacity) {
> +            zone_size = capacity - start;
> +        }
> +        zone->d.zt = NVME_ZONE_TYPE_SEQ_WRITE;
> +        nvme_set_zone_state(zone, NVME_ZONE_STATE_EMPTY);
> +        zone->d.za = 0;
> +        zone->d.zcap = n->params.zone_capacity;
> +        zone->d.zslba = start;
> +        zone->d.wp = start;
> +        zone->prev = 0;
> +        zone->next = 0;
> +        start += zone_size;
> +    }
> +
> +    return 0;
> +}
> +

The following function is super confusing. Bear with me ;)

> +static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
> +{
> +    uint64_t zone_size = 0, capacity;
> +    uint32_t nz;
> +
> +    if (n->params.zone_size) {
> +        zone_size = n->params.zone_size;
> +    } else {
> +        zone_size = NVME_DEFAULT_ZONE_SIZE;
> +    }

So zone_size is in MiB.

> +    if (!n->params.zone_capacity) {
> +        n->params.zone_capacity = zone_size;
> +    }

OK, default the zone_capacity parameter to zone_size (in MiB).

> +    n->zone_size_bs = zone_size * MiB;

OK, this is in bytes.

> +    n->params.zone_size = n->zone_size_bs / n->conf.logical_block_size;

Now the zone_size parameter is in terms of LBAs?

> +    capacity = n->params.zone_capacity * MiB;

OK, this is in bytes.

> +    n->params.zone_capacity = capacity / n->conf.logical_block_size;

And now this parameter is also in terms of LBAs.

> +    if (n->params.zone_capacity > n->params.zone_size) {
> +        error_setg(errp, "zone capacity exceeds zone size");
> +        return;
> +    }
> +    zone_size = n->params.zone_size;

And now zone_size is in LBAs.

> +
> +    capacity = n->ns_size / n->conf.logical_block_size;

And now overwrite capacity to be in LBAs as well. Wait, what was capacity
previously? And what was params.zone_capacity? Madness!

> +    nz = DIV_ROUND_UP(capacity, zone_size);
> +    n->num_zones = nz;
> +    n->zone_array_size = sizeof(NvmeZone) * nz;
> +
> +    return;
> +}
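
For example (just a sketch; the point is to keep the unit in the local
variable names instead of overwriting the parameters):

    uint64_t zsize_mb = n->params.zone_size ? n->params.zone_size :
                        NVME_DEFAULT_ZONE_SIZE;
    uint64_t zcap_mb = n->params.zone_capacity ? n->params.zone_capacity :
                       zsize_mb;
    uint64_t zsize_lbas = zsize_mb * MiB / n->conf.logical_block_size;
    uint64_t zcap_lbas = zcap_mb * MiB / n->conf.logical_block_size;
    uint64_t ns_lbas = n->ns_size / n->conf.logical_block_size;

    if (zcap_lbas > zsize_lbas) {
        error_setg(errp, "zone capacity exceeds zone size");
        return;
    }

    n->num_zones = DIV_ROUND_UP(ns_lbas, zsize_lbas);
    n->zone_array_size = sizeof(NvmeZone) * n->num_zones;

The LBA-unit values would then have to live in dedicated NvmeCtrl fields
for the rest of the code to use, but at least nothing changes meaning
halfway through the function.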
> +
> +static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
> +    Error **errp)
> +{
> +    int ret;
> +
> +    ret = nvme_init_zone_meta(n, ns, n->num_zones * n->params.zone_size);
> +    if (ret) {
> +        error_setg(errp, "could not init zone metadata");
> +        return -1;
> +    }
> +
> +    ns->id_ns_zoned = g_malloc0(sizeof(*ns->id_ns_zoned));
> +
> +    /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
> +    ns->id_ns_zoned->mar = 0xffffffff;
> +    ns->id_ns_zoned->mor = 0xffffffff;
> +    ns->id_ns_zoned->zoc = 0;
> +    ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
> +
> +    ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->params.zone_size);
> +    ns->id_ns_zoned->lbafe[lba_index].zdes = 0;
> +
> +    if (n->params.fill_pattern == 0) {
> +        ns->id_ns.dlfeat = 0x01;
> +    } else if (n->params.fill_pattern == 0xff) {
> +        ns->id_ns.dlfeat = 0x02;
> +    }
> +
> +    return 0;
> +}
> +
> +static void nvme_zoned_clear(NvmeCtrl *n)
> +{
> +    int i;
> +
> +    for (i = 0; i < n->num_namespaces; i++) {
> +        NvmeNamespace *ns = &n->namespaces[i];
> +        g_free(ns->id_ns_zoned);
> +        g_free(ns->zone_array);
> +        g_free(ns->exp_open_zones);
> +        g_free(ns->imp_open_zones);
> +        g_free(ns->closed_zones);
> +        g_free(ns->full_zones);
> +    }
> +}
> +
>  static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
>  {
>      NvmeParams *params = &n->params;
> @@ -1674,18 +2549,13 @@ static void nvme_init_state(NvmeCtrl *n)
>  
>  static void nvme_init_blk(NvmeCtrl *n, Error **errp)
>  {
> +    int64_t bs_size;
> +
>      if (!blkconf_blocksizes(&n->conf, errp)) {
>          return;
>      }
>      blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
>                                    false, errp);
> -}
> -
> -static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> -{
> -    int64_t bs_size;
> -    NvmeIdNs *id_ns = &ns->id_ns;
> -    int lba_index;
>  
>      bs_size = blk_getlength(n->conf.blk);
>      if (bs_size < 0) {
> @@ -1694,6 +2564,12 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>      }
>  
>      n->ns_size = bs_size;
> +}
> +
> +static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> +{
> +    NvmeIdNs *id_ns = &ns->id_ns;
> +    int lba_index;
>  
>      ns->csi = NVME_CSI_NVM;
>      qemu_uuid_generate(&ns->uuid); /* TODO make UUIDs persistent */
> @@ -1701,8 +2577,18 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>      id_ns->lbaf[lba_index].ds = nvme_ilog2(n->conf.logical_block_size);
>      id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(n, ns));
>  
> +    if (n->params.zoned) {
> +        ns->csi = NVME_CSI_ZONED;
> +        id_ns->ncap = cpu_to_le64(n->params.zone_capacity * n->num_zones);

Ah yes, right. It is in MiB when the user specifies it, but now it's in
LBAs, so this is correct. Please, in general, do not overwrite the device
parameters; it's super confusing ;)
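
E.g., with the zone capacity kept in a dedicated field (hypothetical
name):

    id_ns->ncap = cpu_to_le64(n->zone_capacity_lbas * n->num_zones);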

> +        if (nvme_zoned_init_ns(n, ns, lba_index, errp) != 0) {
> +            return;
> +        }
> +    } else {
> +        ns->csi = NVME_CSI_NVM;
> +        id_ns->ncap = id_ns->nsze;
> +    }
> +
>      /* no thin provisioning */
> -    id_ns->ncap = id_ns->nsze;
>      id_ns->nuse = id_ns->ncap;
>  }
>  
> @@ -1817,7 +2703,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      id->ieee[2] = 0xb3;
>      id->oacs = cpu_to_le16(0);
>      id->frmw = 7 << 1;
> -    id->lpa = 1 << 0;
> +    id->lpa = 1 << 1;

This probably belongs in the CSE patch, since LPA bit 1 advertises the
Commands Supported and Effects log. Note that it also drops bit 0
(per-namespace SMART/Health log); if that was intentional before, this
should probably be (1 << 0) | (1 << 1).

>      id->sqes = (0x6 << 4) | 0x6;
>      id->cqes = (0x4 << 4) | 0x4;
>      id->nn = cpu_to_le32(n->num_namespaces);
> @@ -1834,8 +2720,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      NVME_CAP_SET_CQR(n->bar.cap, 1);
>      NVME_CAP_SET_TO(n->bar.cap, 0xf);
>      /*
> -     * The driver now always supports NS Types, but all commands that
> -     * support CSI field will only handle NVM Command Set.
> +     * The driver now always supports NS Types, even when "zoned" property
> +     * is set to zero. If this is the case, all commands that support CSI field
> +     * only handle NVM Command Set.
>       */
>      NVME_CAP_SET_CSS(n->bar.cap, (CAP_CSS_NVM | CAP_CSS_CSI_SUPP));
>      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
> @@ -1871,6 +2758,13 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>          return;
>      }
>  
> +    if (n->params.zoned) {
> +        nvme_zoned_init_ctrl(n, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);

I don't think the propagate is needed if you are not doing anything with
the error; you can use errp directly. I think.
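
I.e., with ERRP_GUARD() at the top of nvme_realize(), this could just be
(sketch):

    if (n->params.zoned) {
        nvme_zoned_init_ctrl(n, errp);
        if (*errp) {
            return;
        }
    }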

> +            return;
> +        }
> +    }
>      nvme_init_ctrl(n, pci_dev);
>  
>      ns = n->namespaces;
> @@ -1889,6 +2783,9 @@ static void nvme_exit(PCIDevice *pci_dev)
>      NvmeCtrl *n = NVME(pci_dev);
>  
>      nvme_clear_ctrl(n);
> +    if (n->params.zoned) {
> +        nvme_zoned_clear(n);
> +    }
>      g_free(n->namespaces);
>      g_free(n->cq);
>      g_free(n->sq);
> @@ -1912,6 +2809,12 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT32("num_queues", NvmeCtrl, params.num_queues, 0),
>      DEFINE_PROP_UINT32("max_ioqpairs", NvmeCtrl, params.max_ioqpairs, 64),
>      DEFINE_PROP_UINT16("msix_qsize", NvmeCtrl, params.msix_qsize, 65),
> +    DEFINE_PROP_BOOL("zoned", NvmeCtrl, params.zoned, false),
> +    DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
> +    DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),

There is a weird mismatch between the parameter default and the
NVME_DEFAULT_ZONE_SIZE - should probably use the macro here.

In nvme_zoned_init_ctrl a default is set if the user specifically
specifies 0. I think that is very surprising behavior, and it surprised me
when I configured it. If the user specifies zero, then error out - the
property already sets a default. This goes for zone_capacity as well.
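
E.g. in nvme_check_constraints() (sketch):

    if (params->zoned) {
        if (!params->zone_size) {
            error_setg(errp, "zone_size can't be zero");
            return;
        }
        if (!params->zone_capacity) {
            error_setg(errp, "zone_capacity can't be zero");
            return;
        }
    }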

> +    DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),

Same, use the macro here, but I actually think that 0 is a reasonable
default.

> +    DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
> +    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions
  2020-06-30  4:57   ` Klaus Jensen
@ 2020-06-30 16:04     ` Niklas Cassel
  0 siblings, 0 replies; 49+ messages in thread
From: Niklas Cassel @ 2020-06-30 16:04 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Damien Le Moal, qemu-block, Dmitry Fomichev,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Tue, Jun 30, 2020 at 06:57:16AM +0200, Klaus Jensen wrote:
> On Jun 18 06:34, Dmitry Fomichev wrote:
> > From: Niklas Cassel <niklas.cassel@wdc.com>
> > 
> > Define the structures and constants required to implement
> > Namespace Types support.
> > 
> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme.h      |  3 ++
> >  include/block/nvme.h | 75 +++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 73 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 4f0dac39ae..4fd155c409 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -63,6 +63,9 @@ typedef struct NvmeCQueue {
> >  
> >  typedef struct NvmeNamespace {
> >      NvmeIdNs        id_ns;
> > +    uint32_t        nsid;
> > +    uint8_t         csi;
> > +    QemuUUID        uuid;
> >  } NvmeNamespace;
> >  
> >  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 6a58bac0c2..5a1e5e137c 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -50,6 +50,11 @@ enum NvmeCapMask {
> >      CAP_PMR_MASK       = 0x1,
> >  };
> >  
> > +enum NvmeCapCssBits {
> > +    CAP_CSS_NVM        = 0x01,
> > +    CAP_CSS_CSI_SUPP   = 0x40,
> > +};
> > +
> >  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> >  #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)    & CAP_CQR_MASK)
> >  #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)    & CAP_AMS_MASK)
> > @@ -101,6 +106,12 @@ enum NvmeCcMask {
> >      CC_IOCQES_MASK  = 0xf,
> >  };
> >  
> > +enum NvmeCcCss {
> > +    CSS_NVM_ONLY        = 0,
> > +    CSS_ALL_NSTYPES     = 6,
> 
> Maybe we could call this CSS_CSI, since it just specifies that one or
> more command sets are supported, not that ALL namespace types are
> supported.

The enum name here is CcCss, so this represents CC.CSS,
which specifies which Command Sets to enable,
not which Command Sets that are supported.

(Supported Command Sets are defined by CAP.CSS and the
I/O Command Set data structure.)

So it indeed says to enable ALL command sets supported by the
controller. (Although for the CSI case, you need to check the
I/O Command Set data structure to know what is actually supported.)


However, I agree, the name CSS_ALL_NSTYPES is a bit misleading.
ALL_SUPPORTED_CSI would have been a more precise name.
That said, simply naming it CSS_CSI, like you suggest, is more intuitive,
and is what we use in the Linux kernel patches, so let's use that :)
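
I.e. (values unchanged, only the middle name):

    enum NvmeCcCss {
        CSS_NVM_ONLY        = 0,
        CSS_CSI             = 6,
        CSS_ADMIN_ONLY      = 7,
    };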


Kind regards,
Niklas

> 
> Otherwise,
> Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
> 
> > +    CSS_ADMIN_ONLY      = 7,
> > +};
> > +
> >  #define NVME_CC_EN(cc)     ((cc >> CC_EN_SHIFT)     & CC_EN_MASK)
> >  #define NVME_CC_CSS(cc)    ((cc >> CC_CSS_SHIFT)    & CC_CSS_MASK)
> >  #define NVME_CC_MPS(cc)    ((cc >> CC_MPS_SHIFT)    & CC_MPS_MASK)
> > @@ -109,6 +120,21 @@ enum NvmeCcMask {
> >  #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
> >  #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
> >  
> > +#define NVME_SET_CC_EN(cc, val)     \
> > +    (cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
> > +#define NVME_SET_CC_CSS(cc, val)    \
> > +    (cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
> > +#define NVME_SET_CC_MPS(cc, val)    \
> > +    (cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
> > +#define NVME_SET_CC_AMS(cc, val)    \
> > +    (cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
> > +#define NVME_SET_CC_SHN(cc, val)    \
> > +    (cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
> > +#define NVME_SET_CC_IOSQES(cc, val) \
> > +    (cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
> > +#define NVME_SET_CC_IOCQES(cc, val) \
> > +    (cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
> > +
> >  enum NvmeCstsShift {
> >      CSTS_RDY_SHIFT      = 0,
> >      CSTS_CFS_SHIFT      = 1,
> > @@ -482,10 +508,41 @@ typedef struct NvmeIdentify {
> >      uint64_t    rsvd2[2];
> >      uint64_t    prp1;
> >      uint64_t    prp2;
> > -    uint32_t    cns;
> > -    uint32_t    rsvd11[5];
> > +    uint8_t     cns;
> > +    uint8_t     rsvd4;
> > +    uint16_t    ctrlid;
> > +    uint16_t    nvmsetid;
> > +    uint8_t     rsvd3;
> > +    uint8_t     csi;
> > +    uint32_t    rsvd12[4];
> >  } NvmeIdentify;
> >  
> > +typedef struct NvmeNsIdDesc {
> > +    uint8_t     nidt;
> > +    uint8_t     nidl;
> > +    uint16_t    rsvd2;
> > +} NvmeNsIdDesc;
> > +
> > +enum NvmeNidType {
> > +    NVME_NIDT_EUI64             = 0x01,
> > +    NVME_NIDT_NGUID             = 0x02,
> > +    NVME_NIDT_UUID              = 0x03,
> > +    NVME_NIDT_CSI               = 0x04,
> > +};
> > +
> > +enum NvmeNidLength {
> > +    NVME_NIDL_EUI64             = 8,
> > +    NVME_NIDL_NGUID             = 16,
> > +    NVME_NIDL_UUID              = 16,
> > +    NVME_NIDL_CSI               = 1,
> > +};
> > +
> > +enum NvmeCsi {
> > +    NVME_CSI_NVM                = 0x00,
> > +};
> > +
> > +#define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
> > +
> >  typedef struct NvmeRwCmd {
> >      uint8_t     opcode;
> >      uint8_t     flags;
> > @@ -603,6 +660,7 @@ enum NvmeStatusCodes {
> >      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
> >      NVME_INVALID_NSID           = 0x000b,
> >      NVME_CMD_SEQ_ERROR          = 0x000c,
> > +    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
> >      NVME_LBA_RANGE              = 0x0080,
> >      NVME_CAP_EXCEEDED           = 0x0081,
> >      NVME_NS_NOT_READY           = 0x0082,
> > @@ -729,9 +787,14 @@ typedef struct NvmePSD {
> >  #define NVME_IDENTIFY_DATA_SIZE 4096
> >  
> >  enum {
> > -    NVME_ID_CNS_NS             = 0x0,
> > -    NVME_ID_CNS_CTRL           = 0x1,
> > -    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
> > +    NVME_ID_CNS_NS                = 0x0,
> > +    NVME_ID_CNS_CTRL              = 0x1,
> > +    NVME_ID_CNS_NS_ACTIVE_LIST    = 0x2,
> > +    NVME_ID_CNS_NS_DESC_LIST      = 0x03,
> > +    NVME_ID_CNS_CS_NS             = 0x05,
> > +    NVME_ID_CNS_CS_CTRL           = 0x06,
> > +    NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
> > +    NVME_ID_CNS_IO_COMMAND_SET    = 0x1c,
> >  };
> >  
> >  typedef struct NvmeIdCtrl {
> > @@ -825,6 +888,7 @@ enum NvmeFeatureIds {
> >      NVME_WRITE_ATOMICITY            = 0xa,
> >      NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
> >      NVME_TIMESTAMP                  = 0xe,
> > +    NVME_COMMAND_SET_PROFILE        = 0x19,
> >      NVME_SOFTWARE_PROGRESS_MARKER   = 0x80
> >  };
> >  
> > @@ -914,6 +978,7 @@ static inline void _nvme_check_size(void)
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
> > +    QEMU_BUILD_BUG_ON(sizeof(NvmeNsIdDesc) != 4);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
> >      QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
> >  }
> > -- 
> > 2.21.0
> > 
> > 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions
  2020-06-30 10:02     ` Niklas Cassel
@ 2020-06-30 17:02       ` Keith Busch
  0 siblings, 0 replies; 49+ messages in thread
From: Keith Busch @ 2020-06-30 17:02 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Kevin Wolf, Damien Le Moal, Qemu-block, Dmitry Fomichev,
	qemu-devel@nongnu.org Developers, Alistair Francis,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Tue, Jun 30, 2020 at 10:02:15AM +0000, Niklas Cassel wrote:
> On Mon, Jun 29, 2020 at 07:12:47PM -0700, Alistair Francis wrote:
> > On Wed, Jun 17, 2020 at 2:47 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
> > > +    uint16_t    ctrlid;
> > 
> > Shouldn't this be CNTID?
> 
> From the NVMe spec:
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf
> 
> Figure 241:
> Controller  Identifier  (CNTID)
> 
> So you are correct, this is the official abbreviation.
> 
I guess that I wanted to keep it in sync with Linux:
> https://github.com/torvalds/linux/blob/master/include/linux/nvme.h#L974
> 
> Which uses ctrlid.
> 
> 
> Looking further at the NVMe spec:
> In Figure 247 (Identify Controller Data Structure) they use other names
> for fields:
> 
> Controller  ID  (CNTLID)
> Controller Attributes (CTRATT)
> 
> I can understand if they want to have unique names for fields, but it
> seems like they have trouble deciding how to abbreviate controller :)
> 
Personally, I think ctrlid makes it more obvious that we are talking
about a controller and not a count. But I'm fine regardless.

They shouldn't have shortened controller to "CNT". For those of us that
can't help but pronounce these as words, that is a vulgarity in English.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 06/18] hw/block/nvme: Define trace events related to NS Types
  2020-06-17 21:34 ` [PATCH v2 06/18] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
  2020-06-30 10:20   ` Klaus Jensen
@ 2020-06-30 20:18   ` Alistair Francis
  1 sibling, 0 replies; 49+ messages in thread
From: Alistair Francis @ 2020-06-30 20:18 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 2:46 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> A few trace events are defined that are relevant to implementing
> Namespace Types (NVMe TP 4056).
>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Alistair Francis <alistair.francis@wdc.com>

Alistair

> ---
>  hw/block/trace-events | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 423d491e27..3f3323fe38 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -39,8 +39,13 @@ pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size,
>  pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
>  pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, cqid=%"PRIu16""
>  pci_nvme_identify_ctrl(void) "identify controller"
> +pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
>  pci_nvme_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
> +pci_nvme_identify_ns_csi(uint16_t ns, uint8_t csi) "identify namespace, nsid=%"PRIu16", csi=0x%"PRIx8""
>  pci_nvme_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
> +pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "identify namespace list, nsid=%"PRIu16", csi=0x%"PRIx8""
> +pci_nvme_list_ns_descriptors(void) "identify namespace descriptors"
> +pci_nvme_identify_cmd_set(void) "identify i/o command set"
>  pci_nvme_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
>  pci_nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
>  pci_nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> @@ -59,6 +64,8 @@ pci_nvme_mmio_stopped(void) "cleared controller enable bit"
>  pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
>  pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
>  pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
> +pci_nvme_css_nvm_cset_selected_by_host(uint32_t cc) "NVM command set selected by host, bar.cc=0x%"PRIx32""
> +pci_nvme_css_all_csets_sel_by_host(uint32_t cc) "all supported command sets selected by host, bar.cc=0x%"PRIx32""
>
>  # nvme traces for error conditions
>  pci_nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> @@ -72,6 +79,9 @@ pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
>  pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
>  pci_nvme_err_invalid_effects_log_len(uint32_t len) "commands supported and effects log size is 4096, got %"PRIu32""
> +pci_nvme_err_change_css_when_enabled(void) "changing CC.CSS while controller is enabled"
> +pci_nvme_err_only_nvm_cmd_set_avail(void) "setting 110b CC.CSS, but only NVM command set is enabled"
> +pci_nvme_err_invalid_iocsci(uint32_t idx) "unsupported command set combination index %"PRIu32""
>  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
> @@ -127,6 +137,7 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion qu
>  pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
>  pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
>  pci_nvme_unsupported_log_page(uint16_t lid) "unsupported log page 0x%"PRIx16""
> +pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"
>
>  # xen-block.c
>  xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions
  2020-06-17 21:34 ` [PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
  2020-06-30 11:44   ` Klaus Jensen
@ 2020-06-30 22:11   ` Alistair Francis
  1 sibling, 0 replies; 49+ messages in thread
From: Alistair Francis @ 2020-06-30 22:11 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 2:51 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> Define values and structures that are needed to support Zoned
> Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
>
> All new protocol definitions are located in include/block/nvme.h
> and everything added that is specific to this implementation is kept
> in hw/block/nvme.h.
>
> In order to improve scalability, all open, closed and full zones
> are organized in separate linked lists. Consequently, almost all
> zone operations don't require scanning of the entire zone array
> (which potentially can be quite large) - it is only necessary to
> enumerate one or more zone lists. Zone lists are designed to be
> position-independent as they can be persisted to the backing file
> as a part of zone metadata. NvmeZoneList struct defined in this patch
> serves as a head of every zone list.
>
> NvmeZone structure encapsulates NvmeZoneDescriptor defined in Zoned
> Command Set specification and adds a few more fields that are
> internal to this implementation.
>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.h      | 130 +++++++++++++++++++++++++++++++++++++++++++
>  include/block/nvme.h | 119 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 248 insertions(+), 1 deletion(-)
>
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 0d29f75475..2c932b5e29 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -3,12 +3,22 @@
>
>  #include "block/nvme.h"
>
> +#define NVME_DEFAULT_ZONE_SIZE   128 /* MiB */
> +#define NVME_DEFAULT_MAX_ZA_SIZE 128 /* KiB */
> +
>  typedef struct NvmeParams {
>      char     *serial;
>      uint32_t num_queues; /* deprecated since 5.1 */
>      uint32_t max_ioqpairs;
>      uint16_t msix_qsize;
>      uint32_t cmb_size_mb;
> +
> +    bool        zoned;
> +    bool        cross_zone_read;
> +    uint8_t     fill_pattern;
> +    uint32_t    zamds_bs;
> +    uint64_t    zone_size;
> +    uint64_t    zone_capacity;
>  } NvmeParams;
>
>  typedef struct NvmeAsyncEvent {
> @@ -17,6 +27,8 @@ typedef struct NvmeAsyncEvent {
>
>  enum NvmeRequestFlags {
>      NVME_REQ_FLG_HAS_SG   = 1 << 0,
> +    NVME_REQ_FLG_FILL     = 1 << 1,
> +    NVME_REQ_FLG_APPEND   = 1 << 2,
>  };
>
>  typedef struct NvmeRequest {
> @@ -24,6 +36,7 @@ typedef struct NvmeRequest {
>      BlockAIOCB              *aiocb;
>      uint16_t                status;
>      uint16_t                flags;
> +    uint64_t                fill_ofs;
>      NvmeCqe                 cqe;
>      BlockAcctCookie         acct;
>      QEMUSGList              qsg;
> @@ -61,11 +74,35 @@ typedef struct NvmeCQueue {
>      QTAILQ_HEAD(, NvmeRequest) req_list;
>  } NvmeCQueue;
>
> +typedef struct NvmeZone {
> +    NvmeZoneDescr   d;
> +    uint64_t        tstamp;
> +    uint32_t        next;
> +    uint32_t        prev;
> +    uint8_t         rsvd80[8];
> +} NvmeZone;
> +
> +#define NVME_ZONE_LIST_NIL    UINT_MAX
> +
> +typedef struct NvmeZoneList {
> +    uint32_t        head;
> +    uint32_t        tail;
> +    uint32_t        size;
> +    uint8_t         rsvd12[4];
> +} NvmeZoneList;
> +
>  typedef struct NvmeNamespace {
>      NvmeIdNs        id_ns;
>      uint32_t        nsid;
>      uint8_t         csi;
>      QemuUUID        uuid;
> +
> +    NvmeIdNsZoned   *id_ns_zoned;
> +    NvmeZone        *zone_array;
> +    NvmeZoneList    *exp_open_zones;
> +    NvmeZoneList    *imp_open_zones;
> +    NvmeZoneList    *closed_zones;
> +    NvmeZoneList    *full_zones;
>  } NvmeNamespace;
>
>  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> @@ -100,6 +137,7 @@ typedef struct NvmeCtrl {
>      uint32_t    num_namespaces;
>      uint32_t    max_q_ents;
>      uint64_t    ns_size;
> +
>      uint8_t     *cmbuf;
>      uint32_t    irq_status;
>      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
> @@ -107,6 +145,12 @@ typedef struct NvmeCtrl {
>
>      HostMemoryBackend *pmrdev;
>
> +    int             zone_file_fd;
> +    uint32_t        num_zones;
> +    uint64_t        zone_size_bs;
> +    uint64_t        zone_array_size;
> +    uint8_t         zamds;
> +
>      NvmeNamespace   *namespaces;
>      NvmeSQueue      **sq;
>      NvmeCQueue      **cq;
> @@ -121,6 +165,86 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
>      return n->ns_size >> nvme_ns_lbads(ns);
>  }
>
> +static inline uint8_t nvme_get_zone_state(NvmeZone *zone)
> +{
> +    return zone->d.zs >> 4;
> +}
> +
> +static inline void nvme_set_zone_state(NvmeZone *zone, enum NvmeZoneState state)
> +{
> +    zone->d.zs = state << 4;
> +}
> +
> +static inline uint64_t nvme_zone_rd_boundary(NvmeCtrl *n, NvmeZone *zone)
> +{
> +    return zone->d.zslba + n->params.zone_size;
> +}
> +
> +static inline uint64_t nvme_zone_wr_boundary(NvmeZone *zone)
> +{
> +    return zone->d.zslba + zone->d.zcap;
> +}
> +
> +static inline bool nvme_wp_is_valid(NvmeZone *zone)
> +{
> +    uint8_t st = nvme_get_zone_state(zone);
> +
> +    return st != NVME_ZONE_STATE_FULL &&
> +           st != NVME_ZONE_STATE_READ_ONLY &&
> +           st != NVME_ZONE_STATE_OFFLINE;
> +}
> +
> +/*
> + * Initialize a zone list head.
> + */
> +static inline void nvme_init_zone_list(NvmeZoneList *zl)
> +{
> +    zl->head = NVME_ZONE_LIST_NIL;
> +    zl->tail = NVME_ZONE_LIST_NIL;
> +    zl->size = 0;
> +}
> +
> +/*
> + * Initialize the number of entries contained in a zone list.
> + */

This should be retrieve (or something similar) instead of initialise.
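
I.e. something like:

    /*
     * Return the number of entries contained in a zone list.
     */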

> +static inline uint32_t nvme_zone_list_size(NvmeZoneList *zl)
> +{
> +    return zl->size;
> +}
> +
> +/*
> + * Check if the zone is not currently included into any zone list.
> + */
> +static inline bool nvme_zone_not_in_list(NvmeZone *zone)
> +{
> +    return (bool)(zone->prev == 0 && zone->next == 0);

You don't need the cast to bool.
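
I.e. just:

    return zone->prev == 0 && zone->next == 0;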

Besides that it looks good. I didn't check every value against the spec though.

Acked-by: Alistair Francis <alistair.francis@wdc.com>

Alistair

> +}
> +
> +/*
> + * Return the zone at the head of zone list or NULL if the list is empty.
> + */
> +static inline NvmeZone *nvme_peek_zone_head(NvmeNamespace *ns, NvmeZoneList *zl)
> +{
> +    if (zl->head == NVME_ZONE_LIST_NIL) {
> +        return NULL;
> +    }
> +    return &ns->zone_array[zl->head];
> +}
> +
> +/*
> + * Return the next zone in the list.
> + */
> +static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
> +    NvmeZoneList *zl)
> +{
> +    assert(!nvme_zone_not_in_list(z));
> +
> +    if (z->next == NVME_ZONE_LIST_NIL) {
> +        return NULL;
> +    }
> +    return &ns->zone_array[z->next];
> +}
> +
>  static inline int nvme_ilog2(uint64_t i)
>  {
>      int log = -1;
> @@ -132,4 +256,10 @@ static inline int nvme_ilog2(uint64_t i)
>      return log;
>  }
>
> +static inline void _hw_nvme_check_size(void)
> +{
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneList) != 16);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeZone) != 88);
> +}
> +
>  #endif /* HW_NVME_H */
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 5a1e5e137c..596c39162b 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -446,6 +446,9 @@ enum NvmeIoCommands {
>      NVME_CMD_COMPARE            = 0x05,
>      NVME_CMD_WRITE_ZEROS        = 0x08,
>      NVME_CMD_DSM                = 0x09,
> +    NVME_CMD_ZONE_MGMT_SEND     = 0x79,
> +    NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
> +    NVME_CMD_ZONE_APND          = 0x7d,
>  };
>
>  typedef struct NvmeDeleteQ {
> @@ -539,6 +542,7 @@ enum NvmeNidLength {
>
>  enum NvmeCsi {
>      NVME_CSI_NVM                = 0x00,
> +    NVME_CSI_ZONED              = 0x02,
>  };
>
>  #define NVME_SET_CSI(vec, csi) (vec |= (uint8_t)(1 << (csi)))
> @@ -661,6 +665,7 @@ enum NvmeStatusCodes {
>      NVME_INVALID_NSID           = 0x000b,
>      NVME_CMD_SEQ_ERROR          = 0x000c,
>      NVME_CMD_SET_CMB_REJECTED   = 0x002b,
> +    NVME_INVALID_CMD_SET        = 0x002c,
>      NVME_LBA_RANGE              = 0x0080,
>      NVME_CAP_EXCEEDED           = 0x0081,
>      NVME_NS_NOT_READY           = 0x0082,
> @@ -684,6 +689,14 @@ enum NvmeStatusCodes {
>      NVME_CONFLICTING_ATTRS      = 0x0180,
>      NVME_INVALID_PROT_INFO      = 0x0181,
>      NVME_WRITE_TO_RO            = 0x0182,
> +    NVME_ZONE_BOUNDARY_ERROR    = 0x01b8,
> +    NVME_ZONE_FULL              = 0x01b9,
> +    NVME_ZONE_READ_ONLY         = 0x01ba,
> +    NVME_ZONE_OFFLINE           = 0x01bb,
> +    NVME_ZONE_INVALID_WRITE     = 0x01bc,
> +    NVME_ZONE_TOO_MANY_ACTIVE   = 0x01bd,
> +    NVME_ZONE_TOO_MANY_OPEN     = 0x01be,
> +    NVME_ZONE_INVAL_TRANSITION  = 0x01bf,
>      NVME_WRITE_FAULT            = 0x0280,
>      NVME_UNRECOVERED_READ       = 0x0281,
>      NVME_E2E_GUARD_ERROR        = 0x0282,
> @@ -807,7 +820,17 @@ typedef struct NvmeIdCtrl {
>      uint8_t     ieee[3];
>      uint8_t     cmic;
>      uint8_t     mdts;
> -    uint8_t     rsvd255[178];
> +    uint16_t    cntlid;
> +    uint32_t    ver;
> +    uint32_t    rtd3r;
> +    uint32_t    rtd3e;
> +    uint32_t    oaes;
> +    uint32_t    ctratt;
> +    uint8_t     rsvd100[28];
> +    uint16_t    crdt1;
> +    uint16_t    crdt2;
> +    uint16_t    crdt3;
> +    uint8_t     rsvd134[122];
>      uint16_t    oacs;
>      uint8_t     acl;
>      uint8_t     aerl;
> @@ -832,6 +855,11 @@ typedef struct NvmeIdCtrl {
>      uint8_t     vs[1024];
>  } NvmeIdCtrl;
>
> +typedef struct NvmeIdCtrlZoned {
> +    uint8_t     zamds;
> +    uint8_t     rsvd1[4095];
> +} NvmeIdCtrlZoned;
> +
>  enum NvmeIdCtrlOacs {
>      NVME_OACS_SECURITY  = 1 << 0,
>      NVME_OACS_FORMAT    = 1 << 1,
> @@ -908,6 +936,12 @@ typedef struct NvmeLBAF {
>      uint8_t     rp;
>  } NvmeLBAF;
>
> +typedef struct NvmeLBAFE {
> +    uint64_t    zsze;
> +    uint8_t     zdes;
> +    uint8_t     rsvd9[7];
> +} NvmeLBAFE;
> +
>  typedef struct NvmeIdNs {
>      uint64_t    nsze;
>      uint64_t    ncap;
> @@ -930,6 +964,19 @@ typedef struct NvmeIdNs {
>      uint8_t     vs[3712];
>  } NvmeIdNs;
>
> +typedef struct NvmeIdNsZoned {
> +    uint16_t    zoc;
> +    uint16_t    ozcs;
> +    uint32_t    mar;
> +    uint32_t    mor;
> +    uint32_t    rrl;
> +    uint32_t    frl;
> +    uint8_t     rsvd20[2796];
> +    NvmeLBAFE   lbafe[16];
> +    uint8_t     rsvd3072[768];
> +    uint8_t     vs[256];
> +} NvmeIdNsZoned;
> +
>
>  /*Deallocate Logical Block Features*/
>  #define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)       ((dlfeat) & 0x10)
> @@ -962,6 +1009,71 @@ enum NvmeIdNsDps {
>      DPS_FIRST_EIGHT = 8,
>  };
>
> +enum NvmeZoneAttr {
> +    NVME_ZA_FINISHED_BY_CTLR         = 1 << 0,
> +    NVME_ZA_FINISH_RECOMMENDED       = 1 << 1,
> +    NVME_ZA_RESET_RECOMMENDED        = 1 << 2,
> +    NVME_ZA_ZD_EXT_VALID             = 1 << 7,
> +};
> +
> +typedef struct NvmeZoneReportHeader {
> +    uint64_t    nr_zones;
> +    uint8_t     rsvd[56];
> +} NvmeZoneReportHeader;
> +
> +enum NvmeZoneReceiveAction {
> +    NVME_ZONE_REPORT                 = 0,
> +    NVME_ZONE_REPORT_EXTENDED        = 1,
> +};
> +
> +enum NvmeZoneReportType {
> +    NVME_ZONE_REPORT_ALL             = 0,
> +    NVME_ZONE_REPORT_EMPTY           = 1,
> +    NVME_ZONE_REPORT_IMPLICITLY_OPEN = 2,
> +    NVME_ZONE_REPORT_EXPLICITLY_OPEN = 3,
> +    NVME_ZONE_REPORT_CLOSED          = 4,
> +    NVME_ZONE_REPORT_FULL            = 5,
> +    NVME_ZONE_REPORT_READ_ONLY       = 6,
> +    NVME_ZONE_REPORT_OFFLINE         = 7,
> +};
> +
> +typedef struct NvmeZoneDescr {
> +    uint8_t     zt;
> +    uint8_t     zs;
> +    uint8_t     za;
> +    uint8_t     rsvd3[5];
> +    uint64_t    zcap;
> +    uint64_t    zslba;
> +    uint64_t    wp;
> +    uint8_t     rsvd32[32];
> +} NvmeZoneDescr;
> +
> +enum NvmeZoneState {
> +    NVME_ZONE_STATE_RESERVED         = 0x00,
> +    NVME_ZONE_STATE_EMPTY            = 0x01,
> +    NVME_ZONE_STATE_IMPLICITLY_OPEN  = 0x02,
> +    NVME_ZONE_STATE_EXPLICITLY_OPEN  = 0x03,
> +    NVME_ZONE_STATE_CLOSED           = 0x04,
> +    NVME_ZONE_STATE_READ_ONLY        = 0x0D,
> +    NVME_ZONE_STATE_FULL             = 0x0E,
> +    NVME_ZONE_STATE_OFFLINE          = 0x0F,
> +};
> +
> +enum NvmeZoneType {
> +    NVME_ZONE_TYPE_RESERVED          = 0x00,
> +    NVME_ZONE_TYPE_SEQ_WRITE         = 0x02,
> +};
> +
> +enum NvmeZoneSendAction {
> +    NVME_ZONE_ACTION_RSD             = 0x00,
> +    NVME_ZONE_ACTION_CLOSE           = 0x01,
> +    NVME_ZONE_ACTION_FINISH          = 0x02,
> +    NVME_ZONE_ACTION_OPEN            = 0x03,
> +    NVME_ZONE_ACTION_RESET           = 0x04,
> +    NVME_ZONE_ACTION_OFFLINE         = 0x05,
> +    NVME_ZONE_ACTION_SET_ZD_EXT      = 0x10,
> +};
> +
>  static inline void _nvme_check_size(void)
>  {
>      QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
> @@ -978,8 +1090,13 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrlZoned) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeNsIdDesc) != 4);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAF) != 4);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAFE) != 16);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZoned) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
>  }
>  #endif
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits
  2020-06-17 21:34 ` [PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
@ 2020-07-01  0:26   ` Alistair Francis
  2020-07-01  6:41   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Alistair Francis @ 2020-07-01  0:26 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 3:07 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> Added two module properties, "max_active" and "max_open" to control
> the maximum number of zones that can be active or open. Once these
> variables are set to non-default values, the driver checks these
> limits during I/O and returns Too Many Active or Too Many Open
> command status if they are exceeded.
>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/block/nvme.h |   4 ++
>  2 files changed, 185 insertions(+), 2 deletions(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 2e03b0b6ed..05a7cbcfcc 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -120,6 +120,87 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
>      zone->prev = zone->next = 0;
>  }
>
> +/*
> + * Take the first zone out from a list, return NULL if the list is empty.
> + */
> +static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZoneList *zl)
> +{
> +    NvmeZone *zone = nvme_peek_zone_head(ns, zl);
> +
> +    if (zone) {
> +        --zl->size;
> +        if (zl->size == 0) {
> +            zl->head = NVME_ZONE_LIST_NIL;
> +            zl->tail = NVME_ZONE_LIST_NIL;
> +        } else {
> +            zl->head = zone->next;
> +            ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
> +        }
> +        zone->prev = zone->next = 0;
> +    }
> +
> +    return zone;
> +}
> +
> +/*
> + * Check if we can open a zone without exceeding open/active limits.
> + * AOR stands for "Active and Open Resources" (see TP 4053 section 2.5).
> + */
> +static int nvme_aor_check(NvmeCtrl *n, NvmeNamespace *ns,
> +     uint32_t act, uint32_t opn)
> +{
> +    if (n->params.max_active_zones != 0 &&
> +        ns->nr_active_zones + act > n->params.max_active_zones) {
> +        trace_pci_nvme_err_insuff_active_res(n->params.max_active_zones);
> +        return NVME_ZONE_TOO_MANY_ACTIVE | NVME_DNR;
> +    }
> +    if (n->params.max_open_zones != 0 &&
> +        ns->nr_open_zones + opn > n->params.max_open_zones) {
> +        trace_pci_nvme_err_insuff_open_res(n->params.max_open_zones);
> +        return NVME_ZONE_TOO_MANY_OPEN | NVME_DNR;
> +    }
> +
> +    return NVME_SUCCESS;
> +}
> +
> +static inline void nvme_aor_inc_open(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    assert(ns->nr_open_zones >= 0);
> +    if (n->params.max_open_zones) {
> +        ns->nr_open_zones++;
> +        assert(ns->nr_open_zones <= n->params.max_open_zones);
> +    }
> +}
> +
> +static inline void nvme_aor_dec_open(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    if (n->params.max_open_zones) {
> +        assert(ns->nr_open_zones > 0);
> +        ns->nr_open_zones--;
> +    }
> +    assert(ns->nr_open_zones >= 0);
> +}
> +
> +static inline void nvme_aor_inc_active(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    assert(ns->nr_active_zones >= 0);
> +    if (n->params.max_active_zones) {
> +        ns->nr_active_zones++;
> +        assert(ns->nr_active_zones <= n->params.max_active_zones);
> +    }
> +}
> +
> +static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    if (n->params.max_active_zones) {
> +        assert(ns->nr_active_zones > 0);
> +        ns->nr_active_zones--;
> +        assert(ns->nr_active_zones >= ns->nr_open_zones);
> +    }
> +    assert(ns->nr_active_zones >= 0);
> +}
> +
>  static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint8_t state)
>  {
> @@ -454,6 +535,24 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>      timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>  }
>
> +static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
> +    bool implicit, bool adding_active)
> +{
> +    NvmeZone *zone;
> +
> +    if (implicit && n->params.max_open_zones &&
> +        ns->nr_open_zones == n->params.max_open_zones) {
> +        zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
> +        if (zone) {
> +            /*
> +             * Automatically close this implicitly open zone.
> +             */
> +            nvme_aor_dec_open(n, ns);
> +            nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
> +        }
> +    }
> +}
> +
>  static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
>      uint32_t nlb)
>  {
> @@ -531,6 +630,23 @@ static uint16_t nvme_check_zone_read(NvmeCtrl *n, NvmeZone *zone, uint64_t slba,
>      return status;
>  }
>
> +static uint16_t nvme_auto_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone)
> +{
> +    uint16_t status = NVME_SUCCESS;
> +    uint8_t zs = nvme_get_zone_state(zone);
> +
> +    if (zs == NVME_ZONE_STATE_EMPTY) {
> +        nvme_auto_transition_zone(n, ns, true, true);
> +        status = nvme_aor_check(n, ns, 1, 1);
> +    } else if (zs == NVME_ZONE_STATE_CLOSED) {
> +        nvme_auto_transition_zone(n, ns, true, false);
> +        status = nvme_aor_check(n, ns, 0, 1);
> +    }
> +
> +    return status;
> +}
> +
>  static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint32_t nlb)
>  {
> @@ -543,7 +659,11 @@ static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
>          switch (zs) {
>          case NVME_ZONE_STATE_IMPLICITLY_OPEN:
>          case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +            nvme_aor_dec_open(n, ns);
> +            /* fall through */
>          case NVME_ZONE_STATE_CLOSED:
> +            nvme_aor_dec_active(n, ns);
> +            /* fall through */
>          case NVME_ZONE_STATE_EMPTY:
>              break;
>          default:
> @@ -553,7 +673,10 @@ static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
>      } else {
>          switch (zs) {
>          case NVME_ZONE_STATE_EMPTY:
> +            nvme_aor_inc_active(n, ns);
> +            /* fall through */
>          case NVME_ZONE_STATE_CLOSED:
> +            nvme_aor_inc_open(n, ns);
>              nvme_assign_zone_state(n, ns, zone,
>                                     NVME_ZONE_STATE_IMPLICITLY_OPEN);
>          }
> @@ -636,6 +759,11 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>                                                 zone->d.wp);
>              return NVME_ZONE_INVALID_WRITE | NVME_DNR;
>          }
> +
> +        status = nvme_auto_open_zone(n, ns, zone);
> +        if (status != NVME_SUCCESS) {
> +            return status;
> +        }
>      }
>
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> @@ -709,6 +837,11 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>                                                     zone->d.wp);
>                  return NVME_ZONE_INVALID_WRITE | NVME_DNR;
>              }
> +
> +            status = nvme_auto_open_zone(n, ns, zone);
> +            if (status != NVME_SUCCESS) {
> +                return status;
> +            }
>          } else {
>              status = nvme_check_zone_read(n, zone, slba, nlb,
>                                            n->params.cross_zone_read);
> @@ -804,9 +937,27 @@ static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeCtrl *n, NvmeNamespace *ns,
>  static uint16_t nvme_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint8_t state)
>  {
> +    uint16_t status;
> +
>      switch (state) {
>      case NVME_ZONE_STATE_EMPTY:
> +        nvme_auto_transition_zone(n, ns, false, true);
> +        status = nvme_aor_check(n, ns, 1, 0);
> +        if (status != NVME_SUCCESS) {
> +            return status;
> +        }
> +        nvme_aor_inc_active(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_CLOSED:
> +        status = nvme_aor_check(n, ns, 0, 1);
> +        if (status != NVME_SUCCESS) {
> +            if (state == NVME_ZONE_STATE_EMPTY) {
> +                nvme_aor_dec_active(n, ns);
> +            }
> +            return status;
> +        }
> +        nvme_aor_inc_open(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
>          nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
>          /* fall through */
> @@ -828,6 +979,7 @@ static uint16_t nvme_close_zone(NvmeCtrl *n,  NvmeNamespace *ns,
>      switch (state) {
>      case NVME_ZONE_STATE_EXPLICITLY_OPEN:
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_aor_dec_open(n, ns);
>          nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
>          /* fall through */
>      case NVME_ZONE_STATE_CLOSED:
> @@ -849,7 +1001,11 @@ static uint16_t nvme_finish_zone(NvmeCtrl *n, NvmeNamespace *ns,
>      switch (state) {
>      case NVME_ZONE_STATE_EXPLICITLY_OPEN:
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_aor_dec_open(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_CLOSED:
> +        nvme_aor_dec_active(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_EMPTY:
>          zone->d.wp = nvme_zone_wr_boundary(zone);
>          nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
> @@ -874,7 +1030,11 @@ static uint16_t nvme_reset_zone(NvmeCtrl *n, NvmeNamespace *ns,
>      switch (state) {
>      case NVME_ZONE_STATE_EXPLICITLY_OPEN:
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_aor_dec_open(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_CLOSED:
> +        nvme_aor_dec_active(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_FULL:
>          zone->d.wp = zone->d.zslba;
>          nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EMPTY);
> @@ -2412,6 +2572,15 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>      uint64_t zone_size = 0, capacity;
>      uint32_t nz;
>
> +    if (n->params.max_open_zones < 0) {
> +        error_setg(errp, "invalid max_open_zones value");
> +        return;
> +    }
> +    if (n->params.max_active_zones < 0) {
> +        error_setg(errp, "invalid max_active_zones value");
> +        return;
> +    }
> +
>      if (n->params.zone_size) {
>          zone_size = n->params.zone_size;
>      } else {
> @@ -2435,6 +2604,14 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>      n->num_zones = nz;
>      n->zone_array_size = sizeof(NvmeZone) * nz;
>
> +    /* Make sure that the values of all Zoned Command Set properties are sane */
> +    if (n->params.max_open_zones > nz) {
> +        n->params.max_open_zones = nz;
> +    }
> +    if (n->params.max_active_zones > nz) {
> +        n->params.max_active_zones = nz;
> +    }

Should there be some warning here? You are overwriting the property
that was set by the board; it seems like you should tell someone.
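
Something like this would do (untested sketch, using QEMU's warn_report()):

    if (n->params.max_open_zones > nz) {
        warn_report("max_open_zones %d exceeds the number of zones %u, clamping",
                    n->params.max_open_zones, nz);
        n->params.max_open_zones = nz;
    }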

Alistair

> +
>      return;
>  }
>
> @@ -2452,8 +2629,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
>      ns->id_ns_zoned = g_malloc0(sizeof(*ns->id_ns_zoned));
>
>      /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
> -    ns->id_ns_zoned->mar = 0xffffffff;
> -    ns->id_ns_zoned->mor = 0xffffffff;
> +    ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
> +    ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
>      ns->id_ns_zoned->zoc = 0;
>      ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
>
> @@ -2813,6 +2990,8 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
>      DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),
>      DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
> +    DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
> +    DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
>      DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
>      DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
>      DEFINE_PROP_END_OF_LIST(),
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 2c932b5e29..f5a4679702 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -19,6 +19,8 @@ typedef struct NvmeParams {
>      uint32_t    zamds_bs;
>      uint64_t    zone_size;
>      uint64_t    zone_capacity;
> +    int32_t     max_active_zones;
> +    int32_t     max_open_zones;
>  } NvmeParams;
>
>  typedef struct NvmeAsyncEvent {
> @@ -103,6 +105,8 @@ typedef struct NvmeNamespace {
>      NvmeZoneList    *imp_open_zones;
>      NvmeZoneList    *closed_zones;
>      NvmeZoneList    *full_zones;
> +    int32_t         nr_open_zones;
> +    int32_t         nr_active_zones;
>  } NvmeNamespace;
>
>  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions
  2020-06-17 21:34 ` [PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions Dmitry Fomichev
@ 2020-07-01  0:30   ` Alistair Francis
  2020-07-01  6:12   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Alistair Francis @ 2020-07-01  0:30 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, Qemu-block,
	qemu-devel@nongnu.org Developers, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Wed, Jun 17, 2020 at 2:52 PM Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
>
> Added a Boolean flag to turn on simulation of Zone Active Excursions.
> If the flag, "active_excursions", is set to true, the driver will try
> to finish one of the currently open zones if the max active zones limit
> is about to be exceeded.
>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Alistair Francis <alistair.francis@wdc.com>

Alistair

> ---
>  hw/block/nvme.c | 24 +++++++++++++++++++++++-
>  hw/block/nvme.h |  1 +
>  2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 05a7cbcfcc..a29cbfcc96 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -540,6 +540,26 @@ static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
>  {
>      NvmeZone *zone;
>
> +    if (n->params.active_excursions && adding_active &&
> +        n->params.max_active_zones &&
> +        ns->nr_active_zones == n->params.max_active_zones) {
> +        zone = nvme_peek_zone_head(ns, ns->closed_zones);
> +        if (zone) {
> +            /*
> +             * The namespace is at the limit of active zones.
> +             * Try to finish one of the currently active zones
> +             * to make the needed active zone resource available.
> +             */
> +            nvme_aor_dec_active(n, ns);
> +            nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
> +            zone->d.za &= ~(NVME_ZA_FINISH_RECOMMENDED |
> +                            NVME_ZA_RESET_RECOMMENDED);
> +            zone->d.za |= NVME_ZA_FINISHED_BY_CTLR;
> +            zone->tstamp = 0;
> +            trace_pci_nvme_zone_finished_by_controller(zone->d.zslba);
> +        }
> +    }
> +
>      if (implicit && n->params.max_open_zones &&
>          ns->nr_open_zones == n->params.max_open_zones) {
>          zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
> @@ -2631,7 +2651,7 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
>      /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
>      ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
>      ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
> -    ns->id_ns_zoned->zoc = 0;
> +    ns->id_ns_zoned->zoc = cpu_to_le16(n->params.active_excursions ? 0x2 : 0);
>      ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
>
>      ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->params.zone_size);
> @@ -2993,6 +3013,8 @@ static Property nvme_props[] = {
>      DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
>      DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
>      DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
> +    DEFINE_PROP_BOOL("active_excursions", NvmeCtrl, params.active_excursions,
> +                     false),
>      DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index f5a4679702..8a0aaeb09a 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -15,6 +15,7 @@ typedef struct NvmeParams {
>
>      bool        zoned;
>      bool        cross_zone_read;
> +    bool        active_excursions;
>      uint8_t     fill_pattern;
>      uint32_t    zamds_bs;
>      uint64_t    zone_size;
> --
> 2.21.0
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions
  2020-06-17 21:34 ` [PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions Dmitry Fomichev
  2020-07-01  0:30   ` Alistair Francis
@ 2020-07-01  6:12   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-07-01  6:12 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> Added a Boolean flag to turn on simulation of Zone Active Excursions.
> If the flag, "active_excursions", is set to true, the driver will try
> to finish one of the currently open zones if the max active zones limit
> is about to be exceeded.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.c | 24 +++++++++++++++++++++++-
>  hw/block/nvme.h |  1 +
>  2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 05a7cbcfcc..a29cbfcc96 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -540,6 +540,26 @@ static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
>  {
>      NvmeZone *zone;
>  
> +    if (n->params.active_excursions && adding_active &&
> +        n->params.max_active_zones &&
> +        ns->nr_active_zones == n->params.max_active_zones) {
> +        zone = nvme_peek_zone_head(ns, ns->closed_zones);
> +        if (zone) {
> +            /*
> +             * The namespace is at the limit of active zones.
> +             * Try to finish one of the currently active zones
> +             * to make the needed active zone resource available.
> +             */
> +            nvme_aor_dec_active(n, ns);
> +            nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
> +            zone->d.za &= ~(NVME_ZA_FINISH_RECOMMENDED |
> +                            NVME_ZA_RESET_RECOMMENDED);
> +            zone->d.za |= NVME_ZA_FINISHED_BY_CTLR;
> +            zone->tstamp = 0;
> +            trace_pci_nvme_zone_finished_by_controller(zone->d.zslba);
> +        }
> +    }

Open Zones should also be considered for excursions.
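
E.g. fall back to the open lists when there is no closed zone to pick
(untested sketch using the helpers from the patch; note that an open zone
that gets finished also needs nvme_aor_dec_open()):

    zone = nvme_peek_zone_head(ns, ns->closed_zones);
    if (!zone) {
        zone = nvme_peek_zone_head(ns, ns->imp_open_zones);
    }
    if (!zone) {
        zone = nvme_peek_zone_head(ns, ns->exp_open_zones);
    }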

> +
>      if (implicit && n->params.max_open_zones &&
>          ns->nr_open_zones == n->params.max_open_zones) {
>          zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
> @@ -2631,7 +2651,7 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
>      /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
>      ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
>      ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
> -    ns->id_ns_zoned->zoc = 0;
> +    ns->id_ns_zoned->zoc = cpu_to_le16(n->params.active_excursions ? 0x2 : 0);
>      ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
>  
>      ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->params.zone_size);
> @@ -2993,6 +3013,8 @@ static Property nvme_props[] = {
>      DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
>      DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
>      DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
> +    DEFINE_PROP_BOOL("active_excursions", NvmeCtrl, params.active_excursions,
> +                     false),
>      DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index f5a4679702..8a0aaeb09a 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -15,6 +15,7 @@ typedef struct NvmeParams {
>  
>      bool        zoned;
>      bool        cross_zone_read;
> +    bool        active_excursions;
>      uint8_t     fill_pattern;
>      uint32_t    zamds_bs;
>      uint64_t    zone_size;
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits
  2020-06-17 21:34 ` [PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
  2020-07-01  0:26   ` Alistair Francis
@ 2020-07-01  6:41   ` Klaus Jensen
  1 sibling, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-07-01  6:41 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> Added two module properties, "max_active" and "max_open", to control
> the maximum number of zones that can be active or open. Once these
> variables are set to non-default values, the driver checks these
> limits during I/O and returns Too Many Active or Too Many Open
> command status if they are exceeded.
> 
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/block/nvme.h |   4 ++
>  2 files changed, 185 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 2e03b0b6ed..05a7cbcfcc 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -120,6 +120,87 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
>      zone->prev = zone->next = 0;
>  }
>  
> +/*
> + * Take the first zone out from a list, return NULL if the list is empty.
> + */
> +static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZoneList *zl)
> +{
> +    NvmeZone *zone = nvme_peek_zone_head(ns, zl);
> +
> +    if (zone) {
> +        --zl->size;
> +        if (zl->size == 0) {
> +            zl->head = NVME_ZONE_LIST_NIL;
> +            zl->tail = NVME_ZONE_LIST_NIL;
> +        } else {
> +            zl->head = zone->next;
> +            ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
> +        }
> +        zone->prev = zone->next = 0;
> +    }
> +
> +    return zone;
> +}
> +
> +/*
> + * Check if we can open a zone without exceeding open/active limits.
> + * AOR stands for "Active and Open Resources" (see TP 4053 section 2.5).
> + */
> +static int nvme_aor_check(NvmeCtrl *n, NvmeNamespace *ns,
> +     uint32_t act, uint32_t opn)
> +{
> +    if (n->params.max_active_zones != 0 &&
> +        ns->nr_active_zones + act > n->params.max_active_zones) {
> +        trace_pci_nvme_err_insuff_active_res(n->params.max_active_zones);
> +        return NVME_ZONE_TOO_MANY_ACTIVE | NVME_DNR;
> +    }
> +    if (n->params.max_open_zones != 0 &&
> +        ns->nr_open_zones + opn > n->params.max_open_zones) {
> +        trace_pci_nvme_err_insuff_open_res(n->params.max_open_zones);
> +        return NVME_ZONE_TOO_MANY_OPEN | NVME_DNR;
> +    }
> +
> +    return NVME_SUCCESS;
> +}
> +
> +static inline void nvme_aor_inc_open(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    assert(ns->nr_open_zones >= 0);
> +    if (n->params.max_open_zones) {
> +        ns->nr_open_zones++;
> +        assert(ns->nr_open_zones <= n->params.max_open_zones);
> +    }
> +}
> +
> +static inline void nvme_aor_dec_open(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    if (n->params.max_open_zones) {
> +        assert(ns->nr_open_zones > 0);
> +        ns->nr_open_zones--;
> +    }
> +    assert(ns->nr_open_zones >= 0);
> +}
> +
> +static inline void nvme_aor_inc_active(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    assert(ns->nr_active_zones >= 0);
> +    if (n->params.max_active_zones) {
> +        ns->nr_active_zones++;
> +        assert(ns->nr_active_zones <= n->params.max_active_zones);
> +    }
> +}
> +
> +static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    if (n->params.max_active_zones) {
> +        assert(ns->nr_active_zones > 0);
> +        ns->nr_active_zones--;
> +        assert(ns->nr_active_zones >= ns->nr_open_zones);
> +    }
> +    assert(ns->nr_active_zones >= 0);
> +}
> +
>  static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint8_t state)
>  {
> @@ -454,6 +535,24 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>      timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>  }
>  
> +static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
> +    bool implicit, bool adding_active)
> +{
> +    NvmeZone *zone;
> +
> +    if (implicit && n->params.max_open_zones &&
> +        ns->nr_open_zones == n->params.max_open_zones) {
> +        zone = nvme_remove_zone_head(n, ns, ns->imp_open_zones);
> +        if (zone) {
> +            /*
> +             * Automatically close this implicitly open zone.
> +             */
> +            nvme_aor_dec_open(n, ns);
> +            nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
> +        }
> +    }
> +}
> +
>  static uint16_t nvme_check_zone_write(NvmeZone *zone, uint64_t slba,
>      uint32_t nlb)
>  {
> @@ -531,6 +630,23 @@ static uint16_t nvme_check_zone_read(NvmeCtrl *n, NvmeZone *zone, uint64_t slba,
>      return status;
>  }
>  
> +static uint16_t nvme_auto_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone)
> +{
> +    uint16_t status = NVME_SUCCESS;
> +    uint8_t zs = nvme_get_zone_state(zone);
> +
> +    if (zs == NVME_ZONE_STATE_EMPTY) {
> +        nvme_auto_transition_zone(n, ns, true, true);
> +        status = nvme_aor_check(n, ns, 1, 1);
> +    } else if (zs == NVME_ZONE_STATE_CLOSED) {
> +        nvme_auto_transition_zone(n, ns, true, false);
> +        status = nvme_aor_check(n, ns, 0, 1);
> +    }
> +
> +    return status;
> +}
> +
>  static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint32_t nlb)
>  {
> @@ -543,7 +659,11 @@ static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
>          switch (zs) {
>          case NVME_ZONE_STATE_IMPLICITLY_OPEN:
>          case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +            nvme_aor_dec_open(n, ns);
> +            /* fall through */
>          case NVME_ZONE_STATE_CLOSED:
> +            nvme_aor_dec_active(n, ns);
> +            /* fall through */
>          case NVME_ZONE_STATE_EMPTY:
>              break;
>          default:
> @@ -553,7 +673,10 @@ static uint64_t nvme_finalize_zone_write(NvmeCtrl *n, NvmeNamespace *ns,
>      } else {
>          switch (zs) {
>          case NVME_ZONE_STATE_EMPTY:
> +            nvme_aor_inc_active(n, ns);
> +            /* fall through */
>          case NVME_ZONE_STATE_CLOSED:
> +            nvme_aor_inc_open(n, ns);
>              nvme_assign_zone_state(n, ns, zone,
>                                     NVME_ZONE_STATE_IMPLICITLY_OPEN);
>          }
> @@ -636,6 +759,11 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>                                                 zone->d.wp);
>              return NVME_ZONE_INVALID_WRITE | NVME_DNR;
>          }
> +
> +        status = nvme_auto_open_zone(n, ns, zone);
> +        if (status != NVME_SUCCESS) {
> +            return status;
> +        }
>      }
>  
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> @@ -709,6 +837,11 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>                                                     zone->d.wp);
>                  return NVME_ZONE_INVALID_WRITE | NVME_DNR;
>              }
> +
> +            status = nvme_auto_open_zone(n, ns, zone);
> +            if (status != NVME_SUCCESS) {
> +                return status;
> +            }
>          } else {
>              status = nvme_check_zone_read(n, zone, slba, nlb,
>                                            n->params.cross_zone_read);
> @@ -804,9 +937,27 @@ static uint16_t nvme_get_mgmt_zone_slba_idx(NvmeCtrl *n, NvmeNamespace *ns,
>  static uint16_t nvme_open_zone(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint8_t state)
>  {
> +    uint16_t status;
> +
>      switch (state) {
>      case NVME_ZONE_STATE_EMPTY:
> +        nvme_auto_transition_zone(n, ns, false, true);
> +        status = nvme_aor_check(n, ns, 1, 0);
> +        if (status != NVME_SUCCESS) {
> +            return status;
> +        }
> +        nvme_aor_inc_active(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_CLOSED:
> +        status = nvme_aor_check(n, ns, 0, 1);
> +        if (status != NVME_SUCCESS) {
> +            if (state == NVME_ZONE_STATE_EMPTY) {
> +                nvme_aor_dec_active(n, ns);
> +            }
> +            return status;
> +        }
> +        nvme_aor_inc_open(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
>          nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EXPLICITLY_OPEN);
>          /* fall through */
> @@ -828,6 +979,7 @@ static uint16_t nvme_close_zone(NvmeCtrl *n,  NvmeNamespace *ns,
>      switch (state) {
>      case NVME_ZONE_STATE_EXPLICITLY_OPEN:
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_aor_dec_open(n, ns);
>          nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
>          /* fall through */
>      case NVME_ZONE_STATE_CLOSED:
> @@ -849,7 +1001,11 @@ static uint16_t nvme_finish_zone(NvmeCtrl *n, NvmeNamespace *ns,
>      switch (state) {
>      case NVME_ZONE_STATE_EXPLICITLY_OPEN:
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_aor_dec_open(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_CLOSED:
> +        nvme_aor_dec_active(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_EMPTY:
>          zone->d.wp = nvme_zone_wr_boundary(zone);
>          nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_FULL);
> @@ -874,7 +1030,11 @@ static uint16_t nvme_reset_zone(NvmeCtrl *n, NvmeNamespace *ns,
>      switch (state) {
>      case NVME_ZONE_STATE_EXPLICITLY_OPEN:
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        nvme_aor_dec_open(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_CLOSED:
> +        nvme_aor_dec_active(n, ns);
> +        /* fall through */
>      case NVME_ZONE_STATE_FULL:
>          zone->d.wp = zone->d.zslba;
>          nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_EMPTY);
> @@ -2412,6 +2572,15 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>      uint64_t zone_size = 0, capacity;
>      uint32_t nz;
>  
> +    if (n->params.max_open_zones < 0) {
> +        error_setg(errp, "invalid max_open_zones value");
> +        return;
> +    }
> +    if (n->params.max_active_zones < 0) {
> +        error_setg(errp, "invalid max_active_zones value");
> +        return;
> +    }
> +
>      if (n->params.zone_size) {
>          zone_size = n->params.zone_size;
>      } else {
> @@ -2435,6 +2604,14 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>      n->num_zones = nz;
>      n->zone_array_size = sizeof(NvmeZone) * nz;
>  
> +    /* Make sure that the values of all Zoned Command Set properties are sane */
> +    if (n->params.max_open_zones > nz) {
> +        n->params.max_open_zones = nz;
> +    }
> +    if (n->params.max_active_zones > nz) {
> +        n->params.max_active_zones = nz;
> +    }

As Alistair already pointed out, a warning would be nice.

> +
>      return;
>  }
>  
> @@ -2452,8 +2629,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
>      ns->id_ns_zoned = g_malloc0(sizeof(*ns->id_ns_zoned));
>  
>      /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
> -    ns->id_ns_zoned->mar = 0xffffffff;
> -    ns->id_ns_zoned->mor = 0xffffffff;
> +    ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
> +    ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
>      ns->id_ns_zoned->zoc = 0;
>      ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
>  
> @@ -2813,6 +2990,8 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
>      DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),
>      DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
> +    DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
> +    DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),

max_active and max_open should be unsigned. 0xfffffffe is a valid value
for MAR/MOR.
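
I.e., something like (sketch; the corresponding NvmeParams fields would
become uint32_t as well):

    DEFINE_PROP_UINT32("max_active", NvmeCtrl, params.max_active_zones, 0),
    DEFINE_PROP_UINT32("max_open", NvmeCtrl, params.max_open_zones, 0),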

>      DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
>      DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
>      DEFINE_PROP_END_OF_LIST(),
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 2c932b5e29..f5a4679702 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -19,6 +19,8 @@ typedef struct NvmeParams {
>      uint32_t    zamds_bs;
>      uint64_t    zone_size;
>      uint64_t    zone_capacity;
> +    int32_t     max_active_zones;
> +    int32_t     max_open_zones;
>  } NvmeParams;
>  
>  typedef struct NvmeAsyncEvent {
> @@ -103,6 +105,8 @@ typedef struct NvmeNamespace {
>      NvmeZoneList    *imp_open_zones;
>      NvmeZoneList    *closed_zones;
>      NvmeZoneList    *full_zones;
> +    int32_t         nr_open_zones;
> +    int32_t         nr_active_zones;
>  } NvmeNamespace;
>  
>  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 14/18] hw/block/nvme: Generate zone AENs
  2020-06-17 21:34 ` [PATCH v2 14/18] hw/block/nvme: Generate zone AENs Dmitry Fomichev
@ 2020-07-01 11:44   ` Klaus Jensen
  0 siblings, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-07-01 11:44 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> Added an optional Boolean "zone_async_events" property to the driver.
> Once it's turned on, the namespace will send "Zone Descriptor
> Changed" asynchronous events to the host in particular situations
> defined by the protocol. In order to clear these AENs, the host needs
> to read the newly added Changed Zones Log.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

This was a tough review ;)


  * I don't like the monkey patching of the completion queue path to
    handle AERs, and it took me way too much time to figure out what was
    going on with the extra timer_mod's on the cq->timer.

    Please consider taking a look at

      https://github.com/birkelund/qemu/commit/928a6ead98ba3b0a293d90496c3fa54d51a052a5

    which is already reviewed and gets AERs right I think. But if my
    v1.3 series is merged, that will be in-tree anyway.

  * The RRL and FRL delays and limits can be handled using a single
    timer, like I'm doing here in my version of the ZNS emulation:

      https://github.com/birkelund/qemu/blob/for-master/nvme/hw/block/nvme-ns.c#L52

    This is infinitely more efficient since it removes the need for
    continuously kicking the event loop every 10ms. And this patch
    *really* needs to get rid of that polling ;)
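
    The rough idea, with hypothetical names (one timer per namespace,
    armed only for the nearest pending RRL/FRL deadline):

        static void nvme_ns_rearm_zone_timer(NvmeNamespace *ns,
                                             int64_t deadline_ns)
        {
            /* zone_timer/zone_deadline_ns are illustrative fields */
            if (!timer_pending(ns->zone_timer) ||
                deadline_ns < ns->zone_deadline_ns) {
                ns->zone_deadline_ns = deadline_ns;
                timer_mod(ns->zone_timer, deadline_ns);
            }
        }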


More comments inline.


> ---
>  hw/block/nvme.c      | 300 ++++++++++++++++++++++++++++++++++++++++++-
>  hw/block/nvme.h      |  13 +-
>  include/block/nvme.h |  23 +++-
>  3 files changed, 328 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index c3898448c7..b9135a6b1f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -201,12 +201,66 @@ static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
>      assert(ns->nr_active_zones >= 0);
>  }
>  
> +static bool nvme_complete_async_req(NvmeCtrl *n, NvmeNamespace *ns,
> +    enum NvmeAsyncEventType type, uint8_t info)
> +{
> +    NvmeAsyncEvent *ae;
> +    uint32_t nsid = 0;
> +    uint8_t log_page = 0;
> +
> +    switch (type) {
> +    case NVME_AER_TYPE_ERROR:
> +    case NVME_AER_TYPE_SMART:
> +        break;
> +    case NVME_AER_TYPE_NOTICE:
> +        switch (info) {
> +        case NVME_AER_NOTICE_ZONE_DESCR_CHANGED:
> +            log_page = NVME_LOG_ZONE_CHANGED_LIST;
> +            nsid = ns->nsid;
> +            if (!(n->ae_cfg & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES)) {
> +                trace_pci_nvme_zone_ae_not_enabled(info, log_page, nsid);
> +                return false;
> +            }
> +            if (ns->aen_pending) {
> +                trace_pci_nvme_zone_ae_not_cleared(info, log_page, nsid);
> +                return false;
> +            }
> +            ns->aen_pending = true;
> +        }
> +        break;
> +    case NVME_AER_TYPE_CMDSET_SPECIFIC:
> +    case NVME_AER_TYPE_VENDOR_SPECIFIC:
> +        break;
> +    }
> +
> +    ae = g_malloc0(sizeof(*ae));
> +    ae->res = type;
> +    ae->res |= (info << 8) & 0xff00;
> +    ae->res |= (log_page << 16) & 0xff0000;
> +    ae->nsid = nsid;
> +
> +    QTAILQ_INSERT_TAIL(&n->async_reqs, ae, entry);
> +    timer_mod(n->admin_cq.timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +    return true;
> +}
> +
> +static inline void nvme_notify_zone_changed(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone)
> +{
> +    if (n->ae_cfg) {
> +        zone->flags |= NVME_ZFLAGS_AEN_PEND;
> +        nvme_complete_async_req(n, ns, NVME_AER_TYPE_NOTICE,
> +                                NVME_AER_NOTICE_ZONE_DESCR_CHANGED);
> +    }
> +}
> +
>  static void nvme_set_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
>  {
>      assert(zone->flags & NVME_ZFLAGS_SET_RZR);
>      zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
>      zone->d.za |= NVME_ZA_RESET_RECOMMENDED;
> +    nvme_notify_zone_changed(n, ns, zone);
>      zone->flags &= ~NVME_ZFLAGS_SET_RZR;
>      trace_pci_nvme_zone_reset_recommended(zone->d.zslba);
>  }
> @@ -215,10 +269,14 @@ static void nvme_clear_rzr(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, bool notify)
>  {
>      if (n->params.rrl_usec) {
> -        zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
> +        zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY |
> +                         NVME_ZFLAGS_AEN_PEND);
>          notify = notify && (zone->d.za & NVME_ZA_RESET_RECOMMENDED);
>          zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
>          zone->tstamp = 0;
> +        if (notify) {
> +            nvme_notify_zone_changed(n, ns, zone);
> +        }
>      }
>  }
>  
> @@ -228,6 +286,7 @@ static void nvme_set_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
>      zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
>      zone->d.za |= NVME_ZA_FINISH_RECOMMENDED;
> +    nvme_notify_zone_changed(n, ns, zone);
>      zone->flags &= ~NVME_ZFLAGS_SET_FZR;
>      trace_pci_nvme_zone_finish_recommended(zone->d.zslba);
>  }
> @@ -236,13 +295,61 @@ static void nvme_clear_fzr(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, bool notify)
>  {
>      if (n->params.frl_usec) {
> -        zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
> +        zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY |
> +                         NVME_ZFLAGS_AEN_PEND);
>          notify = notify && (zone->d.za & NVME_ZA_FINISH_RECOMMENDED);
>          zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
>          zone->tstamp = 0;
> +        if (notify) {
> +            nvme_notify_zone_changed(n, ns, zone);
> +        }
>      }
>  }
>  
> +static bool nvme_process_rrl(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
> +{
> +    if (zone->flags & NVME_ZFLAGS_SET_RZR) {
> +        if (zone->flags & NVME_ZFLAGS_TS_DELAY) {
> +            assert(!(zone->d.za & NVME_ZA_RESET_RECOMMENDED));
> +            if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
> +                n->params.rzr_delay_usec) {
> +                nvme_set_rzr(n, ns, zone);
> +                return true;
> +            }
> +        } else if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
> +                   n->params.rrl_usec) {
> +            assert(zone->d.za & NVME_ZA_RESET_RECOMMENDED);
> +            nvme_clear_rzr(n, ns, zone, true);
> +            trace_pci_nvme_zone_reset_internal_op(zone->d.zslba);
> +            return true;
> +        }
> +    }
> +
> +    return false;
> +}
> +
> +static bool nvme_process_frl(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
> +{
> +    if (zone->flags & NVME_ZFLAGS_SET_FZR) {
> +        if (zone->flags & NVME_ZFLAGS_TS_DELAY) {
> +            assert(!(zone->d.za & NVME_ZA_FINISH_RECOMMENDED));
> +            if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
> +                n->params.fzr_delay_usec) {
> +                nvme_set_fzr(n, ns, zone);
> +                return true;
> +            }
> +        } else if (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - zone->tstamp >=
> +                   n->params.frl_usec) {
> +            assert(zone->d.za & NVME_ZA_FINISH_RECOMMENDED);
> +            nvme_clear_fzr(n, ns, zone, true);
> +            trace_pci_nvme_zone_finish_internal_op(zone->d.zslba);
> +            return true;
> +        }
> +    }
> +
> +    return false;
> +}
> +
>  static void nvme_schedule_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
>  {
>      if (n->params.frl_usec) {
> @@ -279,6 +386,48 @@ static void nvme_schedule_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
>      }
>  }
>  
> +static void nvme_observe_ns_zone_time_limits(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    NvmeZone *zone;
> +
> +    if (n->params.frl_usec) {
> +        for (zone = nvme_peek_zone_head(ns, ns->closed_zones);
> +             zone;
> +             zone = nvme_next_zone_in_list(ns, zone, ns->closed_zones)) {
> +            nvme_process_frl(n, ns, zone);
> +        }
> +
> +        for (zone = nvme_peek_zone_head(ns, ns->imp_open_zones);
> +             zone;
> +             zone = nvme_next_zone_in_list(ns, zone, ns->imp_open_zones)) {
> +            nvme_process_frl(n, ns, zone);
> +        }
> +
> +        for (zone = nvme_peek_zone_head(ns, ns->exp_open_zones);
> +             zone;
> +             zone = nvme_next_zone_in_list(ns, zone, ns->exp_open_zones)) {
> +            nvme_process_frl(n, ns, zone);
> +        }
> +    }
> +
> +    if (n->params.rrl_usec) {
> +        for (zone = nvme_peek_zone_head(ns, ns->full_zones);
> +             zone;
> +             zone = nvme_next_zone_in_list(ns, zone, ns->full_zones)) {
> +            nvme_process_rrl(n, ns, zone);
> +        }
> +    }
> +}
> +
> +static void nvme_observe_zone_time_limits(NvmeCtrl *n)
> +{
> +    int i;
> +
> +    for (i = 0; i < n->num_namespaces; i++) {
> +        nvme_observe_ns_zone_time_limits(n, &n->namespaces[i]);
> +    }
> +}
> +
>  static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint8_t state)
>  {
> @@ -563,6 +712,7 @@ static void nvme_post_cqes(void *opaque)
>      NvmeCQueue *cq = opaque;
>      NvmeCtrl *n = cq->ctrl;
>      NvmeRequest *req, *next;
> +    NvmeAsyncEvent *ae;
>  
>      QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
>          NvmeSQueue *sq;
> @@ -572,8 +722,26 @@ static void nvme_post_cqes(void *opaque)
>              break;
>          }
>  
> +        ae = NULL;
> +        if (req->flags & NVME_REQ_FLG_AER) {
> +            if (likely(QTAILQ_EMPTY(&n->async_reqs))) {
> +                continue;
> +            } else {
> +                ae = QTAILQ_FIRST(&n->async_reqs);
> +                QTAILQ_REMOVE(&n->async_reqs, ae, entry);
> +            }
> +        }

Since AERs are kept in the completion queue req_list, they simply linger
there if there is nothing to complete and we have to iterate over them
on every invocation of nvme_post_cqes. And since you are kicking the
timer every 10ms, this is a lot of work for mostly nothing.

> +
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
>          sq = req->sq;
> +        if (unlikely(ae)) {
> +            assert(!sq->sqid);
> +            req->cqe.ae.info = cpu_to_le32(ae->res);
> +            req->cqe.ae.nsid = cpu_to_le32(ae->nsid);
> +            g_free(ae);
> +            assert(n->nr_aers);
> +            n->nr_aers--;
> +        }
>  
>          req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
>          req->cqe.sq_id = cpu_to_le16(sq->sqid);
> @@ -587,6 +755,15 @@ static void nvme_post_cqes(void *opaque)
>      if (cq->tail != cq->head) {
>          nvme_irq_assert(n, cq);
>      }
> +
> +    if (cq == &n->admin_cq &&
> +        n->params.zoned && n->params.zone_async_events) {
> +        nvme_observe_zone_time_limits(n);
> +        if (timer_expired(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL))) {
> +            timer_mod(cq->timer,
> +                      qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 10 * SCALE_MS);
> +        }
> +    }

I don't like this polling on the admin queue to check the limits.

>  }
>  
>  static void nvme_fill_data(QEMUSGList *qsg, QEMUIOVector *iov,
> @@ -618,7 +795,9 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>      assert(cq->cqid == req->sq->cqid);
>      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
> -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +    if (!(req->flags & NVME_REQ_FLG_AER)) {
> +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> +    }
>  }
>  
>  static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
> @@ -643,6 +822,7 @@ static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
>              zone->d.za |= NVME_ZA_FINISHED_BY_CTLR;
>              zone->flags = 0;
>              zone->tstamp = 0;
> +            nvme_notify_zone_changed(n, ns, zone);
>              trace_pci_nvme_zone_finished_by_controller(zone->d.zslba);
>          }
>      }
> @@ -1978,6 +2158,10 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_get_feature_timestamp(n, cmd);
> +    case NVME_ASYNCHRONOUS_EVENT_CONF:
> +        result = cpu_to_le32(n->ae_cfg);
> +        trace_pci_nvme_getfeat_aen_cfg(result);
> +        break;
>      case NVME_COMMAND_SET_PROFILE:
>          result = 0;
>          break;
> @@ -2029,6 +2213,19 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_set_feature_timestamp(n, cmd);
>          break;
>  
> +    case NVME_ASYNCHRONOUS_EVENT_CONF:
> +        if (dw11 & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES) {
> +            if (!(n->ae_cfg & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES)) {
> +                trace_pci_nvme_zone_aen_not_requested(dw11);
> +            } else {
> +                trace_pci_nvme_setfeat_zone_info_aer_on();
> +            }
> +        } else if (n->ae_cfg & NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES) {
> +            trace_pci_nvme_setfeat_zone_info_aer_off();
> +            n->ae_cfg &= ~NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES;
> +        }
> +        break;
> +
>      case NVME_COMMAND_SET_PROFILE:
>          if (dw11 & 0x1ff) {
>              trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
> @@ -2043,6 +2240,18 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_async_req(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    if (n->nr_aers >= NVME_MAX_ASYNC_EVENTS) {
> +        return NVME_AER_LIMIT_EXCEEDED | NVME_DNR;
> +    }
> +
> +    assert(!(req->flags & NVME_REQ_FLG_AER));
> +    req->flags |= NVME_REQ_FLG_AER;
> +    n->nr_aers++;
> +    return NVME_SUCCESS;

Yuck. Don't return NVME_SUCCESS and monkey patch the completion path
like you do above; it feels hacky. Just queue up the request in a list
and return NVME_NO_COMPLETE. Then, when you have an AEN to issue, just
dequeue the oldest AER and call nvme_enqueue_req_completion.
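
Roughly (untested sketch; NVME_NO_COMPLETE is the sentinel status from my
series, and aer_queue is a hypothetical list of parked AER requests):

    static uint16_t nvme_async_req(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
    {
        if (n->nr_aers >= NVME_MAX_ASYNC_EVENTS) {
            return NVME_AER_LIMIT_EXCEEDED | NVME_DNR;
        }
        n->nr_aers++;
        /* park the request; complete it later via nvme_enqueue_req_completion() */
        QTAILQ_INSERT_TAIL(&n->aer_queue, req, entry);
        return NVME_NO_COMPLETE;
    }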

> +}
> +
>  static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
>      uint64_t prp1, uint64_t prp2, uint64_t ofs, uint32_t len, uint8_t csi)
>  {
> @@ -2068,6 +2277,7 @@ static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
>      iocs[NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
>      iocs[NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFFECTS_CSUPP;
>      iocs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
> +    iocs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
>  
>      if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
>          iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> @@ -2086,6 +2296,67 @@ static uint16_t nvme_handle_cmd_effects(NvmeCtrl *n, NvmeCmd *cmd,
>      return nvme_dma_read_prp(n, (uint8_t *)&cmd_eff_log, len, prp1, prp2);
>  }
>  
> +static uint16_t nvme_handle_changed_zone_log(NvmeCtrl *n, NvmeCmd *cmd,
> +    uint64_t prp1, uint64_t prp2, uint16_t nsid, uint64_t ofs, uint32_t len,
> +    uint8_t csi, bool rae)
> +{
> +    NvmeNamespace *ns;
> +    NvmeChangedZoneLog zc_log = {};
> +    NvmeZone *zone;
> +    uint64_t *zid_ptr = &zc_log.zone_ids[0];
> +    uint64_t *zid_end = zid_ptr + ARRAY_SIZE(zc_log.zone_ids);
> +    int i, nids = 0, num_aen_zones = 0;
> +
> +    trace_pci_nvme_changed_zone_log_read(nsid);
> +
> +    if (!n->params.zoned || !n->params.zone_async_events) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> +        trace_pci_nvme_err_invalid_ns(nsid, n->num_namespaces);
> +        return NVME_INVALID_FIELD | NVME_DNR;

This should be NVME_INVALID_NSID.
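
I.e.:

    return NVME_INVALID_NSID | NVME_DNR;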

> +    }
> +    ns = &n->namespaces[nsid - 1];
> +    if (csi != ns->csi) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }

I don't think the TP 4056 requires CSI to be set. It's only used for the
effects log page.

> +
> +    if (ofs != 0) {
> +        trace_pci_nvme_err_invalid_changed_zone_list_offset(ofs);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }

It might be weird that the host reads at an offset on this dynamic log
page, but it's not invalid. The offset should not be larger than the size
of the log page though.

> +    if (len != sizeof(zc_log)) {
> +        trace_pci_nvme_err_invalid_changed_zone_list_len(len);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }

"The host *should* read the entire page ..". Again, it might be stupid,
but it is not invalid to read more or less.
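
Combined with the offset point above, something like this (sketch, reusing
the names from the patch):

    if (ofs > sizeof(zc_log)) {
        return NVME_INVALID_FIELD | NVME_DNR;
    }
    len = MIN(len, sizeof(zc_log) - ofs);
    /* ... fill zc_log as before ... */
    return nvme_dma_read_prp(n, (uint8_t *)&zc_log + ofs, len, prp1, prp2);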

> +
> +    zone = ns->zone_array;
> +    for (i = 0; i < n->num_zones && zid_ptr < zid_end; i++, zone++) {
> +        if (!(zone->flags & NVME_ZFLAGS_AEN_PEND)) {
> +            continue;
> +        }
> +        num_aen_zones++;
> +        if (zone->d.za) {
> +            trace_pci_nvme_reporting_changed_zone(zone->d.zslba, zone->d.za);
> +            *zid_ptr++ = cpu_to_le64(zone->d.zslba);
> +            nids++;
> +        }

Hmm. So a zone is only included if it has an attribute set? What about
when the controller has cleared the RZR attribute? That should also
be reflected here.

> +        if (!rae) {
> +            zone->flags &= ~NVME_ZFLAGS_AEN_PEND;
> +        }

I'm not sure the semantics around RAE are correct here. It doesn't really
have anything to do with the individual zone flags. Even though
multiple zones have changed state and may cause multiple Zone Descriptor
Change events to be generated internally, only one should result in an
AER being completed. The event is then masked until the associated log
page is read with RAE set to zero.
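
Conceptually (sketch, reusing names from the patch):

    /* on a zone change: complete at most one AER while an event is pending */
    if (!ns->aen_pending) {
        ns->aen_pending = true;
        nvme_complete_async_req(n, ns, NVME_AER_TYPE_NOTICE,
                                NVME_AER_NOTICE_ZONE_DESCR_CHANGED);
    }

    /* in the log handler: unmask only when the page is read with RAE == 0 */
    if (!rae) {
        ns->aen_pending = false;
    }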

> +    }
> +
> +    if (num_aen_zones && !nids) {
> +        trace_pci_nvme_empty_changed_zone_list();
> +        nids = 0xffff;
> +    }

It doesn't look like the case of more than 511 changed zones is handled?
In that case the remainder of the list *shall* be zero filled.
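
Something like (sketch; assumes the loop keeps counting changed zones past
the 511th entry):

    if (num_aen_zones > ARRAY_SIZE(zc_log.zone_ids)) {
        /* too many changed zones: report 0xffff and zero fill the list */
        memset(zc_log.zone_ids, 0, sizeof(zc_log.zone_ids));
        nids = 0xffff;
    }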

> +    zc_log.nr_zone_ids = cpu_to_le16(nids);
> +    ns->aen_pending = false;
> +
> +    return nvme_dma_read_prp(n, (uint8_t *)&zc_log, len, prp1, prp2);
> +}
> +
>  static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
>  {
>      uint64_t prp1 = le64_to_cpu(cmd->prp1);
> @@ -2095,9 +2366,11 @@ static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
>      uint64_t dw12 = le32_to_cpu(cmd->cdw12);
>      uint64_t dw13 = le32_to_cpu(cmd->cdw13);
>      uint64_t ofs = (dw13 << 32) | dw12;
> +    uint32_t nsid = le32_to_cpu(cmd->nsid);
>      uint32_t numdl, numdu, len;
>      uint16_t lid = dw10 & 0xff;
>      uint8_t csi = le32_to_cpu(cmd->cdw14) >> 24;
> +    bool rae = !!(dw10 & (1 << 15));
>  
>      numdl = dw10 >> 16;
>      numdu = dw11 & 0xffff;
> @@ -2106,6 +2379,9 @@ static uint16_t nvme_get_log_page(NvmeCtrl *n, NvmeCmd *cmd)
>      switch (lid) {
>      case NVME_LOG_CMD_EFFECTS:
>          return nvme_handle_cmd_effects(n, cmd, prp1, prp2, ofs, len, csi);
> +    case NVME_LOG_ZONE_CHANGED_LIST:
> +        return nvme_handle_changed_zone_log(n, cmd, prp1, prp2, nsid,
> +                                            ofs, len, csi, rae);
>       }
>  
>      trace_pci_nvme_unsupported_log_page(lid);
> @@ -2131,6 +2407,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_get_feature(n, cmd, req);
>      case NVME_ADM_CMD_GET_LOG_PAGE:
>          return nvme_get_log_page(n, cmd);
> +    case NVME_ADM_CMD_ASYNC_EV_REQ:
> +        return nvme_async_req(n, cmd, req);
>      default:
>          trace_pci_nvme_err_invalid_admin_opc(cmd->opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
> @@ -2171,6 +2449,7 @@ static void nvme_process_sq(void *opaque)
>  
>  static void nvme_clear_ctrl(NvmeCtrl *n)
>  {
> +    NvmeAsyncEvent *ae_entry, *next;
>      int i;
>  
>      blk_drain(n->conf.blk);
> @@ -2186,6 +2465,11 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>          }
>      }
>  
> +    QTAILQ_FOREACH_SAFE(ae_entry, &n->async_reqs, entry, next) {
> +        g_free(ae_entry);
> +    }
> +    n->nr_aers = 0;
> +
>      blk_flush(n->conf.blk);
>      n->bar.cc = 0;
>  }
> @@ -2290,6 +2574,9 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>  
>      nvme_set_timestamp(n, 0ULL);
>  
> +    QTAILQ_INIT(&n->async_reqs);
> +    n->nr_aers = 0;
> +
>      return 0;
>  }
>  
> @@ -2724,6 +3011,10 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>          n->params.max_active_zones = nz;
>      }
>  
> +    if (n->params.zone_async_events) {
> +        n->ae_cfg |= NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES;
> +    }
> +
>      return;
>  }
>  
> @@ -2993,6 +3284,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      id->ieee[1] = 0x02;
>      id->ieee[2] = 0xb3;
>      id->oacs = cpu_to_le16(0);
> +    id->oaes = cpu_to_le32(n->ae_cfg);

I don't see why this can't always be supported. The host still has to
request it with the AEC feature for it to become active (assuming a
default of 0 for the AEC feature).
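
I.e., something like (sketch):

    /* always advertise the capability; the AEC feature (default 0) gates it */
    id->oaes = cpu_to_le32(NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES);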

>      id->frmw = 7 << 1;
>      id->lpa = 1 << 1;
>      id->sqes = (0x6 << 4) | 0x6;
> @@ -3111,6 +3403,8 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT64("finish_rcmnd_delay", NvmeCtrl,
>                         params.fzr_delay_usec, 0),
>      DEFINE_PROP_UINT64("finish_rcmnd_limit", NvmeCtrl, params.frl_usec, 0),
> +    DEFINE_PROP_BOOL("zone_async_events", NvmeCtrl, params.zone_async_events,
> +                     true),
>      DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
>      DEFINE_PROP_BOOL("active_excursions", NvmeCtrl, params.active_excursions,
>                       false),
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index be1920f1ef..e63f7736d7 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -3,6 +3,7 @@
>  
>  #include "block/nvme.h"
>  
> +#define NVME_MAX_ASYNC_EVENTS    16
>  #define NVME_DEFAULT_ZONE_SIZE   128 /* MiB */
>  #define NVME_DEFAULT_MAX_ZA_SIZE 128 /* KiB */
>  
> @@ -15,6 +16,7 @@ typedef struct NvmeParams {
>  
>      bool        zoned;
>      bool        cross_zone_read;
> +    bool        zone_async_events;
>      bool        active_excursions;
>      uint8_t     fill_pattern;
>      uint32_t    zamds_bs;
> @@ -29,13 +31,16 @@ typedef struct NvmeParams {
>  } NvmeParams;
>  
>  typedef struct NvmeAsyncEvent {
> -    QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
> +    QTAILQ_ENTRY(NvmeAsyncEvent) entry;
> +    uint32_t                     res;
> +    uint32_t                     nsid;
>  } NvmeAsyncEvent;
>  
>  enum NvmeRequestFlags {
>      NVME_REQ_FLG_HAS_SG   = 1 << 0,
>      NVME_REQ_FLG_FILL     = 1 << 1,
>      NVME_REQ_FLG_APPEND   = 1 << 2,
> +    NVME_REQ_FLG_AER      = 1 << 3,
>  };
>  
>  typedef struct NvmeRequest {
> @@ -85,6 +90,7 @@ enum NvmeZoneFlags {
>      NVME_ZFLAGS_TS_DELAY = 1 << 0,
>      NVME_ZFLAGS_SET_RZR  = 1 << 1,
>      NVME_ZFLAGS_SET_FZR  = 1 << 2,
> +    NVME_ZFLAGS_AEN_PEND = 1 << 3,
>  };
>  
>  typedef struct NvmeZone {
> @@ -119,6 +125,7 @@ typedef struct NvmeNamespace {
>      NvmeZoneList    *full_zones;
>      int32_t         nr_open_zones;
>      int32_t         nr_active_zones;
> +    bool            aen_pending;
>  } NvmeNamespace;
>  
>  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> @@ -173,6 +180,10 @@ typedef struct NvmeCtrl {
>      NvmeSQueue      admin_sq;
>      NvmeCQueue      admin_cq;
>      NvmeIdCtrl      id_ctrl;
> +
> +    QTAILQ_HEAD(, NvmeAsyncEvent) async_reqs;
> +    uint32_t        nr_aers;
> +    uint32_t        ae_cfg;
>  } NvmeCtrl;
>  
>  /* calculate the number of LBAs that the namespace can accomodate */
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 596c39162b..e06fb97337 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -633,16 +633,22 @@ enum NvmeAsyncErrorInfo {
>  
>  enum NvmeAsyncNoticeInfo {
>      NVME_AER_NOTICE_NS_CHANGED              = 0x00,
> +    NVME_AER_NOTICE_ZONE_DESCR_CHANGED      = 0xef,
>  };
>  
>  enum NvmeAsyncEventCfg {
>      NVME_AEN_CFG_NS_ATTR                    = 1 << 8,
> +    NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES   = 1 << 27,
>  };
>  
>  typedef struct NvmeCqe {
>      union {
>          uint64_t     result64;
>          uint32_t     result32;
> +        struct {
> +            uint32_t info;
> +            uint32_t nsid;
> +        } ae;
>      };
>      uint16_t    sq_head;
>      uint16_t    sq_id;
> @@ -778,11 +784,19 @@ enum {
>     NVME_CMD_EFFECTS_UUID_SEL          = 1 << 19,
>  };
>  
> +typedef struct NvmeChangedZoneLog {
> +    uint16_t    nr_zone_ids;
> +    uint8_t     rsvd2[6];
> +    uint64_t    zone_ids[511];
> +} NvmeChangedZoneLog;
> +
>  enum LogIdentifier {
> -    NVME_LOG_ERROR_INFO     = 0x01,
> -    NVME_LOG_SMART_INFO     = 0x02,
> -    NVME_LOG_FW_SLOT_INFO   = 0x03,
> -    NVME_LOG_CMD_EFFECTS    = 0x05,
> +    NVME_LOG_ERROR_INFO               = 0x01,
> +    NVME_LOG_SMART_INFO               = 0x02,
> +    NVME_LOG_FW_SLOT_INFO             = 0x03,
> +    NVME_LOG_CHANGED_NS_LIST          = 0x04,
> +    NVME_LOG_CMD_EFFECTS              = 0x05,
> +    NVME_LOG_ZONE_CHANGED_LIST        = 0xbf,
>  };
>  
>  typedef struct NvmePSD {
> @@ -1097,6 +1111,7 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZoned) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeChangedZoneLog) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescr) != 64);
>  }
>  #endif
> -- 
> 2.21.0
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v2 13/18] hw/block/nvme: Set Finish/Reset Zone Recommended attributes
  2020-06-17 21:34 ` [PATCH v2 13/18] hw/block/nvme: Set Finish/Reset Zone Recommended attributes Dmitry Fomichev
@ 2020-07-01 16:23   ` Klaus Jensen
  0 siblings, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-07-01 16:23 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> Added logic to set and reset FZR and RZR zone attributes. Four new
> driver properties are added to control the timing of setting and
> resetting these attributes. FZR/RZR delay lasts from the zone
> operation and until when the corresponding zone attribute is set.
> FZR/RZR limits set the time period between setting FZR or RZR
> attribute and resetting it simulating the internal controller action
> on that zone.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Please correct me if I am wrong here, but I want to raise a question
about the use of QEMU_CLOCK_REALTIME here. I agree that it makes sense
that the limits are "absolute", but does this hold for emulation? In my
view, when emulation is stopped, the world is stopped. Should we emulate
the need for background operations in this case? I don't think so.
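
I.e., the virtual clock stops with the guest (sketch):

    zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);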

> ---
>  hw/block/nvme.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/block/nvme.h | 13 ++++++-
>  2 files changed, 111 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index a29cbfcc96..c3898448c7 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -201,6 +201,84 @@ static inline void nvme_aor_dec_active(NvmeCtrl *n, NvmeNamespace *ns)
>      assert(ns->nr_active_zones >= 0);
>  }
>  
> +static void nvme_set_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
> +{
> +    assert(zone->flags & NVME_ZFLAGS_SET_RZR);
> +    zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +    zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
> +    zone->d.za |= NVME_ZA_RESET_RECOMMENDED;
> +    zone->flags &= ~NVME_ZFLAGS_SET_RZR;
> +    trace_pci_nvme_zone_reset_recommended(zone->d.zslba);
> +}
> +
> +static void nvme_clear_rzr(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, bool notify)
> +{
> +    if (n->params.rrl_usec) {
> +        zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
> +        notify = notify && (zone->d.za & NVME_ZA_RESET_RECOMMENDED);
> +        zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
> +        zone->tstamp = 0;
> +    }
> +}
> +
> +static void nvme_set_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
> +{
> +    assert(zone->flags & NVME_ZFLAGS_SET_FZR);
> +    zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +    zone->flags &= ~NVME_ZFLAGS_TS_DELAY;
> +    zone->d.za |= NVME_ZA_FINISH_RECOMMENDED;
> +    zone->flags &= ~NVME_ZFLAGS_SET_FZR;
> +    trace_pci_nvme_zone_finish_recommended(zone->d.zslba);
> +}
> +
> +static void nvme_clear_fzr(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, bool notify)
> +{
> +    if (n->params.frl_usec) {
> +        zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
> +        notify = notify && (zone->d.za & NVME_ZA_FINISH_RECOMMENDED);
> +        zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
> +        zone->tstamp = 0;
> +    }
> +}
> +
> +static void nvme_schedule_rzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
> +{
> +    if (n->params.frl_usec) {
> +        zone->flags &= ~(NVME_ZFLAGS_SET_FZR | NVME_ZFLAGS_TS_DELAY);
> +        zone->d.za &= ~NVME_ZA_FINISH_RECOMMENDED;
> +        zone->tstamp = 0;
> +    }
> +    if (n->params.rrl_usec) {
> +        zone->flags |= NVME_ZFLAGS_SET_RZR;
> +        if (n->params.rzr_delay_usec) {
> +            zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +            zone->flags |= NVME_ZFLAGS_TS_DELAY;
> +        } else {
> +            nvme_set_rzr(n, ns, zone);
> +        }
> +    }
> +}
> +
> +static void nvme_schedule_fzr(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone)
> +{
> +    if (n->params.rrl_usec) {
> +        zone->flags &= ~(NVME_ZFLAGS_SET_RZR | NVME_ZFLAGS_TS_DELAY);
> +        zone->d.za &= ~NVME_ZA_RESET_RECOMMENDED;
> +        zone->tstamp = 0;
> +    }
> +    if (n->params.frl_usec) {
> +        zone->flags |= NVME_ZFLAGS_SET_FZR;
> +        if (n->params.fzr_delay_usec) {
> +            zone->tstamp = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +            zone->flags |= NVME_ZFLAGS_TS_DELAY;
> +        } else {
> +            nvme_set_fzr(n, ns, zone);
> +        }
> +    }
> +}
> +
>  static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint8_t state)
>  {
> @@ -208,15 +286,19 @@ static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
>          switch (nvme_get_zone_state(zone)) {
>          case NVME_ZONE_STATE_EXPLICITLY_OPEN:
>              nvme_remove_zone(n, ns, ns->exp_open_zones, zone);
> +            nvme_clear_fzr(n, ns, zone, false);
>              break;
>          case NVME_ZONE_STATE_IMPLICITLY_OPEN:
>              nvme_remove_zone(n, ns, ns->imp_open_zones, zone);
> +            nvme_clear_fzr(n, ns, zone, false);
>              break;
>          case NVME_ZONE_STATE_CLOSED:
>              nvme_remove_zone(n, ns, ns->closed_zones, zone);
> +            nvme_clear_fzr(n, ns, zone, false);
>              break;
>          case NVME_ZONE_STATE_FULL:
>              nvme_remove_zone(n, ns, ns->full_zones, zone);
> +            nvme_clear_rzr(n, ns, zone, false);
>          }
>     }
>  
> @@ -225,15 +307,19 @@ static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
>      switch (state) {
>      case NVME_ZONE_STATE_EXPLICITLY_OPEN:
>          nvme_add_zone_tail(n, ns, ns->exp_open_zones, zone);
> +        nvme_schedule_fzr(n, ns, zone);
>          break;
>      case NVME_ZONE_STATE_IMPLICITLY_OPEN:
>          nvme_add_zone_tail(n, ns, ns->imp_open_zones, zone);
> +        nvme_schedule_fzr(n, ns, zone);
>          break;
>      case NVME_ZONE_STATE_CLOSED:
>          nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
> +        nvme_schedule_fzr(n, ns, zone);
>          break;
>      case NVME_ZONE_STATE_FULL:
>          nvme_add_zone_tail(n, ns, ns->full_zones, zone);
> +        nvme_schedule_rzr(n, ns, zone);
>          break;
>      default:
>          zone->d.za = 0;
> @@ -555,6 +641,7 @@ static void nvme_auto_transition_zone(NvmeCtrl *n, NvmeNamespace *ns,
>              zone->d.za &= ~(NVME_ZA_FINISH_RECOMMENDED |
>                              NVME_ZA_RESET_RECOMMENDED);
>              zone->d.za |= NVME_ZA_FINISHED_BY_CTLR;
> +            zone->flags = 0;
>              zone->tstamp = 0;
>              trace_pci_nvme_zone_finished_by_controller(zone->d.zslba);
>          }
> @@ -2624,6 +2711,11 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>      n->num_zones = nz;
>      n->zone_array_size = sizeof(NvmeZone) * nz;
>  
> +    n->params.rzr_delay_usec *= SCALE_MS;
> +    n->params.rrl_usec *= SCALE_MS;
> +    n->params.fzr_delay_usec *= SCALE_MS;
> +    n->params.frl_usec *= SCALE_MS;
> +

I would prefer that user-given parameters are not changed like this.
Setting defaults for various reasons is OK, but when the meaning of
the parameter changes (like the scale), it's confusing. I would
suggest that the namespace gets the set of *_usec members and that
the parameters are named without the usec suffix.
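
A rough sketch of that split, with hypothetical member names rather
than code from the series; the property keeps the user's unit and is
never rewritten, while the derived value lives elsewhere:

    #include <stdint.h>

    typedef struct NvmeParams {
        uint64_t rrl;                 /* as given on the command line, ms */
    } NvmeParams;

    typedef struct NvmeNamespace {
        uint64_t rrl_usec;            /* derived once at init */
    } NvmeNamespace;

    static void nvme_ns_init_timing(NvmeNamespace *ns, const NvmeParams *p)
    {
        ns->rrl_usec = p->rrl * 1000; /* ms -> usec; p->rrl untouched */
    }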

>      /* Make sure that the values of all Zoned Command Set properties are sane */
>      if (n->params.max_open_zones > nz) {
>          n->params.max_open_zones = nz;
> @@ -2651,6 +2743,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
>      /* MAR/MOR are zeroes-based, 0xffffffff means no limit */
>      ns->id_ns_zoned->mar = cpu_to_le32(n->params.max_active_zones - 1);
>      ns->id_ns_zoned->mor = cpu_to_le32(n->params.max_open_zones - 1);
> +    ns->id_ns_zoned->rrl = cpu_to_le32(n->params.rrl_usec / (1000 * SCALE_MS));
> +    ns->id_ns_zoned->frl = cpu_to_le32(n->params.frl_usec / (1000 * SCALE_MS));
>      ns->id_ns_zoned->zoc = cpu_to_le16(n->params.active_excursions ? 0x2 : 0);
>      ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
>  
> @@ -3012,6 +3106,11 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
>      DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
>      DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
> +    DEFINE_PROP_UINT64("reset_rcmnd_delay", NvmeCtrl, params.rzr_delay_usec, 0),
> +    DEFINE_PROP_UINT64("reset_rcmnd_limit", NvmeCtrl, params.rrl_usec, 0),
> +    DEFINE_PROP_UINT64("finish_rcmnd_delay", NvmeCtrl,
> +                       params.fzr_delay_usec, 0),
> +    DEFINE_PROP_UINT64("finish_rcmnd_limit", NvmeCtrl, params.frl_usec, 0),
>      DEFINE_PROP_BOOL("cross_zone_read", NvmeCtrl, params.cross_zone_read, true),
>      DEFINE_PROP_BOOL("active_excursions", NvmeCtrl, params.active_excursions,
>                       false),
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 8a0aaeb09a..be1920f1ef 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -22,6 +22,10 @@ typedef struct NvmeParams {
>      uint64_t    zone_capacity;
>      int32_t     max_active_zones;
>      int32_t     max_open_zones;
> +    uint64_t    rzr_delay_usec;
> +    uint64_t    rrl_usec;
> +    uint64_t    fzr_delay_usec;
> +    uint64_t    frl_usec;
>  } NvmeParams;
>  
>  typedef struct NvmeAsyncEvent {
> @@ -77,12 +81,19 @@ typedef struct NvmeCQueue {
>      QTAILQ_HEAD(, NvmeRequest) req_list;
>  } NvmeCQueue;
>  
> +enum NvmeZoneFlags {
> +    NVME_ZFLAGS_TS_DELAY = 1 << 0,
> +    NVME_ZFLAGS_SET_RZR  = 1 << 1,
> +    NVME_ZFLAGS_SET_FZR  = 1 << 2,
> +};
> +
>  typedef struct NvmeZone {
>      NvmeZoneDescr   d;
>      uint64_t        tstamp;
> +    uint32_t        flags;
>      uint32_t        next;
>      uint32_t        prev;
> -    uint8_t         rsvd80[8];
> +    uint8_t         rsvd84[4];
>  } NvmeZone;
>  
>  #define NVME_ZONE_LIST_NIL    UINT_MAX
> -- 
> 2.21.0
> 
> 



* Re: [PATCH v2 15/18] hw/block/nvme: Support Zone Descriptor Extensions
  2020-06-17 21:34 ` [PATCH v2 15/18] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
@ 2020-07-01 16:32   ` Klaus Jensen
  0 siblings, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-07-01 16:32 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	qemu-devel, Keith Busch, Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> Zone Descriptor Extension is a label that can be assigned to a zone.
> It can be set for an Empty zone and stays assigned until the zone
> is reset.
> 
> This commit adds a new optional property, "zone_descr_ext_size", to
> the driver. Its value must be a multiple of 64 bytes. If this value
> is non-zero, it becomes possible to assign extensions of that size
> to any Empty zone. The default value for this property is 0, so
> setting extensions is disabled by default.
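
For illustration (drive and serial names are placeholders; the zoned
device options come from earlier patches in this series), enabling
64-byte extensions could look like:

    -device nvme,drive=nvme0,serial=foo,zoned=true,zone_descr_ext_size=64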
> 
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Reviewed-by: Klaus Jensen <k.jensen@samsung.com>

> ---
>  hw/block/nvme.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++---
>  hw/block/nvme.h |  8 ++++++
>  2 files changed, 80 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index b9135a6b1f..eb41081627 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1360,6 +1360,26 @@ static bool nvme_cond_offline_all(uint8_t state)
>      return state == NVME_ZONE_STATE_READ_ONLY;
>  }
>  
> +static uint16_t nvme_set_zd_ext(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, uint8_t state)
> +{
> +    uint16_t status;
> +
> +    if (state == NVME_ZONE_STATE_EMPTY) {
> +        nvme_auto_transition_zone(n, ns, false, true);
> +        status = nvme_aor_check(n, ns, 1, 0);
> +        if (status != NVME_SUCCESS) {
> +            return status;
> +        }
> +        nvme_aor_inc_active(n, ns);
> +        zone->d.za |= NVME_ZA_ZD_EXT_VALID;
> +        nvme_assign_zone_state(n, ns, zone, NVME_ZONE_STATE_CLOSED);
> +        return NVME_SUCCESS;
> +    }
> +
> +    return NVME_ZONE_INVAL_TRANSITION;
> +}
> +
>  static uint16_t name_do_zone_op(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeZone *zone, uint8_t state, bool all,
>      uint16_t (*op_hndlr)(NvmeCtrl *, NvmeNamespace *, NvmeZone *,
> @@ -1388,13 +1408,16 @@ static uint16_t name_do_zone_op(NvmeCtrl *n, NvmeNamespace *ns,
>  static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeNamespace *ns,
>      NvmeCmd *cmd, NvmeRequest *req)
>  {
> +    NvmeRwCmd *rw;
>      uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint64_t prp1, prp2;
>      uint64_t slba = 0;
>      uint64_t zone_idx = 0;
>      uint16_t status;
>      uint8_t action, state;
>      bool all;
>      NvmeZone *zone;
> +    uint8_t *zd_ext;
>  
>      action = dw13 & 0xff;
>      all = dw13 & 0x100;
> @@ -1449,7 +1472,25 @@ static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeNamespace *ns,
>  
>      case NVME_ZONE_ACTION_SET_ZD_EXT:
>          trace_pci_nvme_set_descriptor_extension(slba, zone_idx);
> -        return NVME_INVALID_FIELD | NVME_DNR;
> +        if (all || !n->params.zd_extension_size) {
> +            return NVME_INVALID_FIELD | NVME_DNR;
> +        }
> +        zd_ext = nvme_get_zd_extension(n, ns, zone_idx);
> +        rw = (NvmeRwCmd *)cmd;
> +        prp1 = le64_to_cpu(rw->prp1);
> +        prp2 = le64_to_cpu(rw->prp2);
> +        status = nvme_dma_write_prp(n, zd_ext, n->params.zd_extension_size,
> +                                    prp1, prp2);
> +        if (status) {
> +            trace_pci_nvme_err_zd_extension_map_error(zone_idx);
> +            return status;
> +        }
> +
> +        status = nvme_set_zd_ext(n, ns, zone, state);
> +        if (status == NVME_SUCCESS) {
> +            trace_pci_nvme_zd_extension_set(zone_idx);
> +            return status;
> +        }
>          break;
>  
>      default:
> @@ -1528,7 +1569,7 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeNamespace *ns,
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    if (zra == NVME_ZONE_REPORT_EXTENDED) {
> +    if (zra == NVME_ZONE_REPORT_EXTENDED && !n->params.zd_extension_size) {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> @@ -1540,6 +1581,9 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeNamespace *ns,
>      partial = (dw13 >> 16) & 0x01;
>  
>      zone_entry_sz = sizeof(NvmeZoneDescr);
> +    if (zra == NVME_ZONE_REPORT_EXTENDED) {
> +        zone_entry_sz += n->params.zd_extension_size;
> +    }
>  
>      max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
>      buf = g_malloc0(len);
> @@ -1571,6 +1615,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeNamespace *ns,
>              z->wp = cpu_to_le64(~0ULL);
>          }
>  
> +        if (zra == NVME_ZONE_REPORT_EXTENDED) {
> +            if (zs->d.za & NVME_ZA_ZD_EXT_VALID) {
> +                memcpy(buf_p, nvme_get_zd_extension(n, ns, zone_index),
> +                       n->params.zd_extension_size);
> +            }
> +            buf_p += n->params.zd_extension_size;
> +        }
> +
>          zone_index++;
>      }
>  
> @@ -2337,7 +2389,7 @@ static uint16_t nvme_handle_changed_zone_log(NvmeCtrl *n, NvmeCmd *cmd,
>              continue;
>          }
>          num_aen_zones++;
> -        if (zone->d.za) {
> +        if (zone->d.za & ~NVME_ZA_ZD_EXT_VALID) {
>              trace_pci_nvme_reporting_changed_zone(zone->d.zslba, zone->d.za);
>              *zid_ptr++ = cpu_to_le64(zone->d.zslba);
>              nids++;
> @@ -2936,6 +2988,7 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
>      ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
>      ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
>      ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
> +    ns->zd_extensions = g_malloc0(n->params.zd_extension_size * n->num_zones);
>      zone = ns->zone_array;
>  
>      nvme_init_zone_list(ns->exp_open_zones);
> @@ -3010,6 +3063,17 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>      if (n->params.max_active_zones > nz) {
>          n->params.max_active_zones = nz;
>      }
> +    if (n->params.zd_extension_size) {
> +        if (n->params.zd_extension_size & 0x3f) {
> +            error_setg(errp,
> +                "zone descriptor extension size must be a multiple of 64B");
> +            return;
> +        }
> +        if ((n->params.zd_extension_size >> 6) > 0xff) {
> +            error_setg(errp, "zone descriptor extension size is too large");
> +            return;
> +        }
> +    }
>  
>      if (n->params.zone_async_events) {
>          n->ae_cfg |= NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES;
> @@ -3040,7 +3104,8 @@ static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
>      ns->id_ns_zoned->ozcs = n->params.cross_zone_read ? 0x01 : 0x00;
>  
>      ns->id_ns_zoned->lbafe[lba_index].zsze = cpu_to_le64(n->params.zone_size);
> -    ns->id_ns_zoned->lbafe[lba_index].zdes = 0;
> +    ns->id_ns_zoned->lbafe[lba_index].zdes =
> +        n->params.zd_extension_size >> 6; /* Units of 64B */
>  
>      if (n->params.fill_pattern == 0) {
>          ns->id_ns.dlfeat = 0x01;
> @@ -3063,6 +3128,7 @@ static void nvme_zoned_clear(NvmeCtrl *n)
>          g_free(ns->imp_open_zones);
>          g_free(ns->closed_zones);
>          g_free(ns->full_zones);
> +        g_free(ns->zd_extensions);
>      }
>  }
>  
> @@ -3396,6 +3462,8 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
>      DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),
>      DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
> +    DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeCtrl,
> +                       params.zd_extension_size, 0),
>      DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
>      DEFINE_PROP_INT32("max_open", NvmeCtrl, params.max_open_zones, 0),
>      DEFINE_PROP_UINT64("reset_rcmnd_delay", NvmeCtrl, params.rzr_delay_usec, 0),
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index e63f7736d7..4251295917 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -24,6 +24,7 @@ typedef struct NvmeParams {
>      uint64_t    zone_capacity;
>      int32_t     max_active_zones;
>      int32_t     max_open_zones;
> +    uint32_t    zd_extension_size;
>      uint64_t    rzr_delay_usec;
>      uint64_t    rrl_usec;
>      uint64_t    fzr_delay_usec;
> @@ -123,6 +124,7 @@ typedef struct NvmeNamespace {
>      NvmeZoneList    *imp_open_zones;
>      NvmeZoneList    *closed_zones;
>      NvmeZoneList    *full_zones;
> +    uint8_t         *zd_extensions;
>      int32_t         nr_open_zones;
>      int32_t         nr_active_zones;
>      bool            aen_pending;
> @@ -221,6 +223,12 @@ static inline bool nvme_wp_is_valid(NvmeZone *zone)
>             st != NVME_ZONE_STATE_OFFLINE;
>  }
>  
> +static inline uint8_t *nvme_get_zd_extension(NvmeCtrl *n,
> +    NvmeNamespace *ns, uint32_t zone_idx)
> +{
> +    return &ns->zd_extensions[zone_idx * n->params.zd_extension_size];
> +}
> +
>  /*
>   * Initialize a zone list head.
>   */
> -- 
> 2.21.0
> 
> 



* Re: [PATCH v2 17/18] hw/block/nvme: Use zone metadata file for persistence
  2020-06-17 21:34 ` [PATCH v2 17/18] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
@ 2020-07-01 17:26   ` Klaus Jensen
  0 siblings, 0 replies; 49+ messages in thread
From: Klaus Jensen @ 2020-07-01 17:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Niklas Cassel, Damien Le Moal, qemu-block,
	Dmitry Fomichev, qemu-devel, Keith Busch,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, Matias Bjorling

On Jun 18 06:34, Dmitry Fomichev wrote:
> A ZNS drive that is emulated by this driver is currently initialized
> with all zones Empty upon startup. However, actual ZNS SSDs save the
> state and condition of all zones in their internal NVRAM in the event
> of power loss. When such a drive is powered up again, it closes or
> finishes all zones that were open at the moment of shutdown. Besides
> that, the write pointer position as well as the state and condition
> of all zones is preserved across power-downs.
> 
> This commit adds the capability to keep zone metadata persistent
> across controller shutdowns. A new optional driver property,
> "zone_file", is introduced. If added to the command line, this
> property specifies the name of the file that stores the zone
> metadata. If "zone_file" is omitted, the driver initializes with
> all zones empty, the same as before.
> 
> If zone metadata is configured to be persistent, then zone descriptor
> extensions also persist across controller shutdowns.
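
For illustration (drive, serial, and the file path are placeholders),
enabling persistence could look like:

    -device nvme,drive=nvme0,serial=foo,zoned=true,zone_file=nvme0-zones.meta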
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

Stefan, before I review this in depth, can you comment on whether
mmap'ing a file from a device model and issuing regular msync()s is
an acceptable approach to storing state persistently across QEMU
invocations?

I could not find any examples of this in hw/, so I am unsure. I
implemented something like this using an additional blockdev on the
device and doing blk_aio's; just mmap'ing a file seems much simpler,
but perhaps at the cost of portability? On the other hand, I can't
find any examples of using an additional blockdev either.

Can you shed any light on the preferred approach?
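
For reference, the pattern in question reduces to something like this
self-contained sketch (plain POSIX, no QEMU plumbing, error handling
trimmed):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a backing file shared, mutate the in-memory state, and
     * msync() it so the state survives process exit. */
    static int persist_state(const char *path, const void *state,
                             size_t len)
    {
        void *map;
        int fd = open(path, O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
            return -1;
        }
        if (ftruncate(fd, len) < 0) {
            close(fd);
            return -1;
        }
        map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
            close(fd);
            return -1;
        }
        memcpy(map, state, len);
        msync(map, len, MS_SYNC);     /* flush the dirty range to disk */
        munmap(map, len);
        close(fd);
        return 0;
    }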

> ---
>  hw/block/nvme.c | 371 +++++++++++++++++++++++++++++++++++++++++++++---
>  hw/block/nvme.h |  38 +++++
>  2 files changed, 388 insertions(+), 21 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 14d5f1d155..63e7a6352e 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -69,6 +69,8 @@
>      } while (0)
>  
>  static void nvme_process_sq(void *opaque);
> +static void nvme_sync_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, int len);
>  
>  /*
>   * Add a zone to the tail of a zone list.
> @@ -90,6 +92,7 @@ static void nvme_add_zone_tail(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
>          zl->tail = idx;
>      }
>      zl->size++;
> +    nvme_set_zone_meta_dirty(n, ns, true);
>  }
>  
>  /*
> @@ -106,12 +109,15 @@ static void nvme_remove_zone(NvmeCtrl *n, NvmeNamespace *ns, NvmeZoneList *zl,
>      if (zl->size == 0) {
>          zl->head = NVME_ZONE_LIST_NIL;
>          zl->tail = NVME_ZONE_LIST_NIL;
> +        nvme_set_zone_meta_dirty(n, ns, true);
>      } else if (idx == zl->head) {
>          zl->head = zone->next;
>          ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
> +        nvme_set_zone_meta_dirty(n, ns, true);
>      } else if (idx == zl->tail) {
>          zl->tail = zone->prev;
>          ns->zone_array[zl->tail].next = NVME_ZONE_LIST_NIL;
> +        nvme_set_zone_meta_dirty(n, ns, true);
>      } else {
>          ns->zone_array[zone->next].prev = zone->prev;
>          ns->zone_array[zone->prev].next = zone->next;
> @@ -138,6 +144,7 @@ static NvmeZone *nvme_remove_zone_head(NvmeCtrl *n, NvmeNamespace *ns,
>              ns->zone_array[zl->head].prev = NVME_ZONE_LIST_NIL;
>          }
>          zone->prev = zone->next = 0;
> +        nvme_set_zone_meta_dirty(n, ns, true);
>      }
>  
>      return zone;
> @@ -476,6 +483,7 @@ static void nvme_assign_zone_state(NvmeCtrl *n, NvmeNamespace *ns,
>      case NVME_ZONE_STATE_READ_ONLY:
>          zone->tstamp = 0;
>      }
> +    nvme_sync_zone_file(n, ns, zone, sizeof(NvmeZone));
>  }
>  
>  static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> @@ -2976,9 +2984,114 @@ static const MemoryRegionOps nvme_cmb_ops = {
>      },
>  };
>  
> -static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
> +static int nvme_validate_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
>      uint64_t capacity)
>  {
> +    NvmeZoneMeta *meta = ns->zone_meta;
> +    NvmeZone *zone = ns->zone_array;
> +    uint64_t start = 0, zone_size = n->params.zone_size;
> +    int i, n_imp_open = 0, n_exp_open = 0, n_closed = 0, n_full = 0;
> +
> +    if (meta->magic != NVME_ZONE_META_MAGIC) {
> +        return 1;
> +    }
> +    if (meta->version != NVME_ZONE_META_VER) {
> +        return 2;
> +    }
> +    if (meta->zone_size != zone_size) {
> +        return 3;
> +    }
> +    if (meta->zone_capacity != n->params.zone_capacity) {
> +        return 4;
> +    }
> +    if (meta->nr_offline_zones != n->params.nr_offline_zones) {
> +        return 5;
> +    }
> +    if (meta->nr_rdonly_zones != n->params.nr_rdonly_zones) {
> +        return 6;
> +    }
> +    if (meta->lba_size != n->conf.logical_block_size) {
> +        return 7;
> +    }
> +    if (meta->zd_extension_size != n->params.zd_extension_size) {
> +        return 8;
> +    }
> +
> +    for (i = 0; i < n->num_zones; i++, zone++) {
> +        if (start + zone_size > capacity) {
> +            zone_size = capacity - start;
> +        }
> +        if (zone->d.zt != NVME_ZONE_TYPE_SEQ_WRITE) {
> +            return 9;
> +        }
> +        if (zone->d.zcap != n->params.zone_capacity) {
> +            return 10;
> +        }
> +        if (zone->d.zslba != start) {
> +            return 11;
> +        }
> +        switch (nvme_get_zone_state(zone)) {
> +        case NVME_ZONE_STATE_EMPTY:
> +        case NVME_ZONE_STATE_OFFLINE:
> +        case NVME_ZONE_STATE_READ_ONLY:
> +            if (zone->d.wp != start) {
> +                return 12;
> +            }
> +            break;
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +            if (zone->d.wp < start ||
> +                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
> +                return 13;
> +            }
> +            n_imp_open++;
> +            break;
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +            if (zone->d.wp < start ||
> +                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
> +                return 13;
> +            }
> +            n_exp_open++;
> +            break;
> +        case NVME_ZONE_STATE_CLOSED:
> +            if (zone->d.wp < start ||
> +                zone->d.wp >= zone->d.zslba + zone->d.zcap) {
> +                return 13;
> +            }
> +            n_closed++;
> +            break;
> +        case NVME_ZONE_STATE_FULL:
> +            if (zone->d.wp != zone->d.zslba + zone->d.zcap) {
> +                return 14;
> +            }
> +            n_full++;
> +            break;
> +        default:
> +            return 15;
> +        }
> +
> +        start += zone_size;
> +    }
> +
> +    if (n_imp_open != nvme_zone_list_size(ns->imp_open_zones)) {
> +        return 16;
> +    }
> +    if (n_exp_open != nvme_zone_list_size(ns->exp_open_zones)) {
> +        return 17;
> +    }
> +    if (n_closed != nvme_zone_list_size(ns->closed_zones)) {
> +        return 18;
> +    }
> +    if (n_full != nvme_zone_list_size(ns->full_zones)) {
> +        return 19;
> +    }
> +
> +    return 0;
> +}
> +
> +static int nvme_init_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
> +    uint64_t capacity)
> +{
> +    NvmeZoneMeta *meta = ns->zone_meta;
>      NvmeZone *zone;
>      Error *err;
>      uint64_t start = 0, zone_size = n->params.zone_size;
> @@ -2986,18 +3099,33 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
>      int i;
>      uint16_t zs;
>  
> -    ns->zone_array = g_malloc0(n->zone_array_size);
> -    ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
> -    ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
> -    ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
> -    ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
> -    ns->zd_extensions = g_malloc0(n->params.zd_extension_size * n->num_zones);
> +    if (n->params.zone_file) {
> +        meta->magic = NVME_ZONE_META_MAGIC;
> +        meta->version = NVME_ZONE_META_VER;
> +        meta->zone_size = zone_size;
> +        meta->zone_capacity = n->params.zone_capacity;
> +        meta->lba_size = n->conf.logical_block_size;
> +        meta->nr_offline_zones = n->params.nr_offline_zones;
> +        meta->nr_rdonly_zones = n->params.nr_rdonly_zones;
> +        meta->zd_extension_size = n->params.zd_extension_size;
> +    } else {
> +        ns->zone_array = g_malloc0(n->zone_array_size);
> +        ns->exp_open_zones = g_malloc0(sizeof(NvmeZoneList));
> +        ns->imp_open_zones = g_malloc0(sizeof(NvmeZoneList));
> +        ns->closed_zones = g_malloc0(sizeof(NvmeZoneList));
> +        ns->full_zones = g_malloc0(sizeof(NvmeZoneList));
> +        ns->zd_extensions =
> +            g_malloc0(n->params.zd_extension_size * n->num_zones);
> +    }
>      zone = ns->zone_array;
>  
>      nvme_init_zone_list(ns->exp_open_zones);
>      nvme_init_zone_list(ns->imp_open_zones);
>      nvme_init_zone_list(ns->closed_zones);
>      nvme_init_zone_list(ns->full_zones);
> +    if (n->params.zone_file) {
> +        nvme_set_zone_meta_dirty(n, ns, true);
> +    }
>  
>      for (i = 0; i < n->num_zones; i++, zone++) {
>          if (start + zone_size > capacity) {
> @@ -3048,7 +3176,189 @@ static int nvme_init_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
>      return 0;
>  }
>  
> -static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
> +static int nvme_open_zone_file(NvmeCtrl *n, bool *init_meta)
> +{
> +    struct stat statbuf;
> +    size_t fsize;
> +    int ret;
> +
> +    ret = stat(n->params.zone_file, &statbuf);
> +    if (ret && errno == ENOENT) {
> +        *init_meta = true;
> +    } else if (!S_ISREG(statbuf.st_mode)) {
> +        fprintf(stderr, "%s is not a regular file\n", n->params.zone_file);
> +        return -1;
> +    }
> +
> +    n->zone_file_fd = open(n->params.zone_file,
> +                           O_RDWR | O_LARGEFILE | O_BINARY | O_CREAT, 0644);
> +    if (n->zone_file_fd < 0) {
> +            fprintf(stderr, "failed to create zone file %s, err %s\n",
> +                    n->params.zone_file, strerror(errno));
> +            return -1;
> +    }
> +
> +    fsize = n->meta_size * n->num_namespaces;
> +
> +    if (stat(n->params.zone_file, &statbuf)) {
> +        fprintf(stderr, "can't stat zone file %s, err %s\n",
> +                n->params.zone_file, strerror(errno));
> +        return -1;
> +    }
> +    if (statbuf.st_size != fsize) {
> +        ret = ftruncate(n->zone_file_fd, fsize);
> +        if (ret < 0) {
> +            fprintf(stderr, "can't truncate zone file %s, err %s\n",
> +                    n->params.zone_file, strerror(errno));
> +            return -1;
> +        }
> +        *init_meta = true;
> +    }
> +
> +    return 0;
> +}
> +
> +static int nvme_map_zone_file(NvmeCtrl *n, NvmeNamespace *ns, bool *init_meta)
> +{
> +    off_t meta_ofs = n->meta_size * (ns->nsid - 1);
> +
> +    ns->zone_meta = mmap(0, n->meta_size, PROT_READ | PROT_WRITE,
> +                         MAP_SHARED, n->zone_file_fd, meta_ofs);
> +    if (ns->zone_meta == MAP_FAILED) {
> +        fprintf(stderr, "failed to map zone file %s, ofs %lu, err %s\n",
> +                n->params.zone_file, meta_ofs, strerror(errno));
> +        return -1;
> +    }
> +
> +    ns->zone_array = (NvmeZone *)(ns->zone_meta + 1);
> +    ns->exp_open_zones = &ns->zone_meta->exp_open_zones;
> +    ns->imp_open_zones = &ns->zone_meta->imp_open_zones;
> +    ns->closed_zones = &ns->zone_meta->closed_zones;
> +    ns->full_zones = &ns->zone_meta->full_zones;
> +
> +    if (n->params.zd_extension_size) {
> +        ns->zd_extensions = (uint8_t *)(ns->zone_meta + 1);
> +        ns->zd_extensions += n->zone_array_size;
> +    }
> +
> +    return 0;
> +}
> +
> +static void nvme_sync_zone_file(NvmeCtrl *n, NvmeNamespace *ns,
> +    NvmeZone *zone, int len)
> +{
> +    uintptr_t addr, zd = (uintptr_t)zone;
> +
> +    addr = zd & qemu_real_host_page_mask;
> +    len += zd - addr;
> +    if (msync((void *)addr, len, MS_ASYNC) < 0)
> +        fprintf(stderr, "msync: failed to sync zone descriptors, file %s\n",
> +                strerror(errno));
> +
> +    if (nvme_zone_meta_dirty(n, ns)) {
> +        nvme_set_zone_meta_dirty(n, ns, false);
> +        if (msync(ns->zone_meta, sizeof(NvmeZoneMeta), MS_ASYNC) < 0)
> +            fprintf(stderr, "msync: failed to sync zone meta, file %s\n",
> +                    strerror(errno));
> +    }
> +}
> +
> +/*
> + * Close or finish all the zones that might be still open after power-down.
> + */
> +static void nvme_prepare_zones(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    NvmeZone *zone;
> +    uint32_t set_state;
> +    int i;
> +
> +    assert(!ns->nr_active_zones);
> +    assert(!ns->nr_open_zones);
> +
> +    zone = ns->zone_array;
> +    for (i = 0; i < n->num_zones; i++, zone++) {
> +        zone->flags = 0;
> +        zone->tstamp = 0;
> +
> +        switch (nvme_get_zone_state(zone)) {
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +            break;
> +        case NVME_ZONE_STATE_CLOSED:
> +            nvme_aor_inc_active(n, ns);
> +            /* fall through */
> +        default:
> +            continue;
> +        }
> +
> +        if (zone->d.za & NVME_ZA_ZD_EXT_VALID) {
> +            set_state = NVME_ZONE_STATE_CLOSED;
> +        } else if (zone->d.wp == zone->d.zslba) {
> +            set_state = NVME_ZONE_STATE_EMPTY;
> +        } else if (n->params.max_active_zones == 0 ||
> +                   ns->nr_active_zones < n->params.max_active_zones) {
> +            set_state = NVME_ZONE_STATE_CLOSED;
> +        } else {
> +            set_state = NVME_ZONE_STATE_FULL;
> +        }
> +
> +        switch (set_state) {
> +        case NVME_ZONE_STATE_CLOSED:
> +            trace_pci_nvme_power_on_close(nvme_get_zone_state(zone),
> +                                          zone->d.zslba);
> +            nvme_aor_inc_active(n, ns);
> +            nvme_add_zone_tail(n, ns, ns->closed_zones, zone);
> +            break;
> +        case NVME_ZONE_STATE_EMPTY:
> +            trace_pci_nvme_power_on_reset(nvme_get_zone_state(zone),
> +                                          zone->d.zslba);
> +            break;
> +        case NVME_ZONE_STATE_FULL:
> +            trace_pci_nvme_power_on_full(nvme_get_zone_state(zone),
> +                                         zone->d.zslba);
> +            zone->d.wp = nvme_zone_wr_boundary(zone);
> +        }
> +
> +        nvme_set_zone_state(zone, set_state);
> +    }
> +}
> +
> +static int nvme_load_zone_meta(NvmeCtrl *n, NvmeNamespace *ns,
> +    uint64_t capacity, bool init_meta)
> +{
> +    int ret = 0;
> +
> +    if (n->params.zone_file) {
> +        ret = nvme_map_zone_file(n, ns, &init_meta);
> +        trace_pci_nvme_mapped_zone_file(n->params.zone_file, ret);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        if (!init_meta) {
> +            ret = nvme_validate_zone_file(n, ns, capacity);
> +            if (ret) {
> +                trace_pci_nvme_err_zone_file_invalid(ret);
> +                init_meta = true;
> +            }
> +        }
> +    } else {
> +        init_meta = true;
> +    }
> +
> +    if (init_meta) {
> +        ret = nvme_init_zone_file(n, ns, capacity);
> +    } else {
> +        nvme_prepare_zones(n, ns);
> +    }
> +    if (!ret && n->params.zone_file) {
> +        nvme_sync_zone_file(n, ns, ns->zone_array, n->zone_array_size);
> +    }
> +
> +    return ret;
> +}
> +
> +static void nvme_zoned_init_ctrl(NvmeCtrl *n, bool *init_meta, Error **errp)
>  {
>      uint64_t zone_size = 0, capacity;
>      uint32_t nz;
> @@ -3084,6 +3394,9 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>      nz = DIV_ROUND_UP(capacity, zone_size);
>      n->num_zones = nz;
>      n->zone_array_size = sizeof(NvmeZone) * nz;
> +    n->meta_size = sizeof(NvmeZoneMeta) + n->zone_array_size +
> +                          nz * n->params.zd_extension_size;
> +    n->meta_size = ROUND_UP(n->meta_size, qemu_real_host_page_size);
>  
>      n->params.rzr_delay_usec *= SCALE_MS;
>      n->params.rrl_usec *= SCALE_MS;
> @@ -3119,6 +3432,13 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>          }
>      }
>  
> +    if (n->params.zone_file) {
> +        if (nvme_open_zone_file(n, init_meta) < 0) {
> +            error_setg(errp, "cannot open zone metadata file");
> +            return;
> +        }
> +    }
> +
>      if (n->params.zone_async_events) {
>          n->ae_cfg |= NVME_AEN_CFG_ZONE_DESCR_CHNGD_NOTICES;
>      }
> @@ -3127,13 +3447,14 @@ static void nvme_zoned_init_ctrl(NvmeCtrl *n, Error **errp)
>  }
>  
>  static int nvme_zoned_init_ns(NvmeCtrl *n, NvmeNamespace *ns, int lba_index,
> -    Error **errp)
> +    bool init_meta, Error **errp)
>  {
>      int ret;
>  
> -    ret = nvme_init_zone_meta(n, ns, n->num_zones * n->params.zone_size);
> +    ret = nvme_load_zone_meta(n, ns, n->num_zones * n->params.zone_size,
> +                              init_meta);
>      if (ret) {
> -        error_setg(errp, "could not init zone metadata");
> +        error_setg(errp, "could not load/init zone metadata");
>          return -1;
>      }
>  
> @@ -3164,15 +3485,20 @@ static void nvme_zoned_clear(NvmeCtrl *n)
>  {
>      int i;
>  
> +    if (n->params.zone_file)  {
> +        close(n->zone_file_fd);
> +    }
>      for (i = 0; i < n->num_namespaces; i++) {
>          NvmeNamespace *ns = &n->namespaces[i];
>          g_free(ns->id_ns_zoned);
> -        g_free(ns->zone_array);
> -        g_free(ns->exp_open_zones);
> -        g_free(ns->imp_open_zones);
> -        g_free(ns->closed_zones);
> -        g_free(ns->full_zones);
> -        g_free(ns->zd_extensions);
> +        if (!n->params.zone_file) {
> +            g_free(ns->zone_array);
> +            g_free(ns->exp_open_zones);
> +            g_free(ns->imp_open_zones);
> +            g_free(ns->closed_zones);
> +            g_free(ns->full_zones);
> +            g_free(ns->zd_extensions);
> +        }
>      }
>  }
>  
> @@ -3258,7 +3584,8 @@ static void nvme_init_blk(NvmeCtrl *n, Error **errp)
>      n->ns_size = bs_size;
>  }
>  
> -static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> +static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, bool init_meta,
> +    Error **errp)
>  {
>      NvmeIdNs *id_ns = &ns->id_ns;
>      int lba_index;
> @@ -3272,7 +3599,7 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>      if (n->params.zoned) {
>          ns->csi = NVME_CSI_ZONED;
>          id_ns->ncap = cpu_to_le64(n->params.zone_capacity * n->num_zones);
> -        if (nvme_zoned_init_ns(n, ns, lba_index, errp) != 0) {
> +        if (nvme_zoned_init_ns(n, ns, lba_index, init_meta, errp) != 0) {
>              return;
>          }
>      } else {
> @@ -3429,6 +3756,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      NvmeCtrl *n = NVME(pci_dev);
>      NvmeNamespace *ns;
>      Error *local_err = NULL;
> +    bool init_meta = false;
>  
>      int i;
>  
> @@ -3452,7 +3780,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      }
>  
>      if (n->params.zoned) {
> -        nvme_zoned_init_ctrl(n, &local_err);
> +        nvme_zoned_init_ctrl(n, &init_meta, &local_err);
>          if (local_err) {
>              error_propagate(errp, local_err);
>              return;
> @@ -3463,7 +3791,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      ns = n->namespaces;
>      for (i = 0; i < n->num_namespaces; i++, ns++) {
>          ns->nsid = i + 1;
> -        nvme_init_namespace(n, ns, &local_err);
> +        nvme_init_namespace(n, ns, init_meta, &local_err);
>          if (local_err) {
>              error_propagate(errp, local_err);
>              return;
> @@ -3506,6 +3834,7 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT64("zone_size", NvmeCtrl, params.zone_size, 512),
>      DEFINE_PROP_UINT64("zone_capacity", NvmeCtrl, params.zone_capacity, 512),
>      DEFINE_PROP_UINT32("zone_append_max_size", NvmeCtrl, params.zamds_bs, 0),
> +    DEFINE_PROP_STRING("zone_file", NvmeCtrl, params.zone_file),
>      DEFINE_PROP_UINT32("zone_descr_ext_size", NvmeCtrl,
>                         params.zd_extension_size, 0),
>      DEFINE_PROP_INT32("max_active", NvmeCtrl, params.max_active_zones, 0),
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 900fc54809..5e9a3a62f7 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -14,6 +14,7 @@ typedef struct NvmeParams {
>      uint16_t msix_qsize;
>      uint32_t cmb_size_mb;
>  
> +    char        *zone_file;
>      bool        zoned;
>      bool        cross_zone_read;
>      bool        zone_async_events;
> @@ -114,6 +115,27 @@ typedef struct NvmeZoneList {
>      uint8_t         rsvd12[4];
>  } NvmeZoneList;
>  
> +#define NVME_ZONE_META_MAGIC 0x3aebaa70
> +#define NVME_ZONE_META_VER  1
> +
> +typedef struct NvmeZoneMeta {
> +    uint32_t        magic;
> +    uint32_t        version;
> +    uint64_t        zone_size;
> +    uint64_t        zone_capacity;
> +    uint32_t        nr_offline_zones;
> +    uint32_t        nr_rdonly_zones;
> +    uint32_t        lba_size;
> +    uint32_t        rsvd40;
> +    NvmeZoneList    exp_open_zones;
> +    NvmeZoneList    imp_open_zones;
> +    NvmeZoneList    closed_zones;
> +    NvmeZoneList    full_zones;
> +    uint8_t         zd_extension_size;
> +    uint8_t         dirty;
> +    uint8_t         rsvd594[3990];
> +} NvmeZoneMeta;
> +
>  typedef struct NvmeNamespace {
>      NvmeIdNs        id_ns;
>      uint32_t        nsid;
> @@ -122,6 +144,7 @@ typedef struct NvmeNamespace {
>  
>      NvmeIdNsZoned   *id_ns_zoned;
>      NvmeZone        *zone_array;
> +    NvmeZoneMeta    *zone_meta;
>      NvmeZoneList    *exp_open_zones;
>      NvmeZoneList    *imp_open_zones;
>      NvmeZoneList    *closed_zones;
> @@ -174,6 +197,7 @@ typedef struct NvmeCtrl {
>  
>      int             zone_file_fd;
>      uint32_t        num_zones;
> +    size_t          meta_size;
>      uint64_t        zone_size_bs;
>      uint64_t        zone_array_size;
>      uint8_t         zamds;
> @@ -282,6 +306,19 @@ static inline NvmeZone *nvme_next_zone_in_list(NvmeNamespace *ns, NvmeZone *z,
>      return &ns->zone_array[z->next];
>  }
>  
> +static inline bool nvme_zone_meta_dirty(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    return n->params.zone_file ? ns->zone_meta->dirty : false;
> +}
> +
> +static inline void nvme_set_zone_meta_dirty(NvmeCtrl *n, NvmeNamespace *ns,
> +    bool yesno)
> +{
> +    if (n->params.zone_file) {
> +        ns->zone_meta->dirty = yesno;
> +    }
> +}
> +
>  static inline int nvme_ilog2(uint64_t i)
>  {
>      int log = -1;
> @@ -295,6 +332,7 @@ static inline int nvme_ilog2(uint64_t i)
>  
>  static inline void _hw_nvme_check_size(void)
>  {
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneMeta) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeZoneList) != 16);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeZone) != 88);
>  }
> -- 
> 2.21.0
> 
> 


Thread overview: 49+ messages
2020-06-17 21:33 [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
2020-06-17 21:33 ` [PATCH v2 01/18] hw/block/nvme: Move NvmeRequest has_sg field to a bit flag Dmitry Fomichev
2020-06-30  0:56   ` Alistair Francis
2020-06-30  4:09   ` Klaus Jensen
2020-06-17 21:33 ` [PATCH v2 02/18] hw/block/nvme: Define 64 bit cqe.result Dmitry Fomichev
2020-06-30  0:58   ` Alistair Francis
2020-06-30  4:15   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 03/18] hw/block/nvme: Clean up unused AER definitions Dmitry Fomichev
2020-06-30  1:00   ` Alistair Francis
2020-06-30  4:40   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 04/18] hw/block/nvme: Add Commands Supported and Effects log Dmitry Fomichev
2020-06-30  1:35   ` Alistair Francis
2020-06-30  4:46   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions Dmitry Fomichev
2020-06-30  2:12   ` Alistair Francis
2020-06-30 10:02     ` Niklas Cassel
2020-06-30 17:02       ` Keith Busch
2020-06-30  4:57   ` Klaus Jensen
2020-06-30 16:04     ` Niklas Cassel
2020-06-17 21:34 ` [PATCH v2 06/18] hw/block/nvme: Define trace events related to NS Types Dmitry Fomichev
2020-06-30 10:20   ` Klaus Jensen
2020-06-30 20:18   ` Alistair Francis
2020-06-17 21:34 ` [PATCH v2 07/18] hw/block/nvme: Add support for Namespace Types Dmitry Fomichev
2020-06-30 11:31   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 08/18] hw/block/nvme: Make Zoned NS Command Set definitions Dmitry Fomichev
2020-06-30 11:44   ` Klaus Jensen
2020-06-30 12:08     ` Klaus Jensen
2020-06-30 22:11   ` Alistair Francis
2020-06-17 21:34 ` [PATCH v2 09/18] hw/block/nvme: Define Zoned NS Command Set trace events Dmitry Fomichev
2020-06-30 12:14   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 10/18] hw/block/nvme: Support Zoned Namespace Command Set Dmitry Fomichev
2020-06-30 13:31   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 11/18] hw/block/nvme: Introduce max active and open zone limits Dmitry Fomichev
2020-07-01  0:26   ` Alistair Francis
2020-07-01  6:41   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 12/18] hw/block/nvme: Simulate Zone Active excursions Dmitry Fomichev
2020-07-01  0:30   ` Alistair Francis
2020-07-01  6:12   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 13/18] hw/block/nvme: Set Finish/Reset Zone Recommended attributes Dmitry Fomichev
2020-07-01 16:23   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 14/18] hw/block/nvme: Generate zone AENs Dmitry Fomichev
2020-07-01 11:44   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 15/18] hw/block/nvme: Support Zone Descriptor Extensions Dmitry Fomichev
2020-07-01 16:32   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 16/18] hw/block/nvme: Add injection of Offline/Read-Only zones Dmitry Fomichev
2020-06-17 21:34 ` [PATCH v2 17/18] hw/block/nvme: Use zone metadata file for persistence Dmitry Fomichev
2020-07-01 17:26   ` Klaus Jensen
2020-06-17 21:34 ` [PATCH v2 18/18] hw/block/nvme: Document zoned parameters in usage text Dmitry Fomichev
2020-06-29 20:26 ` [PATCH v2 00/18] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set Dmitry Fomichev
