* [PATCH 00/16] hw/block/nvme: zoned namespace command set
@ 2020-09-24 20:45 Klaus Jensen
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

While going through a few rounds of reviews on Dmitry's series, I have
also continued nursing my own implementation since it was originally
posted. It did not receive any reviews back then, since it depended on
a lot of preceding series, but now, with the staging of multiple
namespaces on nvme-next yesterday, I think it deserves another shot,
since it now applies directly. The series consists of a couple of
trivial patches, followed by "hw/block/nvme: add support for dulbe and
block utilization tracking", "hw/block/nvme: support namespace types",
and the set of zoned namespace support patches.

A couple of points on how this differs from Dmitry et al.'s series and
why I think this implementation deserves a review.

  * Standard blockdev-based approach to persistent state. The
    implementation uses a plain blockdev associated with the nvme-ns
    device for storing state persistently. This same 'pstate' blockdev
    is also used for logical block allocation tracking (see the example
    invocation below).

  * Relies on automatic configuration of DLFEAT according to what the
    underlying blockdev provides (i.e. BDRV_O_UNMAP for guaranteeing
    zeroes on discarded blocks) for handling reads in the gaps between
    write pointer, ZCAP and ZSZE. Issues discards for zone resets. This
    removes the zero filling.
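
As an example of the blockdev-based pstate approach described above, a
minimal invocation might look like the following (a sketch; the image
file names and the serial are assumptions):

    -drive id=nvm,file=nvm.img,format=raw,if=none
    -drive id=pstate,file=pstate.img,format=raw,if=none
    -device nvme,serial=deadbeef
    -device nvme-ns,drive=nvm,pstate=pstate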

Finally, I wrote this. I am *NOT* saying that this somehow makes it
better, but as a maintainer, it is a big deal to me, since both series
are arguably a lot of code to maintain and support (they are about the
same size). But I am not the only maintainer, so if Keith (now suddenly
placed in the grim role as some sort of arbiter) signs off on Dmitry's
series, then so be it; I will rest my case.

I think we all want to see an implementation of zoned namespaces in
QEMU sooner rather than later, but I would be lying if I said I
wouldn't prefer that it be this one.

Based-on: <20200922084533.1273962-1-its@irrelevant.dk>

Gollu Appalanaidu (1):
  hw/block/nvme: add commands supported and effects log page

Klaus Jensen (15):
  hw/block/nvme: add nsid to get/setfeat trace events
  hw/block/nvme: add trace event for requests with non-zero status code
  hw/block/nvme: make lba data size configurable
  hw/block/nvme: reject io commands if only admin command set selected
  hw/block/nvme: consolidate read, write and write zeroes
  hw/block/nvme: add support for dulbe and block utilization tracking
  hw/block/nvme: support namespace types
  hw/block/nvme: add basic read/write for zoned namespaces
  hw/block/nvme: add the zone management receive command
  hw/block/nvme: add the zone management send command
  hw/block/nvme: add the zone append command
  hw/block/nvme: track and enforce zone resources
  hw/block/nvme: allow open to close transitions by controller
  hw/block/nvme: support zone active excursions
  hw/block/nvme: support reset/finish recommended limits

 docs/specs/nvme.txt   |   49 +-
 hw/block/nvme-ns.h    |  166 +++-
 hw/block/nvme.h       |   24 +
 include/block/nvme.h  |  262 ++++++-
 block/nvme.c          |    4 +-
 hw/block/nvme-ns.c    |  411 +++++++++-
 hw/block/nvme.c       | 1727 +++++++++++++++++++++++++++++++++++++++--
 hw/block/trace-events |   50 +-
 8 files changed, 2580 insertions(+), 113 deletions(-)

-- 
2.28.0




* [PATCH 01/16] hw/block/nvme: add nsid to get/setfeat trace events
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Include the namespace id in the pci_nvme_{get,set}feat trace events.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c       | 4 ++--
 hw/block/trace-events | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index da8344f196a8..84b6b516fa7b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1660,7 +1660,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
         [NVME_ARBITRATION] = NVME_ARB_AB_NOLIMIT,
     };
 
-    trace_pci_nvme_getfeat(nvme_cid(req), fid, sel, dw11);
+    trace_pci_nvme_getfeat(nvme_cid(req), nsid, fid, sel, dw11);
 
     if (!nvme_feature_support[fid]) {
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1798,7 +1798,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
     uint8_t fid = NVME_GETSETFEAT_FID(dw10);
     uint8_t save = NVME_SETFEAT_SAVE(dw10);
 
-    trace_pci_nvme_setfeat(nvme_cid(req), fid, save, dw11);
+    trace_pci_nvme_setfeat(nvme_cid(req), nsid, fid, save, dw11);
 
     if (save) {
         return NVME_FID_NOT_SAVEABLE | NVME_DNR;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 446cca08e9ab..5a239b80bf36 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -53,8 +53,8 @@ pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
-pci_nvme_getfeat(uint16_t cid, uint8_t fid, uint8_t sel, uint32_t cdw11) "cid %"PRIu16" fid 0x%"PRIx8" sel 0x%"PRIx8" cdw11 0x%"PRIx32""
-pci_nvme_setfeat(uint16_t cid, uint8_t fid, uint8_t save, uint32_t cdw11) "cid %"PRIu16" fid 0x%"PRIx8" save 0x%"PRIx8" cdw11 0x%"PRIx32""
+pci_nvme_getfeat(uint16_t cid, uint32_t nsid, uint8_t fid, uint8_t sel, uint32_t cdw11) "cid %"PRIu16" nsid 0x%"PRIx32" fid 0x%"PRIx8" sel 0x%"PRIx8" cdw11 0x%"PRIx32""
+pci_nvme_setfeat(uint16_t cid, uint32_t nsid, uint8_t fid, uint8_t save, uint32_t cdw11) "cid %"PRIu16" nsid 0x%"PRIx32" fid 0x%"PRIx8" save 0x%"PRIx8" cdw11 0x%"PRIx32""
 pci_nvme_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
 pci_nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
 pci_nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
-- 
2.28.0




* [PATCH 02/16] hw/block/nvme: add trace event for requests with non-zero status code
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

If a command results in a non-zero status code, trace it.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c       | 6 ++++++
 hw/block/trace-events | 1 +
 2 files changed, 7 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 84b6b516fa7b..3cbc3c7b75b1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -777,6 +777,12 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
     assert(cq->cqid == req->sq->cqid);
     trace_pci_nvme_enqueue_req_completion(nvme_cid(req), cq->cqid,
                                           req->status);
+
+    if (req->status) {
+        trace_pci_nvme_err_req_status(nvme_cid(req), nvme_nsid(req->ns),
+                                      req->status, req->cmd.opcode);
+    }
+
     QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
     QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
     timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 5a239b80bf36..9e7507c5abde 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -89,6 +89,7 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
+pci_nvme_err_req_status(uint16_t cid, uint32_t nsid, uint16_t status, uint8_t opc) "cid %"PRIu16" nsid %"PRIu32" status 0x%"PRIx16" opc 0x%"PRIx8""
 pci_nvme_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_err_cfs(void) "controller fatal status"
-- 
2.28.0




* [PATCH 03/16] hw/block/nvme: make lba data size configurable
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen, Maxim Levitsky,
	Philippe Mathieu-Daudé

From: Klaus Jensen <k.jensen@samsung.com>

Allow the LBA data size (lbads) to be set between 9 and 12.
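
For example (a hypothetical invocation; 'nvm' is an assumed drive id),
lbads=12 yields a namespace with 4096-byte logical blocks:

    -device nvme-ns,drive=nvm,lbads=12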

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
---
 docs/specs/nvme.txt | 11 ++++++++++-
 hw/block/nvme-ns.h  |  1 +
 hw/block/nvme-ns.c  |  8 +++++++-
 hw/block/nvme.c     |  1 +
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 56d393884e7a..438ca50d698c 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -1,7 +1,16 @@
 NVM Express Controller
 ======================
 
-The nvme device (-device nvme) emulates an NVM Express Controller.
+The nvme device (-device nvme) emulates an NVM Express Controller. It is used
+together with nvme-ns devices (-device nvme-ns), which emulate NVM Express
+Namespaces.
+
+nvme-ns Options
+---------------
+
+  `lbads`; The "LBA Data Size (LBADS)" indicates the LBA data size used by the
+     namespace. It is specified in terms of a power of two. Only values between
+     9 and 12 (both inclusive) are supported.
 
 
 Reference Specifications
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 83734f4606e1..78b0d1a00672 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -21,6 +21,7 @@
 
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
+    uint8_t  lbads;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 2ba0263ddaca..576c7486f45b 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -36,7 +36,7 @@ static void nvme_ns_init(NvmeNamespace *ns)
         ns->id_ns.dlfeat = 0x9;
     }
 
-    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+    id_ns->lbaf[0].ds = ns->params.lbads;
 
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
 
@@ -77,6 +77,11 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
+    if (ns->params.lbads < 9 || ns->params.lbads > 12) {
+        error_setg(errp, "unsupported lbads (supported: 9-12)");
+        return -1;
+    }
+
     return 0;
 }
 
@@ -125,6 +130,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 static Property nvme_ns_props[] = {
     DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
     DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
+    DEFINE_PROP_UINT8("lbads", NvmeNamespace, params.lbads, BDRV_SECTOR_BITS),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3cbc3c7b75b1..758f58c88026 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -2812,6 +2812,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     if (n->namespace.blkconf.blk) {
         ns = &n->namespace;
         ns->params.nsid = 1;
+        ns->params.lbads = BDRV_SECTOR_BITS;
 
         if (nvme_ns_setup(n, ns, errp)) {
             return;
-- 
2.28.0




* [PATCH 04/16] hw/block/nvme: reject io commands if only admin command set selected
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

If the host sets CC.CSS to 111b, all commands submitted to I/O queues
should be completed with status Invalid Command Opcode.

Note that this is technically a v1.4 feature, but it does not hurt to
implement it before we finally bump the reported version.
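
A sketch of the gate this adds on the I/O path (the same check appears
in the diff below; CSS occupies bits 6:4 of the CC register):

    if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
        return NVME_INVALID_OPCODE | NVME_DNR;
    }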

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 include/block/nvme.h | 5 +++++
 hw/block/nvme.c      | 4 ++++
 2 files changed, 9 insertions(+)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 58647bcdad0b..7a30cf285ae0 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -110,6 +110,11 @@ enum NvmeCcMask {
 #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
 #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
 
+enum NvmeCcCss {
+    NVME_CC_CSS_NVM        = 0x0,
+    NVME_CC_CSS_ADMIN_ONLY = 0x7,
+};
+
 enum NvmeCstsShift {
     CSTS_RDY_SHIFT      = 0,
     CSTS_CFS_SHIFT      = 1,
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 758f58c88026..27af2f0b38d5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1065,6 +1065,10 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
                           req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
 
+    if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
     if (!nvme_nsid_valid(n, nsid)) {
         return NVME_INVALID_NSID | NVME_DNR;
     }
-- 
2.28.0




* [PATCH 05/16] hw/block/nvme: consolidate read, write and write zeroes
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Consolidate the read/write and write zeroes functions.
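
This works because, in NVMe, bits 1:0 of an opcode encode the data
transfer direction, so a non-zero value means the command transfers
data (Write Zeroes, opcode 0x08, transfers none). A minimal sketch of
the check this enables (mirroring the new
NVME_CMD_OPCODE_DATA_TRANSFER_MASK below):

    /* opcode bits 1:0 are 00b for commands without a data transfer */
    if (req->cmd.opcode & NVME_CMD_OPCODE_DATA_TRANSFER_MASK) {
        /* verify MDTS and map the data pointer */
    }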

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.h       | 11 ++++++++
 include/block/nvme.h  |  2 ++
 hw/block/nvme.c       | 65 +++++++++++++++----------------------------
 hw/block/trace-events |  3 +-
 4 files changed, 37 insertions(+), 44 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index e080a2318a50..ccf52ac7bb82 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -36,6 +36,17 @@ typedef struct NvmeRequest {
     QTAILQ_ENTRY(NvmeRequest)entry;
 } NvmeRequest;
 
+static inline bool nvme_req_is_write(NvmeRequest *req)
+{
+    switch (req->cmd.opcode) {
+    case NVME_CMD_WRITE:
+    case NVME_CMD_WRITE_ZEROES:
+        return true;
+    default:
+        return false;
+    }
+}
+
 static inline const char *nvme_adm_opc_str(uint8_t opc)
 {
     switch (opc) {
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 7a30cf285ae0..999b4f8ae0d4 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -438,6 +438,8 @@ typedef struct QEMU_PACKED NvmeCmd {
     uint32_t    cdw15;
 } NvmeCmd;
 
+#define NVME_CMD_OPCODE_DATA_TRANSFER_MASK 0x3
+
 #define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
 #define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 27af2f0b38d5..795c7e7c529f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -997,48 +997,19 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
     return nvme_do_aio(ns->blkconf.blk, 0, 0, req);
 }
 
-static uint16_t nvme_write_zeroes(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
+
     uint64_t slba = le64_to_cpu(rw->slba);
     uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
-    uint64_t offset = nvme_l2b(ns, slba);
-    uint32_t count = nvme_l2b(ns, nlb);
+    size_t len = nvme_l2b(ns, nlb);
+
     uint16_t status;
 
-    trace_pci_nvme_write_zeroes(nvme_cid(req), nvme_nsid(ns), slba, nlb);
-
-    status = nvme_check_bounds(n, ns, slba, nlb);
-    if (status) {
-        trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
-        return status;
-    }
-
-    return nvme_do_aio(ns->blkconf.blk, offset, count, req);
-}
-
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
-{
-    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
-    NvmeNamespace *ns = req->ns;
-    uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
-    uint64_t slba = le64_to_cpu(rw->slba);
-
-    uint64_t data_size = nvme_l2b(ns, nlb);
-    uint64_t data_offset = nvme_l2b(ns, slba);
-    enum BlockAcctType acct = req->cmd.opcode == NVME_CMD_WRITE ?
-        BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
-    uint16_t status;
-
-    trace_pci_nvme_rw(nvme_cid(req), nvme_io_opc_str(rw->opcode),
-                      nvme_nsid(ns), nlb, data_size, slba);
-
-    status = nvme_check_mdts(n, data_size);
-    if (status) {
-        trace_pci_nvme_err_mdts(nvme_cid(req), data_size);
-        goto invalid;
-    }
+    trace_pci_nvme_rwz(nvme_cid(req), nvme_io_opc_str(rw->opcode),
+                       nvme_nsid(ns), nlb, len, slba);
 
     status = nvme_check_bounds(n, ns, slba, nlb);
     if (status) {
@@ -1046,15 +1017,26 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
         goto invalid;
     }
 
-    status = nvme_map_dptr(n, data_size, req);
-    if (status) {
-        goto invalid;
+    if (req->cmd.opcode & NVME_CMD_OPCODE_DATA_TRANSFER_MASK) {
+        status = nvme_check_mdts(n, len);
+        if (status) {
+            trace_pci_nvme_err_mdts(nvme_cid(req), len);
+            goto invalid;
+        }
+
+        status = nvme_map_dptr(n, len, req);
+        if (status) {
+            goto invalid;
+        }
     }
 
-    return nvme_do_aio(ns->blkconf.blk, data_offset, data_size, req);
+    return nvme_do_aio(ns->blkconf.blk, nvme_l2b(ns, slba), len, req);
 
 invalid:
-    block_acct_invalid(blk_get_stats(ns->blkconf.blk), acct);
+    block_acct_invalid(blk_get_stats(ns->blkconf.blk),
+                       nvme_req_is_write(req) ? BLOCK_ACCT_WRITE :
+                       BLOCK_ACCT_READ);
+
     return status;
 }
 
@@ -1082,10 +1064,9 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_FLUSH:
         return nvme_flush(n, req);
     case NVME_CMD_WRITE_ZEROES:
-        return nvme_write_zeroes(n, req);
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
-        return nvme_rw(n, req);
+        return nvme_rwz(n, req);
     default:
         trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 9e7507c5abde..b18056c49836 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -40,9 +40,8 @@ pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2,
 pci_nvme_map_sgl(uint16_t cid, uint8_t typ, uint64_t len) "cid %"PRIu16" type 0x%"PRIx8" len %"PRIu64""
 pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
 pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
-pci_nvme_rw(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
+pci_nvme_rwz(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t len, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" len %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
-pci_nvme_write_zeroes(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
 pci_nvme_do_aio(uint16_t cid, uint8_t opc, const char *opname, const char *blkname, int64_t offset, size_t len) "cid %"PRIu16" opc 0x%"PRIx8" opname '%s' blk '%s' offset %"PRId64" len %zu"
 pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
 pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
-- 
2.28.0




* [PATCH 06/16] hw/block/nvme: add support for dulbe and block utilization tracking
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

This adds support for reporting the Deallocated or Unwritten Logical
Block error (DULBE). This requires tracking the allocated/deallocated
status of all logical blocks.

Introduce a bitmap that does this. The bitmap is persisted on the new
'pstate' blockdev that is associated with a namespace. If no such drive
is attached, the controller will not indicate support for DULBE.
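
On the read path, the check then amounts to scanning the utilization
bitmap for any unwritten block in the request range. A sketch
(mirroring nvme_check_dulbe() in this patch; the check is only applied
when the host has enabled DULBE via the Error Recovery feature, bit 16
of CDW11):

    if (find_next_zero_bit(map, slba + nlb, slba) < slba + nlb) {
        return NVME_DULB;
    }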

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 docs/specs/nvme.txt   |   7 +++
 hw/block/nvme-ns.h    |  13 +++++
 include/block/nvme.h  |   5 ++
 hw/block/nvme-ns.c    | 110 ++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme.c       | 113 ++++++++++++++++++++++++++++++++++++++++--
 hw/block/trace-events |   2 +
 6 files changed, 247 insertions(+), 3 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 438ca50d698c..9a5c67f10b5d 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -12,6 +12,13 @@ nvme-ns Options
      namespace. It is specified in terms of a power of two. Only values between
      9 and 12 (both inclusive) are supported.
 
+  `pstate`; This parameter specifies another blockdev to be used for storing
+     persistent state such as logical block allocation tracking. Adding this
+     parameter enables various optional features of the device.
+
+       -drive id=pstate,file=pstate.img,format=raw
+       -device nvme-ns,pstate=pstate,...
+
 
 Reference Specifications
 ------------------------
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 78b0d1a00672..51141796909f 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,7 +31,20 @@ typedef struct NvmeNamespace {
     int64_t      size;
     NvmeIdNs     id_ns;
 
+    struct {
+        BlockBackend *blk;
+
+        struct {
+            unsigned long *map;
+            int64_t       offset;
+        } utilization;
+    } pstate;
+
     NvmeNamespaceParams params;
+
+    struct {
+        uint32_t err_rec;
+    } features;
 } NvmeNamespace;
 
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 999b4f8ae0d4..abd49d371e63 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -683,6 +683,7 @@ enum NvmeStatusCodes {
     NVME_E2E_REF_ERROR          = 0x0284,
     NVME_CMP_FAILURE            = 0x0285,
     NVME_ACCESS_DENIED          = 0x0286,
+    NVME_DULB                   = 0x0287,
     NVME_MORE                   = 0x2000,
     NVME_DNR                    = 0x4000,
     NVME_NO_COMPLETE            = 0xffff,
@@ -898,6 +899,9 @@ enum NvmeIdCtrlLpa {
 #define NVME_AEC_NS_ATTR(aec)       ((aec >> 8) & 0x1)
 #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
 
+#define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
+#define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x10000)
+
 enum NvmeFeatureIds {
     NVME_ARBITRATION                = 0x1,
     NVME_POWER_MANAGEMENT           = 0x2,
@@ -1018,6 +1022,7 @@ enum NvmeNsIdentifierType {
 
 
 #define NVME_ID_NS_NSFEAT_THIN(nsfeat)      ((nsfeat & 0x1))
+#define NVME_ID_NS_NSFEAT_DULBE(nsfeat)     ((nsfeat >> 2) & 0x1)
 #define NVME_ID_NS_FLBAS_EXTENDED(flbas)    ((flbas >> 4) & 0x1)
 #define NVME_ID_NS_FLBAS_INDEX(flbas)       ((flbas & 0xf))
 #define NVME_ID_NS_MC_SEPARATE(mc)          ((mc >> 1) & 0x1)
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 576c7486f45b..9a63004c000a 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -25,9 +25,36 @@
 #include "hw/qdev-properties.h"
 #include "hw/qdev-core.h"
 
+#include "trace.h"
+
 #include "nvme.h"
 #include "nvme-ns.h"
 
+static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp)
+{
+    int ret;
+    uint64_t perm, shared_perm;
+
+    blk_get_perm(blk, &perm, &shared_perm);
+
+    ret = blk_set_perm(blk, perm | BLK_PERM_RESIZE, shared_perm, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = blk_truncate(blk, len, false, PREALLOC_MODE_OFF, 0, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = blk_set_perm(blk, perm, shared_perm, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return 0;
+}
+
 static void nvme_ns_init(NvmeNamespace *ns)
 {
     NvmeIdNs *id_ns = &ns->id_ns;
@@ -45,6 +72,67 @@ static void nvme_ns_init(NvmeNamespace *ns)
     id_ns->nuse = id_ns->ncap;
 }
 
+static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
+{
+    BlockBackend *blk = ns->pstate.blk;
+    uint64_t perm, shared_perm;
+    ssize_t len;
+    size_t pstate_len;
+    int ret;
+
+    perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
+    shared_perm = BLK_PERM_ALL;
+
+    ret = blk_set_perm(blk, perm, shared_perm, errp);
+    if (ret) {
+        return ret;
+    }
+
+    pstate_len = ROUND_UP(DIV_ROUND_UP(nvme_ns_nlbas(ns), 8),
+                          BDRV_SECTOR_SIZE);
+
+    len = blk_getlength(blk);
+    if (len < 0) {
+        error_setg_errno(errp, -len, "could not determine pstate size");
+        return len;
+    }
+
+    unsigned long *map = bitmap_new(nvme_ns_nlbas(ns));
+    ns->pstate.utilization.offset = 0;
+
+    if (!len) {
+        ret = nvme_blk_truncate(blk, pstate_len, errp);
+        if (ret < 0) {
+            return ret;
+        }
+
+        ns->pstate.utilization.map = map;
+    } else {
+        if (len != pstate_len) {
+            error_setg(errp, "pstate size mismatch "
+                "(expected %zu bytes; was %zd bytes)",
+                pstate_len, len);
+            return -1;
+        }
+
+        ret = blk_pread(blk, 0, map, pstate_len);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "could not read pstate");
+            return ret;
+        }
+#ifdef HOST_WORDS_BIGENDIAN
+        ns->pstate.utilization.map = bitmap_new(nvme_ns_nlbas(ns));
+        bitmap_from_le(ns->pstate.utilization.map, map, nvme_ns_nlbas(ns));
+#else
+        ns->pstate.utilization.map = map;
+#endif
+
+        return 0;
+    }
+
+    return 0;
+}
+
 static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 {
     if (!blkconf_blocksizes(&ns->blkconf, errp)) {
@@ -96,6 +184,19 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
     }
 
     nvme_ns_init(ns);
+
+    if (ns->pstate.blk) {
+        if (nvme_ns_setup_blk_pstate(ns, errp)) {
+            return -1;
+        }
+
+        /*
+         * With a pstate file in place we can enable the Deallocated or
+         * Unwritten Logical Block Error feature.
+         */
+        ns->id_ns.nsfeat |= 0x4;
+    }
+
     if (nvme_register_namespace(n, ns, errp)) {
         return -1;
     }
@@ -106,11 +207,19 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 void nvme_ns_drain(NvmeNamespace *ns)
 {
     blk_drain(ns->blkconf.blk);
+
+    if (ns->pstate.blk) {
+        blk_drain(ns->pstate.blk);
+    }
 }
 
 void nvme_ns_flush(NvmeNamespace *ns)
 {
     blk_flush(ns->blkconf.blk);
+
+    if (ns->pstate.blk) {
+        blk_flush(ns->pstate.blk);
+    }
 }
 
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
@@ -131,6 +240,7 @@ static Property nvme_ns_props[] = {
     DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
     DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
     DEFINE_PROP_UINT8("lbads", NvmeNamespace, params.lbads, BDRV_SECTOR_BITS),
+    DEFINE_PROP_DRIVE("pstate", NvmeNamespace, pstate.blk),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 795c7e7c529f..b16e089bda80 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -105,6 +105,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
     [NVME_TEMPERATURE_THRESHOLD]    = NVME_FEAT_CAP_CHANGE,
+    [NVME_ERROR_RECOVERY]           = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
     [NVME_VOLATILE_WRITE_CACHE]     = NVME_FEAT_CAP_CHANGE,
     [NVME_NUMBER_OF_QUEUES]         = NVME_FEAT_CAP_CHANGE,
     [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
@@ -888,6 +889,61 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
     return NVME_SUCCESS;
 }
 
+static inline uint16_t nvme_check_dulbe(NvmeNamespace *ns, uint64_t slba,
+                                        uint32_t nlb)
+{
+    uint64_t elba = slba + nlb;
+
+    if (find_next_zero_bit(ns->pstate.utilization.map, elba, slba) < elba) {
+        return NVME_DULB;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static int nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
+{
+    int nlongs, first;
+    int64_t offset;
+    unsigned long *map, *src;
+    int ret;
+
+    if (!(ns->pstate.blk && nvme_check_dulbe(ns, slba, nlb))) {
+        return 0;
+    }
+
+    trace_pci_nvme_allocate(nvme_nsid(ns), slba, nlb);
+
+    bitmap_set(ns->pstate.utilization.map, slba, nlb);
+
+    nlongs = BITS_TO_LONGS(nlb) + 1;
+    first = slba / BITS_PER_LONG;
+    offset = ns->pstate.utilization.offset + first * sizeof(unsigned long);
+    src = ns->pstate.utilization.map;
+
+#ifdef HOST_WORDS_BIGENDIAN
+    map = g_new(unsigned long, first + nlongs);
+    for (int i = first; i < first + nlongs; i++) {
+# if HOST_LONG_BITS == 64
+        map[i] = bswap64(src[i]);
+# else
+        map[i] = bswap32(src[i]);
+# endif
+    }
+#else
+    map = src;
+#endif
+
+    ret = blk_pwrite(ns->pstate.blk, offset, &map[first],
+                     nlongs * sizeof(unsigned long), 0);
+
+#ifdef HOST_WORDS_BIGENDIAN
+    g_free(map);
+#endif
+    return ret;
+}
+
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
@@ -1006,6 +1062,7 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
     uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
     size_t len = nvme_l2b(ns, nlb);
 
+    bool is_write = nvme_req_is_write(req);
     uint16_t status;
 
     trace_pci_nvme_rwz(nvme_cid(req), nvme_io_opc_str(rw->opcode),
@@ -1017,6 +1074,16 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
         goto invalid;
     }
 
+    if (!is_write) {
+        if (NVME_ERR_REC_DULBE(ns->features.err_rec)) {
+            status = nvme_check_dulbe(ns, slba, nlb);
+            if (status) {
+                trace_pci_nvme_err_dulbe(nvme_cid(req), slba, nlb);
+                goto invalid;
+            }
+        }
+    }
+
     if (req->cmd.opcode & NVME_CMD_OPCODE_DATA_TRANSFER_MASK) {
         status = nvme_check_mdts(n, len);
         if (status) {
@@ -1030,12 +1097,18 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
         }
     }
 
+    if (is_write) {
+        if (nvme_allocate(ns, slba, nlb) < 0) {
+            status = NVME_INTERNAL_DEV_ERROR;
+            goto invalid;
+        }
+    }
+
     return nvme_do_aio(ns->blkconf.blk, nvme_l2b(ns, slba), len, req);
 
 invalid:
     block_acct_invalid(blk_get_stats(ns->blkconf.blk),
-                       nvme_req_is_write(req) ? BLOCK_ACCT_WRITE :
-                       BLOCK_ACCT_READ);
+                       is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
 
     return status;
 }
@@ -1638,6 +1711,8 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
 
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
 {
+    NvmeNamespace *ns;
+
     NvmeCmd *cmd = &req->cmd;
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
     uint32_t dw11 = le32_to_cpu(cmd->cdw11);
@@ -1708,6 +1783,18 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
         }
 
         return NVME_INVALID_FIELD | NVME_DNR;
+    case NVME_ERROR_RECOVERY:
+        if (!nvme_nsid_valid(n, nsid)) {
+            return NVME_INVALID_NSID | NVME_DNR;
+        }
+
+        ns = nvme_ns(n, nsid);
+        if (unlikely(!ns)) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
+        result = ns->features.err_rec;
+        goto out;
     case NVME_VOLATILE_WRITE_CACHE:
         result = n->features.vwc;
         trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -1780,7 +1867,7 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
 
 static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
 {
-    NvmeNamespace *ns;
+    NvmeNamespace *ns = NULL;
 
     NvmeCmd *cmd = &req->cmd;
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
@@ -1847,6 +1934,26 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
                                NVME_LOG_SMART_INFO);
         }
 
+        break;
+    case NVME_ERROR_RECOVERY:
+        if (nsid == NVME_NSID_BROADCAST) {
+            for (int i = 1; i <= n->num_namespaces; i++) {
+                ns = nvme_ns(n, i);
+
+                if (!ns) {
+                    continue;
+                }
+
+                if (NVME_ID_NS_NSFEAT_DULBE(ns->id_ns.nsfeat)) {
+                    ns->features.err_rec = dw11;
+                }
+            }
+
+            break;
+        }
+
+        assert(ns);
+        ns->features.err_rec = dw11;
         break;
     case NVME_VOLATILE_WRITE_CACHE:
         n->features.vwc = dw11 & 0x1;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index b18056c49836..774513469274 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -42,6 +42,7 @@ pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode, cons
 pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
 pci_nvme_rwz(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t len, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" len %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
+pci_nvme_allocate(uint32_t ns, uint64_t slba, uint32_t nlb) "nsid %"PRIu32" slba 0x%"PRIx64" nlb %"PRIu32""
 pci_nvme_do_aio(uint16_t cid, uint8_t opc, const char *opname, const char *blkname, int64_t offset, size_t len) "cid %"PRIu16" opc 0x%"PRIx8" opname '%s' blk '%s' offset %"PRId64" len %zu"
 pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
 pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
@@ -89,6 +90,7 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
 pci_nvme_err_req_status(uint16_t cid, uint32_t nsid, uint16_t status, uint8_t opc) "cid %"PRIu16" nsid %"PRIu32" status 0x%"PRIx16" opc 0x%"PRIx8""
+pci_nvme_err_dulbe(uint16_t cid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" slba 0x%"PRIx64" nlb %"PRIu32""
 pci_nvme_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_err_cfs(void) "controller fatal status"
-- 
2.28.0




* [PATCH 07/16] hw/block/nvme: add commands supported and effects log page
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen,
	Gollu Appalanaidu, Max Reitz, Keith Busch, Klaus Jensen

From: Gollu Appalanaidu <anaidu.gollu@samsung.com>

This adds support for the Commands Supported and Effects log page. See
NVM Express Spec 1.3d, sec. 5.14.1.5 ("Commands Supported and
Effects").
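
The log reports, per opcode, whether a command is supported and which
effects it may have. A hypothetical host-side check against the
structure added below (the log is retrieved with Get Log Page, LID
0x05):

    NvmeEffectsLog log; /* assumed filled in by Get Log Page */
    bool write_supported = le32_to_cpu(log.iocs[0x01]) & NVME_EFFECTS_CSUPP;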

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 include/block/nvme.h | 21 +++++++++++++
 hw/block/nvme.c      | 75 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index abd49d371e63..5a5e19f6bedc 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -734,6 +734,24 @@ typedef struct QEMU_PACKED NvmeSmartLog {
     uint8_t     reserved2[320];
 } NvmeSmartLog;
 
+typedef struct QEMU_PACKED NvmeEffectsLog {
+    uint32_t acs[256];
+    uint32_t iocs[256];
+    uint8_t  rsvd2048[2048];
+} NvmeEffectsLog;
+
+enum {
+    NVME_EFFECTS_CSUPP      = 1 <<  0,
+    NVME_EFFECTS_LBCC       = 1 <<  1,
+    NVME_EFFECTS_NCC        = 1 <<  2,
+    NVME_EFFECTS_NIC        = 1 <<  3,
+    NVME_EFFECTS_CCC        = 1 <<  4,
+    NVME_EFFECTS_CSE_SINGLE = 1 << 16,
+    NVME_EFFECTS_CSE_MULTI  = 1 << 17,
+    NVME_EFFECTS_CSE_MASK   = 3 << 16,
+    NVME_EFFECTS_UUID_SEL   = 1 << 19,
+};
+
 enum NvmeSmartWarn {
     NVME_SMART_SPARE                  = 1 << 0,
     NVME_SMART_TEMPERATURE            = 1 << 1,
@@ -746,6 +764,7 @@ enum NvmeLogIdentifier {
     NVME_LOG_ERROR_INFO     = 0x01,
     NVME_LOG_SMART_INFO     = 0x02,
     NVME_LOG_FW_SLOT_INFO   = 0x03,
+    NVME_LOG_EFFECTS        = 0x05,
 };
 
 typedef struct QEMU_PACKED NvmePSD {
@@ -857,6 +876,7 @@ enum NvmeIdCtrlFrmw {
 };
 
 enum NvmeIdCtrlLpa {
+    NVME_LPA_EFFECTS_LOG  = 1 << 1,
     NVME_LPA_EXTENDED = 1 << 2,
 };
 
@@ -1064,5 +1084,6 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
 }
 #endif
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b16e089bda80..21f558f75559 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -81,6 +81,7 @@
 #define NVME_TEMPERATURE_WARNING 0x157
 #define NVME_TEMPERATURE_CRITICAL 0x175
 #define NVME_NUM_FW_SLOTS 1
+#define NVME_MAX_ADM_IO_CMDS 0xFF
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
     do { \
@@ -112,6 +113,46 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
     [NVME_TIMESTAMP]                = NVME_FEAT_CAP_CHANGE,
 };
 
+#define NVME_EFFECTS_ADMIN_INITIALIZER                 \
+    [NVME_ADM_CMD_DELETE_SQ]    = NVME_EFFECTS_CSUPP,  \
+    [NVME_ADM_CMD_CREATE_SQ]    = NVME_EFFECTS_CSUPP,  \
+    [NVME_ADM_CMD_GET_LOG_PAGE] = NVME_EFFECTS_CSUPP,  \
+    [NVME_ADM_CMD_DELETE_CQ]    = NVME_EFFECTS_CSUPP,  \
+    [NVME_ADM_CMD_CREATE_CQ]    = NVME_EFFECTS_CSUPP,  \
+    [NVME_ADM_CMD_IDENTIFY]     = NVME_EFFECTS_CSUPP,  \
+    [NVME_ADM_CMD_ABORT]        = NVME_EFFECTS_CSUPP,  \
+    [NVME_ADM_CMD_SET_FEATURES] = NVME_EFFECTS_CSUPP | \
+                                  NVME_EFFECTS_CCC |   \
+                                  NVME_EFFECTS_NIC |   \
+                                  NVME_EFFECTS_NCC,    \
+    [NVME_ADM_CMD_GET_FEATURES] = NVME_EFFECTS_CSUPP,  \
+    [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_EFFECTS_CSUPP
+
+#define NVME_EFFECTS_NVM_INITIALIZER                   \
+    [NVME_CMD_FLUSH]            = NVME_EFFECTS_CSUPP | \
+                                  NVME_EFFECTS_LBCC,   \
+    [NVME_CMD_WRITE]            = NVME_EFFECTS_CSUPP | \
+                                  NVME_EFFECTS_LBCC,   \
+    [NVME_CMD_READ]             = NVME_EFFECTS_CSUPP,  \
+    [NVME_CMD_WRITE_ZEROES]     = NVME_EFFECTS_CSUPP | \
+                                  NVME_EFFECTS_LBCC
+
+static const NvmeEffectsLog nvme_effects_admin_only = {
+    .acs = {
+        NVME_EFFECTS_ADMIN_INITIALIZER,
+    },
+};
+
+static const NvmeEffectsLog nvme_effects = {
+    .acs = {
+        NVME_EFFECTS_ADMIN_INITIALIZER,
+    },
+
+    .iocs = {
+        NVME_EFFECTS_NVM_INITIALIZER,
+    },
+};
+
 static void nvme_process_sq(void *opaque);
 
 static uint16_t nvme_cid(NvmeRequest *req)
@@ -1365,6 +1406,36 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_effects_log(NvmeCtrl *n, uint32_t buf_len, uint64_t off,
+                                 NvmeRequest *req)
+{
+    const NvmeEffectsLog *effects;
+
+    uint32_t trans_len;
+
+    if (off > sizeof(NvmeEffectsLog)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    switch (NVME_CC_CSS(n->bar.cc)) {
+    case NVME_CC_CSS_ADMIN_ONLY:
+        effects = &nvme_effects_admin_only;
+        break;
+
+    case NVME_CC_CSS_NVM:
+        effects = &nvme_effects;
+        break;
+
+    default:
+        return NVME_INTERNAL_DEV_ERROR | NVME_DNR;
+    }
+
+    trans_len = MIN(sizeof(NvmeEffectsLog) - off, buf_len);
+
+    return nvme_dma(n, (uint8_t *)effects + off, trans_len,
+                    DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCmd *cmd = &req->cmd;
@@ -1408,6 +1479,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
         return nvme_smart_info(n, rae, len, off, req);
     case NVME_LOG_FW_SLOT_INFO:
         return nvme_fw_log_info(n, len, off, req);
+    case NVME_LOG_EFFECTS:
+        return nvme_effects_log(n, len, off, req);
     default:
         trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -2841,7 +2914,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->acl = 3;
     id->aerl = n->params.aerl;
     id->frmw = (NVME_NUM_FW_SLOTS << 1) | NVME_FRMW_SLOT1_RO;
-    id->lpa = NVME_LPA_EXTENDED;
+    id->lpa = NVME_LPA_EFFECTS_LOG | NVME_LPA_EXTENDED;
 
     /* recommended default value (~70 C) */
     id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
-- 
2.28.0




* [PATCH 08/16] hw/block/nvme: support namespace types
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Implement support for TP 4056 ("Namespace Types"). This adds the 'iocs'
(I/O Command Set) device parameter to the nvme-ns device.
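
For example (a hypothetical invocation; 'nvm' is an assumed drive id):

    -device nvme-ns,drive=nvm,iocs=0x2

would select the Zoned Namespace Command Set for the namespace (0x0,
the default, selects the NVM Command Set; 0x2 only becomes valid once
the zoned patches later in this series are applied).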

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 docs/specs/nvme.txt   |   3 +
 hw/block/nvme-ns.h    |  11 +-
 hw/block/nvme.h       |   3 +
 include/block/nvme.h  |  58 +++++++++--
 block/nvme.c          |   4 +-
 hw/block/nvme-ns.c    |  21 +++-
 hw/block/nvme.c       | 228 ++++++++++++++++++++++++++++++++++--------
 hw/block/trace-events |   6 +-
 8 files changed, 275 insertions(+), 59 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 9a5c67f10b5d..890990510b14 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -19,6 +19,9 @@ nvme-ns Options
        -drive id=pstate,file=pstate.img,format=raw
        -device nvme-ns,pstate=pstate,...
 
+  `iocs`; The "I/O Command Set" associated with the namespace. E.g. 0x0 for the
+     NVM Command Set (the default), or 0x2 for the Zoned Namespace Command Set.
+
 
 Reference Specifications
 ------------------------
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 51141796909f..06c0bc8b32a4 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -21,6 +21,7 @@
 
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
+    uint8_t  iocs;
     uint8_t  lbads;
 } NvmeNamespaceParams;
 
@@ -29,7 +30,8 @@ typedef struct NvmeNamespace {
     BlockConf    blkconf;
     int32_t      bootindex;
     int64_t      size;
-    NvmeIdNs     id_ns;
+    uint8_t      iocs;
+    void         *id_ns[NVME_IOCS_MAX];
 
     struct {
         BlockBackend *blk;
@@ -56,9 +58,14 @@ static inline uint32_t nvme_nsid(NvmeNamespace *ns)
     return -1;
 }
 
+static inline NvmeIdNsNvm *nvme_ns_id_nvm(NvmeNamespace *ns)
+{
+    return ns->id_ns[NVME_IOCS_NVM];
+}
+
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
 {
-    NvmeIdNs *id_ns = &ns->id_ns;
+    NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
     return &id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
 }
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index ccf52ac7bb82..f66ed9ab7eff 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -123,6 +123,7 @@ typedef struct NvmeFeatureVal {
     };
     uint32_t    async_config;
     uint32_t    vwc;
+    uint32_t    iocsci;
 } NvmeFeatureVal;
 
 typedef struct NvmeCtrl {
@@ -150,6 +151,7 @@ typedef struct NvmeCtrl {
     uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
     uint64_t    starttime_ms;
     uint16_t    temperature;
+    uint64_t    iocscs[512];
 
     HostMemoryBackend *pmrdev;
 
@@ -165,6 +167,7 @@ typedef struct NvmeCtrl {
     NvmeSQueue      admin_sq;
     NvmeCQueue      admin_cq;
     NvmeIdCtrl      id_ctrl;
+    void            *id_ctrl_iocss[NVME_IOCS_MAX];
     NvmeFeatureVal  features;
 } NvmeCtrl;
 
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 5a5e19f6bedc..792fccf8c81f 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -82,6 +82,11 @@ enum NvmeCapMask {
 #define NVME_CAP_SET_PMRS(cap, val) (cap |= (uint64_t)(val & CAP_PMR_MASK)\
                                                             << CAP_PMR_SHIFT)
 
+enum NvmeCapCss {
+    NVME_CAP_CSS_NVM = 1 << 0,
+    NVME_CAP_CSS_CSI = 1 << 6,
+};
+
 enum NvmeCcShift {
     CC_EN_SHIFT     = 0,
     CC_CSS_SHIFT    = 4,
@@ -112,6 +117,7 @@ enum NvmeCcMask {
 
 enum NvmeCcCss {
     NVME_CC_CSS_NVM        = 0x0,
+    NVME_CC_CSS_ALL        = 0x6,
     NVME_CC_CSS_ADMIN_ONLY = 0x7,
 };
 
@@ -383,6 +389,11 @@ enum NvmePmrmscMask {
 #define NVME_PMRMSC_SET_CBA(pmrmsc, val)   \
     (pmrmsc |= (uint64_t)(val & PMRMSC_CBA_MASK) << PMRMSC_CBA_SHIFT)
 
+enum NvmeCommandSet {
+    NVME_IOCS_NVM = 0x0,
+    NVME_IOCS_MAX = 0x1,
+};
+
 enum NvmeSglDescriptorType {
     NVME_SGL_DESCR_TYPE_DATA_BLOCK          = 0x0,
     NVME_SGL_DESCR_TYPE_BIT_BUCKET          = 0x1,
@@ -531,8 +542,13 @@ typedef struct QEMU_PACKED NvmeIdentify {
     uint64_t    rsvd2[2];
     uint64_t    prp1;
     uint64_t    prp2;
-    uint32_t    cns;
-    uint32_t    rsvd11[5];
+    uint8_t     cns;
+    uint8_t     rsvd3;
+    uint16_t    cntid;
+    uint16_t    nvmsetid;
+    uint8_t     rsvd4;
+    uint8_t     csi;
+    uint32_t    rsvd11[4];
 } NvmeIdentify;
 
 typedef struct QEMU_PACKED NvmeRwCmd {
@@ -624,8 +640,15 @@ typedef struct QEMU_PACKED NvmeAerResult {
 } NvmeAerResult;
 
 typedef struct QEMU_PACKED NvmeCqe {
-    uint32_t    result;
-    uint32_t    rsvd;
+    union {
+        struct {
+            uint32_t    dw0;
+            uint32_t    dw1;
+        };
+
+        uint64_t qw0;
+    };
+
     uint16_t    sq_head;
     uint16_t    sq_id;
     uint16_t    cid;
@@ -673,6 +696,10 @@ enum NvmeStatusCodes {
     NVME_FEAT_NOT_CHANGEABLE    = 0x010e,
     NVME_FEAT_NOT_NS_SPEC       = 0x010f,
     NVME_FW_REQ_SUSYSTEM_RESET  = 0x0110,
+    NVME_INVALID_IOCS           = 0x0126,
+    NVME_IOCS_NOT_SUPPORTED     = 0x0127,
+    NVME_IOCS_NOT_ENABLED       = 0x0128,
+    NVME_IOCS_COMB_REJECTED     = 0x0129,
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
@@ -734,6 +761,8 @@ typedef struct QEMU_PACKED NvmeSmartLog {
     uint8_t     reserved2[320];
 } NvmeSmartLog;
 
+#define NVME_ACS_MAX 256
+
 typedef struct QEMU_PACKED NvmeEffectsLog {
     uint32_t acs[256];
     uint32_t iocs[256];
@@ -782,10 +811,14 @@ typedef struct QEMU_PACKED NvmePSD {
 #define NVME_IDENTIFY_DATA_SIZE 4096
 
 enum {
-    NVME_ID_CNS_NS             = 0x0,
-    NVME_ID_CNS_CTRL           = 0x1,
-    NVME_ID_CNS_NS_ACTIVE_LIST = 0x2,
-    NVME_ID_CNS_NS_DESCR_LIST  = 0x3,
+    NVME_ID_CNS_NS                  = 0x00,
+    NVME_ID_CNS_CTRL                = 0x01,
+    NVME_ID_CNS_NS_ACTIVE_LIST      = 0x02,
+    NVME_ID_CNS_NS_DESCR_LIST       = 0x03,
+    NVME_ID_CNS_NS_IOCS             = 0x05,
+    NVME_ID_CNS_CTRL_IOCS           = 0x06,
+    NVME_ID_CNS_NS_ACTIVE_LIST_IOCS = 0x07,
+    NVME_ID_CNS_IOCS                = 0x1c,
 };
 
 typedef struct QEMU_PACKED NvmeIdCtrl {
@@ -935,6 +968,7 @@ enum NvmeFeatureIds {
     NVME_WRITE_ATOMICITY            = 0xa,
     NVME_ASYNCHRONOUS_EVENT_CONF    = 0xb,
     NVME_TIMESTAMP                  = 0xe,
+    NVME_COMMAND_SET_PROFILE        = 0x19,
     NVME_SOFTWARE_PROGRESS_MARKER   = 0x80,
     NVME_FID_MAX                    = 0x100,
 };
@@ -983,7 +1017,7 @@ typedef struct QEMU_PACKED NvmeLBAF {
 
 #define NVME_NSID_BROADCAST 0xffffffff
 
-typedef struct QEMU_PACKED NvmeIdNs {
+typedef struct QEMU_PACKED NvmeIdNsNvm {
     uint64_t    nsze;
     uint64_t    ncap;
     uint64_t    nuse;
@@ -1011,7 +1045,7 @@ typedef struct QEMU_PACKED NvmeIdNs {
     NvmeLBAF    lbaf[16];
     uint8_t     rsvd192[192];
     uint8_t     vs[3712];
-} NvmeIdNs;
+} NvmeIdNsNvm;
 
 typedef struct QEMU_PACKED NvmeIdNsDescr {
     uint8_t nidt;
@@ -1023,12 +1057,14 @@ enum {
     NVME_NIDT_EUI64_LEN =  8,
     NVME_NIDT_NGUID_LEN = 16,
     NVME_NIDT_UUID_LEN  = 16,
+    NVME_NIDT_CSI_LEN   =  1,
 };
 
 enum NvmeNsIdentifierType {
     NVME_NIDT_EUI64 = 0x1,
     NVME_NIDT_NGUID = 0x2,
     NVME_NIDT_UUID  = 0x3,
+    NVME_NIDT_CSI   = 0x4,
 };
 
 /*Deallocate Logical Block Features*/
@@ -1081,7 +1117,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
-    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsNvm) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
diff --git a/block/nvme.c b/block/nvme.c
index f4f27b6da7d8..b5cc0185c298 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -336,7 +336,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
 {
     uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
     if (status) {
-        trace_nvme_error(le32_to_cpu(c->result),
+        trace_nvme_error(le32_to_cpu(c->dw0),
                          le16_to_cpu(c->sq_head),
                          le16_to_cpu(c->sq_id),
                          le16_to_cpu(c->cid),
@@ -503,7 +503,7 @@ static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
     BDRVNVMeState *s = bs->opaque;
     union {
         NvmeIdCtrl ctrl;
-        NvmeIdNs ns;
+        NvmeIdNsNvm ns;
     } *id;
     NvmeLBAF *lbaf;
     uint16_t oncs;
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 9a63004c000a..68b2ce3d17c5 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -57,10 +57,15 @@ static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp)
 
 static void nvme_ns_init(NvmeNamespace *ns)
 {
-    NvmeIdNs *id_ns = &ns->id_ns;
+    NvmeIdNsNvm *id_ns;
+
+    ns->id_ns[NVME_IOCS_NVM] = g_new0(NvmeIdNsNvm, 1);
+    id_ns = nvme_ns_id_nvm(ns);
+
+    ns->iocs = ns->params.iocs;
 
     if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
-        ns->id_ns.dlfeat = 0x9;
+        id_ns->dlfeat = 0x9;
     }
 
     id_ns->lbaf[0].ds = ns->params.lbads;
@@ -170,6 +175,14 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
+    switch (ns->params.iocs) {
+    case NVME_IOCS_NVM:
+        break;
+    default:
+        error_setg(errp, "unsupported iocs");
+        return -1;
+    }
+
     return 0;
 }
 
@@ -194,7 +207,8 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
          * With a pstate file in place we can enable the Deallocated or
          * Unwritten Logical Block Error feature.
          */
-        ns->id_ns.nsfeat |= 0x4;
+        NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
+        id_ns->nsfeat |= 0x4;
     }
 
     if (nvme_register_namespace(n, ns, errp)) {
@@ -241,6 +255,7 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
     DEFINE_PROP_UINT8("lbads", NvmeNamespace, params.lbads, BDRV_SECTOR_BITS),
     DEFINE_PROP_DRIVE("pstate", NvmeNamespace, pstate.blk),
+    DEFINE_PROP_UINT8("iocs", NvmeNamespace, params.iocs, NVME_IOCS_NVM),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 21f558f75559..f5578ea3bd63 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -102,6 +102,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
     [NVME_WRITE_ATOMICITY]          = true,
     [NVME_ASYNCHRONOUS_EVENT_CONF]  = true,
     [NVME_TIMESTAMP]                = true,
+    [NVME_COMMAND_SET_PROFILE]      = true,
 };
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
@@ -111,6 +112,7 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
     [NVME_NUMBER_OF_QUEUES]         = NVME_FEAT_CAP_CHANGE,
     [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
     [NVME_TIMESTAMP]                = NVME_FEAT_CAP_CHANGE,
+    [NVME_COMMAND_SET_PROFILE]      = NVME_FEAT_CAP_CHANGE,
 };
 
 #define NVME_EFFECTS_ADMIN_INITIALIZER                 \
@@ -143,13 +145,15 @@ static const NvmeEffectsLog nvme_effects_admin_only = {
     },
 };
 
-static const NvmeEffectsLog nvme_effects = {
-    .acs = {
-        NVME_EFFECTS_ADMIN_INITIALIZER,
-    },
+static const NvmeEffectsLog nvme_effects[NVME_IOCS_MAX] = {
+    [NVME_IOCS_NVM] = {
+        .acs = {
+            NVME_EFFECTS_ADMIN_INITIALIZER,
+        },
 
-    .iocs = {
-        NVME_EFFECTS_NVM_INITIALIZER,
+        .iocs = {
+            NVME_EFFECTS_NVM_INITIALIZER,
+        },
     },
 };
 
@@ -861,7 +865,7 @@ static void nvme_process_aers(void *opaque)
 
         req = n->aer_reqs[n->outstanding_aers];
 
-        result = (NvmeAerResult *) &req->cqe.result;
+        result = (NvmeAerResult *) &req->cqe.dw0;
         result->event_type = event->result.event_type;
         result->event_info = event->result.event_info;
         result->log_page = event->result.log_page;
@@ -921,7 +925,8 @@ static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len)
 static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
                                          uint64_t slba, uint32_t nlb)
 {
-    uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
+    NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
+    uint64_t nsze = le64_to_cpu(id_ns->nsze);
 
     if (unlikely(UINT64_MAX - slba < nlb || slba + nlb > nsze)) {
         return NVME_LBA_RANGE | NVME_DNR;
@@ -1111,7 +1116,8 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
 
     status = nvme_check_bounds(n, ns, slba, nlb);
     if (status) {
-        trace_pci_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
+        NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
+        trace_pci_nvme_err_invalid_lba_range(slba, nlb, id_ns->nsze);
         goto invalid;
     }
 
@@ -1157,9 +1163,10 @@ invalid:
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
     uint32_t nsid = le32_to_cpu(req->cmd.nsid);
+    uint8_t opc = req->cmd.opcode;
 
     trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
-                          req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
+                          opc, nvme_io_opc_str(opc));
 
     if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -1174,6 +1181,10 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
+    if (!(nvme_effects[req->ns->iocs].iocs[opc] & NVME_EFFECTS_CSUPP)) {
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
     switch (req->cmd.opcode) {
     case NVME_CMD_FLUSH:
         return nvme_flush(n, req);
@@ -1412,6 +1423,7 @@ static uint16_t nvme_effects_log(NvmeCtrl *n, uint32_t buf_len, uint64_t off,
     const NvmeEffectsLog *effects;
 
     uint32_t trans_len;
+    uint8_t csi = le32_to_cpu(req->cmd.cdw14) >> 24;
 
     if (off > sizeof(NvmeEffectsLog)) {
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1423,7 +1435,17 @@ static uint16_t nvme_effects_log(NvmeCtrl *n, uint32_t buf_len, uint64_t off,
         break;
 
     case NVME_CC_CSS_NVM:
-        effects = &nvme_effects;
+        effects = &nvme_effects[NVME_IOCS_NVM];
+        break;
+
+    case NVME_CC_CSS_ALL:
+        if (csi >= NVME_IOCS_MAX ||
+            !(n->iocscs[n->features.iocsci] & (1 << csi))) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
+        effects = &nvme_effects[csi];
         break;
 
     default:
@@ -1593,39 +1615,94 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ctrl(NvmeCtrl *n, uint8_t cns, uint8_t csi,
+                                   NvmeRequest *req)
 {
+    NvmeIdCtrl empty = { 0 };
+    NvmeIdCtrl *id_ctrl = &empty;
+
     trace_pci_nvme_identify_ctrl();
 
-    return nvme_dma(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
+    switch (cns) {
+    case NVME_ID_CNS_CTRL:
+        id_ctrl = &n->id_ctrl;
+
+        break;
+
+    case NVME_ID_CNS_CTRL_IOCS:
+        if (csi >= NVME_IOCS_MAX ||
+            !(n->iocscs[n->features.iocsci] & (1 << csi))) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
+        if (n->id_ctrl_iocss[csi]) {
+            id_ctrl = n->id_ctrl_iocss[csi];
+        }
+
+        break;
+
+    default:
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    return nvme_dma(n, (uint8_t *)id_ctrl, sizeof(*id_ctrl),
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, uint8_t cns, uint8_t csi,
+                                 NvmeRequest *req)
 {
+    NvmeIdNsNvm empty = { 0 };
+    void *id_ns = &empty;
     NvmeNamespace *ns;
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
-    NvmeIdNs *id_ns, inactive = { 0 };
     uint32_t nsid = le32_to_cpu(c->nsid);
 
-    trace_pci_nvme_identify_ns(nsid);
+    trace_pci_nvme_identify_ns(nsid, csi);
 
     if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
     ns = nvme_ns(n, nsid);
-    if (unlikely(!ns)) {
-        id_ns = &inactive;
-    } else {
-        id_ns = &ns->id_ns;
+    if (ns) {
+        switch (cns) {
+        case NVME_ID_CNS_NS:
+            id_ns = ns->id_ns[NVME_IOCS_NVM];
+            if (!id_ns) {
+                return NVME_INVALID_IOCS | NVME_DNR;
+            }
+
+            break;
+
+        case NVME_ID_CNS_NS_IOCS:
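+            /*
+             * The NVM Command Set has no command set specific namespace
+             * data structure; leave id_ns pointing at the zeroed 'empty'
+             * structure so all-zeroes is returned.
+             */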
+            if (csi == NVME_IOCS_NVM) {
+                break;
+            }
+
+            if (csi >= NVME_IOCS_MAX) {
+                return NVME_INVALID_FIELD | NVME_DNR;
+            }
+
+            id_ns = ns->id_ns[csi];
+            if (!id_ns) {
+                return NVME_INVALID_FIELD | NVME_DNR;
+            }
+
+            break;
+
+        default:
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
     }
 
-    return nvme_dma(n, (uint8_t *)id_ns, sizeof(NvmeIdNs),
+    return nvme_dma(n, (uint8_t *)id_ns, NVME_IDENTIFY_DATA_SIZE,
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist(NvmeCtrl *n, uint8_t cns, uint8_t csi,
+                                     NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
     static const int data_len = NVME_IDENTIFY_DATA_SIZE;
@@ -1634,7 +1711,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
     uint16_t ret;
     int j = 0;
 
-    trace_pci_nvme_identify_nslist(min_nsid);
+    trace_pci_nvme_identify_nslist(min_nsid, csi);
 
     /*
      * Both 0xffffffff (NVME_NSID_BROADCAST) and 0xfffffffe are invalid values
@@ -1646,11 +1723,21 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
+    if (cns == NVME_ID_CNS_NS_ACTIVE_LIST_IOCS && !csi) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
     list = g_malloc0(data_len);
     for (int i = 1; i <= n->num_namespaces; i++) {
-        if (i <= min_nsid || !nvme_ns(n, i)) {
+        NvmeNamespace *ns = nvme_ns(n, i);
+        if (i <= min_nsid || !ns) {
             continue;
         }
+
+        if (cns == NVME_ID_CNS_NS_ACTIVE_LIST_IOCS && csi && csi != ns->iocs) {
+            continue;
+        }
+
         list[j++] = cpu_to_le32(i);
         if (j == data_len / sizeof(uint32_t)) {
             break;
@@ -1666,6 +1753,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
     uint32_t nsid = le32_to_cpu(c->nsid);
+    NvmeNamespace *ns;
     uint8_t list[NVME_IDENTIFY_DATA_SIZE];
 
     struct data {
@@ -1673,6 +1761,11 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
             NvmeIdNsDescr hdr;
             uint8_t v[16];
         } uuid;
+
+        struct {
+            NvmeIdNsDescr hdr;
+            uint8_t v;
+        } iocs;
     };
 
     struct data *ns_descrs = (struct data *)list;
@@ -1683,7 +1776,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
-    if (unlikely(!nvme_ns(n, nsid))) {
+    ns = nvme_ns(n, nsid);
+    if (unlikely(!ns)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -1699,25 +1793,45 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
     ns_descrs->uuid.hdr.nidl = NVME_NIDT_UUID_LEN;
     stl_be_p(&ns_descrs->uuid.v, nsid);
 
+    ns_descrs->iocs.hdr.nidt = NVME_NIDT_CSI;
+    ns_descrs->iocs.hdr.nidl = NVME_NIDT_CSI_LEN;
+    stb_p(&ns_descrs->iocs.v, ns->iocs);
+
     return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_identify_iocs(NvmeCtrl *n, uint16_t cntid,
+                                   NvmeRequest *req)
+{
+    return nvme_dma(n, (uint8_t *)n->iocscs, sizeof(n->iocscs),
+                    DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
 
+    trace_pci_nvme_identify(nvme_cid(req), le32_to_cpu(c->nsid),
+                            le16_to_cpu(c->cntid), c->cns, c->csi,
+                            le16_to_cpu(c->nvmsetid));
+
     switch (le32_to_cpu(c->cns)) {
     case NVME_ID_CNS_NS:
-        return nvme_identify_ns(n, req);
+    case NVME_ID_CNS_NS_IOCS:
+        return nvme_identify_ns(n, c->cns, c->csi, req);
     case NVME_ID_CNS_CTRL:
-        return nvme_identify_ctrl(n, req);
+    case NVME_ID_CNS_CTRL_IOCS:
+        return nvme_identify_ctrl(n, c->cns, c->csi, req);
     case NVME_ID_CNS_NS_ACTIVE_LIST:
-        return nvme_identify_nslist(n, req);
+    case NVME_ID_CNS_NS_ACTIVE_LIST_IOCS:
+        return nvme_identify_nslist(n, c->cns, c->csi, req);
     case NVME_ID_CNS_NS_DESCR_LIST:
         return nvme_identify_ns_descr_list(n, req);
+    case NVME_ID_CNS_IOCS:
+        return nvme_identify_iocs(n, c->cntid, req);
     default:
-        trace_pci_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
+        trace_pci_nvme_err_invalid_identify_cns(c->cns);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 }
@@ -1726,7 +1840,7 @@ static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
 {
     uint16_t sqid = le32_to_cpu(req->cmd.cdw10) & 0xffff;
 
-    req->cqe.result = 1;
+    req->cqe.dw0 = 1;
     if (nvme_check_sqid(n, sqid)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
@@ -1911,6 +2025,9 @@ defaults:
             result |= NVME_INTVC_NOCOALESCING;
         }
 
+        break;
+    case NVME_COMMAND_SET_PROFILE:
+        result = cpu_to_le32(n->features.iocsci & 0x1ff);
         break;
     default:
         result = nvme_feature_default[fid];
@@ -1918,7 +2035,8 @@ defaults:
     }
 
 out:
-    req->cqe.result = cpu_to_le32(result);
+    req->cqe.dw0 = cpu_to_le32(result);
+
     return NVME_SUCCESS;
 }
 
@@ -1941,6 +2059,7 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
 static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeNamespace *ns = NULL;
+    NvmeIdNsNvm *id_ns;
 
     NvmeCmd *cmd = &req->cmd;
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
@@ -2017,7 +2136,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
                     continue;
                 }
 
-                if (NVME_ID_NS_NSFEAT_DULBE(ns->id_ns.nsfeat)) {
+                id_ns = nvme_ns_id_nvm(ns);
+                if (NVME_ID_NS_NSFEAT_DULBE(id_ns->nsfeat)) {
                     ns->features.err_rec = dw11;
                 }
             }
@@ -2033,6 +2153,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
 
         for (int i = 1; i <= n->num_namespaces; i++) {
             ns = nvme_ns(n, i);
+
             if (!ns) {
                 continue;
             }
@@ -2063,14 +2184,34 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
                                     ((dw11 >> 16) & 0xFFFF) + 1,
                                     n->params.max_ioqpairs,
                                     n->params.max_ioqpairs);
-        req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-                                      ((n->params.max_ioqpairs - 1) << 16));
+        req->cqe.dw0 = cpu_to_le32((n->params.max_ioqpairs - 1) |
+                                   ((n->params.max_ioqpairs - 1) << 16));
         break;
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
         break;
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, req);
+    case NVME_COMMAND_SET_PROFILE:
+        if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ALL) {
+            uint16_t iocsci = dw11 & 0x1ff;
+            uint64_t iocsc = n->iocscs[iocsci];
+
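+            /*
+             * Reject the proposed I/O command set combination if an
+             * attached namespace uses a command set that the combination
+             * does not enable.
+             */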
+            for (int i = 1; i <= n->num_namespaces; i++) {
+                ns = nvme_ns(n, i);
+                if (!ns) {
+                    continue;
+                }
+
+                if (!(iocsc & (1 << ns->iocs))) {
+                    return NVME_IOCS_COMB_REJECTED | NVME_DNR;
+                }
+            }
+
+            n->features.iocsci = iocsci;
+        }
+
+        break;
     default:
         return NVME_FEAT_NOT_CHANGEABLE | NVME_DNR;
     }
@@ -2217,6 +2358,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
     uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
     uint32_t page_size = 1 << page_bits;
 
+    NvmeIdCtrl *id_ctrl = &n->id_ctrl;
+
     if (unlikely(n->cq[0])) {
         trace_pci_nvme_err_startfail_cq();
         return -1;
@@ -2256,28 +2399,28 @@ static int nvme_start_ctrl(NvmeCtrl *n)
         return -1;
     }
     if (unlikely(NVME_CC_IOCQES(n->bar.cc) <
-                 NVME_CTRL_CQES_MIN(n->id_ctrl.cqes))) {
+                 NVME_CTRL_CQES_MIN(id_ctrl->cqes))) {
         trace_pci_nvme_err_startfail_cqent_too_small(
                     NVME_CC_IOCQES(n->bar.cc),
                     NVME_CTRL_CQES_MIN(n->bar.cap));
         return -1;
     }
     if (unlikely(NVME_CC_IOCQES(n->bar.cc) >
-                 NVME_CTRL_CQES_MAX(n->id_ctrl.cqes))) {
+                 NVME_CTRL_CQES_MAX(id_ctrl->cqes))) {
         trace_pci_nvme_err_startfail_cqent_too_large(
                     NVME_CC_IOCQES(n->bar.cc),
                     NVME_CTRL_CQES_MAX(n->bar.cap));
         return -1;
     }
     if (unlikely(NVME_CC_IOSQES(n->bar.cc) <
-                 NVME_CTRL_SQES_MIN(n->id_ctrl.sqes))) {
+                 NVME_CTRL_SQES_MIN(id_ctrl->sqes))) {
         trace_pci_nvme_err_startfail_sqent_too_small(
                     NVME_CC_IOSQES(n->bar.cc),
                     NVME_CTRL_SQES_MIN(n->bar.cap));
         return -1;
     }
     if (unlikely(NVME_CC_IOSQES(n->bar.cc) >
-                 NVME_CTRL_SQES_MAX(n->id_ctrl.sqes))) {
+                 NVME_CTRL_SQES_MAX(id_ctrl->sqes))) {
         trace_pci_nvme_err_startfail_sqent_too_large(
                     NVME_CC_IOSQES(n->bar.cc),
                     NVME_CTRL_SQES_MAX(n->bar.cap));
@@ -2738,6 +2881,8 @@ static void nvme_init_state(NvmeCtrl *n)
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
+    n->iocscs[0] = 1 << NVME_IOCS_NVM;
+    n->features.iocsci = 0;
 }
 
 int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
@@ -2942,7 +3087,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
     NVME_CAP_SET_CQR(n->bar.cap, 1);
     NVME_CAP_SET_TO(n->bar.cap, 0xf);
-    NVME_CAP_SET_CSS(n->bar.cap, 1);
+    NVME_CAP_SET_CSS(n->bar.cap, (NVME_CAP_CSS_NVM | NVME_CAP_CSS_CSI));
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
     n->bar.vs = NVME_SPEC_VER;
@@ -3002,6 +3147,11 @@ static void nvme_exit(PCIDevice *pci_dev)
     if (n->pmrdev) {
         host_memory_backend_set_mapped(n->pmrdev, false);
     }
+
+    for (int i = 0; i < NVME_IOCS_MAX; i++) {
+        g_free(n->id_ctrl_iocss[i]);
+    }
+
     msix_uninit_exclusive_bar(pci_dev);
 }
 
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 774513469274..b002eb7c8a5c 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -48,10 +48,12 @@ pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize,
 pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
 pci_nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
 pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, cqid=%"PRIu16""
+pci_nvme_identify(uint16_t cid, uint32_t nsid, uint16_t cntid, uint8_t cns, uint8_t csi, uint16_t nvmsetid) "cid %"PRIu16" nsid %"PRIu32" cntid 0x%"PRIx16" cns 0x%"PRIx8" csi 0x%"PRIx8" nvmsetid %"PRIu16""
 pci_nvme_identify_ctrl(void) "identify controller"
-pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
-pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_ns(uint32_t ns, uint8_t csi) "nsid %"PRIu32" csi 0x%"PRIx8""
+pci_nvme_identify_nslist(uint32_t ns, uint8_t csi) "nsid %"PRIu32" csi 0x%"PRIx8""
 pci_nvme_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_io_cmd_set(uint16_t cid) "cid %"PRIu16""
 pci_nvme_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
 pci_nvme_getfeat(uint16_t cid, uint32_t nsid, uint8_t fid, uint8_t sel, uint32_t cdw11) "cid %"PRIu16" nsid 0x%"PRIx32" fid 0x%"PRIx8" sel 0x%"PRIx8" cdw11 0x%"PRIx32""
 pci_nvme_setfeat(uint16_t cid, uint32_t nsid, uint8_t fid, uint8_t save, uint32_t cdw11) "cid %"PRIu16" nsid 0x%"PRIx32" fid 0x%"PRIx8" save 0x%"PRIx8" cdw11 0x%"PRIx32""
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 09/16] hw/block/nvme: add basic read/write for zoned namespaces
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (7 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 08/16] hw/block/nvme: support namespace types Klaus Jensen
@ 2020-09-24 20:45 ` Klaus Jensen
  2020-09-24 20:45 ` [PATCH 10/16] hw/block/nvme: add the zone management receive command Klaus Jensen
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

This adds basic read and write for zoned namespaces.

A zoned namespace is created by setting the iocs namespace parameter
to 0x2 and specifying the zone capacity in logical blocks through the
zns.zcap parameter. If a zone size (zns.zsze) is not specified, the
namespace device uses the zone capacity rounded up to a power of two
as the zone size and fits as many zones as possible on the underlying
namespace blockdev. This behavior is not required by the
specification, but it ensures that the device can be initialized by
the Linux kernel nvme driver, which requires a power-of-two zone size.
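
As an example (illustrative numbers, assuming the default 512 byte
logical blocks), zns.zcap=24576 results in a zone size of
pow2ceil(24576) = 32768 logical blocks; on a 16 GiB blockdev this
yields 33554432 / 32768 = 1024 zones and a namespace capacity (NCAP)
of 1024 * 24576 = 25165824 logical blocks.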

If the namespace has an associated 'pstate' blockdev, it will be used
to store the zone states persistently. Only zone state changes are
persisted; that is, a zone write pointer update is only persistent if
the zone is explicitly closed. On boot-up, any zone that was left in
an Opened state will be transitioned to Full.
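
As a sketch, a zoned namespace backed by a persistent state blockdev
can be configured like so (illustrative; image files and ids are
placeholders):

    -drive id=nvm,file=zns.img,format=raw,if=none
    -drive id=pstate,file=zns-pstate.img,format=raw,if=none
    -device nvme,serial=deadbeef
    -device nvme-ns,drive=nvm,pstate=pstate,iocs=0x2,zns.zcap=24576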

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 docs/specs/nvme.txt   |   7 ++
 hw/block/nvme-ns.h    | 104 ++++++++++++++++++++
 include/block/nvme.h  |  59 ++++++++++-
 hw/block/nvme-ns.c    | 145 +++++++++++++++++++++++++--
 hw/block/nvme.c       | 221 +++++++++++++++++++++++++++++++++++++++++-
 hw/block/trace-events |   8 ++
 6 files changed, 534 insertions(+), 10 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 890990510b14..82def65d4c78 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -22,6 +22,13 @@ nvme-ns Options
   `iocs`; The "I/O Command Set" associated with the namespace. E.g. 0x0 for the
      NVM Command Set (the default), or 0x2 for the Zoned Namespace Command Set.
 
+  `zns.zcap`; If `iocs` is 0x2, this specifies the zone capacity. It is
+     specified in units of logical blocks.
+
+  `zns.zsze`; If `iocs` is 0x2, this specifies the zone size. It is specified
+     in units of logical blocks. If not set, the value depends on zns.zcap;
+     if the zone capacity is a power of two, the zone size is set to that,
+     otherwise it defaults to the next power of two.
 
 Reference Specifications
 ------------------------
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 06c0bc8b32a4..c15bfcfc5a08 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -23,8 +23,19 @@ typedef struct NvmeNamespaceParams {
     uint32_t nsid;
     uint8_t  iocs;
     uint8_t  lbads;
+
+    struct {
+        uint64_t zcap;
+        uint64_t zsze;
+    } zns;
 } NvmeNamespaceParams;
 
+typedef struct NvmeZone {
+    NvmeZoneDescriptor *zd;
+
+    uint64_t wp_staging;
+} NvmeZone;
+
 typedef struct NvmeNamespace {
     DeviceState  parent_obj;
     BlockConf    blkconf;
@@ -40,6 +51,10 @@ typedef struct NvmeNamespace {
             unsigned long *map;
             int64_t       offset;
         } utilization;
+
+        struct {
+            int64_t offset;
+        } zns;
     } pstate;
 
     NvmeNamespaceParams params;
@@ -47,8 +62,20 @@ typedef struct NvmeNamespace {
     struct {
         uint32_t err_rec;
     } features;
+
+    struct {
+        int num_zones;
+
+        NvmeZone           *zones;
+        NvmeZoneDescriptor *zd;
+    } zns;
 } NvmeNamespace;
 
+static inline bool nvme_ns_zoned(NvmeNamespace *ns)
+{
+    return ns->iocs == NVME_IOCS_ZONED;
+}
+
 static inline uint32_t nvme_nsid(NvmeNamespace *ns)
 {
     if (ns) {
@@ -63,17 +90,39 @@ static inline NvmeIdNsNvm *nvme_ns_id_nvm(NvmeNamespace *ns)
     return ns->id_ns[NVME_IOCS_NVM];
 }
 
+static inline NvmeIdNsZns *nvme_ns_id_zoned(NvmeNamespace *ns)
+{
+    return ns->id_ns[NVME_IOCS_ZONED];
+}
+
 static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
 {
     NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
     return &id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
 }
 
+static inline NvmeLBAFE *nvme_ns_lbafe(NvmeNamespace *ns)
+{
+    NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
+    NvmeIdNsZns *id_ns_zns = nvme_ns_id_zoned(ns);
+    return &id_ns_zns->lbafe[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
+}
+
 static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
 {
     return nvme_ns_lbaf(ns)->ds;
 }
 
+static inline uint64_t nvme_ns_zsze(NvmeNamespace *ns)
+{
+    return le64_to_cpu(nvme_ns_lbafe(ns)->zsze);
+}
+
+static inline uint64_t nvme_ns_zsze_bytes(NvmeNamespace *ns)
+{
+    return nvme_ns_zsze(ns) << nvme_ns_lbads(ns);
+}
+
 /* calculate the number of LBAs that the namespace can accommodate */
 static inline uint64_t nvme_ns_nlbas(NvmeNamespace *ns)
 {
@@ -86,8 +135,63 @@ static inline size_t nvme_l2b(NvmeNamespace *ns, uint64_t lba)
     return lba << nvme_ns_lbads(ns);
 }
 
+static inline int nvme_ns_zone_idx(NvmeNamespace *ns, uint64_t lba)
+{
+    return lba / nvme_ns_zsze(ns);
+}
+
+static inline NvmeZone *nvme_ns_get_zone(NvmeNamespace *ns, uint64_t lba)
+{
+    int idx = nvme_ns_zone_idx(ns, lba);
+    if (unlikely(idx >= ns->zns.num_zones)) {
+        return NULL;
+    }
+
+    return &ns->zns.zones[idx];
+}
+
+static inline NvmeZoneState nvme_zs(NvmeZone *zone)
+{
+    return (zone->zd->zs >> 4) & 0xf;
+}
+
+static inline void nvme_zs_set(NvmeZone *zone, NvmeZoneState zs)
+{
+    zone->zd->zs = zs << 4;
+}
+
+static inline bool nvme_ns_zone_wp_valid(NvmeZone *zone)
+{
+    switch (nvme_zs(zone)) {
+    case NVME_ZS_ZSF:
+    case NVME_ZS_ZSRO:
+    case NVME_ZS_ZSO:
+        return false;
+    default:
+        /* the write pointer is valid in the Empty, Opened and Closed states */
+        return true;
+    }
+}
+
+static inline uint64_t nvme_zslba(NvmeZone *zone)
+{
+    return le64_to_cpu(zone->zd->zslba);
+}
+
+static inline uint64_t nvme_zcap(NvmeZone *zone)
+{
+    return le64_to_cpu(zone->zd->zcap);
+}
+
+static inline uint64_t nvme_wp(NvmeZone *zone)
+{
+    return le64_to_cpu(zone->zd->wp);
+}
+
 typedef struct NvmeCtrl NvmeCtrl;
 
+const char *nvme_zs_str(NvmeZone *zone);
+const char *nvme_zs_to_str(NvmeZoneState zs);
+
 int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_flush(NvmeNamespace *ns);
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 792fccf8c81f..2e523c9d97b4 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -390,8 +390,9 @@ enum NvmePmrmscMask {
     (pmrmsc |= (uint64_t)(val & PMRMSC_CBA_MASK) << PMRMSC_CBA_SHIFT)
 
 enum NvmeCommandSet {
-    NVME_IOCS_NVM = 0x0,
-    NVME_IOCS_MAX = 0x1,
+    NVME_IOCS_NVM   = 0x0,
+    NVME_IOCS_ZONED = 0x2,
+    NVME_IOCS_MAX   = 0x3,
 };
 
 enum NvmeSglDescriptorType {
@@ -703,6 +704,11 @@ enum NvmeStatusCodes {
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
+    NVME_ZONE_BOUNDARY_ERROR    = 0x01b8,
+    NVME_ZONE_IS_FULL           = 0x01b9,
+    NVME_ZONE_IS_READ_ONLY      = 0x01ba,
+    NVME_ZONE_IS_OFFLINE        = 0x01bb,
+    NVME_ZONE_INVALID_WRITE     = 0x01bc,
     NVME_WRITE_FAULT            = 0x0280,
     NVME_UNRECOVERED_READ       = 0x0281,
     NVME_E2E_GUARD_ERROR        = 0x0282,
@@ -781,6 +787,31 @@ enum {
     NVME_EFFECTS_UUID_SEL   = 1 << 19,
 };
 
+typedef enum NvmeZoneType {
+    NVME_ZT_SEQ = 0x2,
+} NvmeZoneType;
+
+typedef enum NvmeZoneState {
+    NVME_ZS_ZSE  = 0x1,
+    NVME_ZS_ZSIO = 0x2,
+    NVME_ZS_ZSEO = 0x3,
+    NVME_ZS_ZSC  = 0x4,
+    NVME_ZS_ZSRO = 0xd,
+    NVME_ZS_ZSF  = 0xe,
+    NVME_ZS_ZSO  = 0xf,
+} NvmeZoneState;
+
+typedef struct QEMU_PACKED NvmeZoneDescriptor {
+    uint8_t  zt;
+    uint8_t  zs;
+    uint8_t  za;
+    uint8_t  rsvd3[5];
+    uint64_t zcap;
+    uint64_t zslba;
+    uint64_t wp;
+    uint8_t  rsvd32[32];
+} NvmeZoneDescriptor;
+
 enum NvmeSmartWarn {
     NVME_SMART_SPARE                  = 1 << 0,
     NVME_SMART_TEMPERATURE            = 1 << 1,
@@ -794,6 +825,7 @@ enum NvmeLogIdentifier {
     NVME_LOG_SMART_INFO     = 0x02,
     NVME_LOG_FW_SLOT_INFO   = 0x03,
     NVME_LOG_EFFECTS        = 0x05,
+    NVME_LOG_CHANGED_ZONE_LIST = 0xbf,
 };
 
 typedef struct QEMU_PACKED NvmePSD {
@@ -1099,9 +1131,27 @@ enum NvmeIdNsDps {
     DPS_FIRST_EIGHT = 8,
 };
 
+typedef struct QEMU_PACKED NvmeLBAFE {
+    uint64_t    zsze;
+    uint8_t     zdes;
+    uint8_t     rsvd9[7];
+} NvmeLBAFE;
+
+typedef struct QEMU_PACKED NvmeIdNsZns {
+    uint16_t    zoc;
+    uint16_t    ozcs;
+    uint32_t    mar;
+    uint32_t    mor;
+    uint32_t    rrl;
+    uint32_t    frl;
+    uint8_t     rsvd20[2796];
+    NvmeLBAFE   lbafe[16];
+    uint8_t     rsvd3072[768];
+    uint8_t     vs[256];
+} NvmeIdNsZns;
+
 static inline void _nvme_check_size(void)
 {
     QEMU_BUILD_BUG_ON(sizeof(NvmeBar) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16);
@@ -1118,8 +1168,11 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsNvm) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZns) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsDescr) != 4);
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescriptor) != 64);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeLBAFE) != 16);
 }
 #endif
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 68b2ce3d17c5..872c46f2f2f0 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -30,6 +30,26 @@
 #include "nvme.h"
 #include "nvme-ns.h"
 
+const char *nvme_zs_str(NvmeZone *zone)
+{
+    return nvme_zs_to_str(nvme_zs(zone));
+}
+
+const char *nvme_zs_to_str(NvmeZoneState zs)
+{
+    switch (zs) {
+    case NVME_ZS_ZSE:  return "ZSE";
+    case NVME_ZS_ZSIO: return "ZSIO";
+    case NVME_ZS_ZSEO: return "ZSEO";
+    case NVME_ZS_ZSC:  return "ZSC";
+    case NVME_ZS_ZSRO: return "ZSRO";
+    case NVME_ZS_ZSF:  return "ZSF";
+    case NVME_ZS_ZSO:  return "ZSO";
+    }
+
+    return NULL;
+}
+
 static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp)
 {
     int ret;
@@ -55,6 +75,47 @@ static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp)
     return 0;
 }
 
+static void nvme_ns_zns_init_zones(NvmeNamespace *ns)
+{
+    NvmeZone *zone;
+    NvmeZoneDescriptor *zd;
+    uint64_t zslba, zsze = nvme_ns_zsze(ns);
+
+    for (int i = 0; i < ns->zns.num_zones; i++) {
+        zslba = i * zsze;
+        zone = nvme_ns_get_zone(ns, zslba);
+        zone->zd = &ns->zns.zd[i];
+
+        zd = zone->zd;
+
+        zd->zt = NVME_ZT_SEQ;
+        nvme_zs_set(zone, NVME_ZS_ZSE);
+        zd->zcap = cpu_to_le64(ns->params.zns.zcap);
+        zone->wp_staging = zslba;
+        zd->wp = zd->zslba = cpu_to_le64(zslba);
+    }
+}
+
+static void nvme_ns_init_zoned(NvmeNamespace *ns)
+{
+    NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
+    NvmeIdNsZns *id_ns_zns = nvme_ns_id_zoned(ns);
+
+    for (int i = 0; i <= id_ns->nlbaf; i++) {
+        id_ns_zns->lbafe[i].zsze = ns->params.zns.zsze ?
+            cpu_to_le64(ns->params.zns.zsze) :
+            cpu_to_le64(pow2ceil(ns->params.zns.zcap));
+    }
+
+    ns->zns.num_zones = nvme_ns_nlbas(ns) / nvme_ns_zsze(ns);
+    ns->zns.zones = g_malloc0_n(ns->zns.num_zones, sizeof(NvmeZone));
+    ns->zns.zd = g_malloc0_n(ns->zns.num_zones, sizeof(NvmeZoneDescriptor));
+
+    id_ns->ncap = cpu_to_le64(ns->zns.num_zones * ns->params.zns.zcap);
+
+    id_ns_zns->mar = 0xffffffff;
+    id_ns_zns->mor = 0xffffffff;
+}
+
 static void nvme_ns_init(NvmeNamespace *ns)
 {
     NvmeIdNsNvm *id_ns;
@@ -71,9 +132,14 @@ static void nvme_ns_init(NvmeNamespace *ns)
     id_ns->lbaf[0].ds = ns->params.lbads;
 
     id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
+    id_ns->ncap = id_ns->nsze;
+
+    if (ns->iocs == NVME_IOCS_ZONED) {
+        ns->id_ns[NVME_IOCS_ZONED] = g_new0(NvmeIdNsZns, 1);
+        nvme_ns_init_zoned(ns);
+    }
 
     /* no thin provisioning */
-    id_ns->ncap = id_ns->nsze;
     id_ns->nuse = id_ns->ncap;
 }
 
@@ -82,7 +148,7 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
     BlockBackend *blk = ns->pstate.blk;
     uint64_t perm, shared_perm;
     ssize_t len;
-    size_t pstate_len;
+    size_t util_len, zd_len, pstate_len;
     int ret;
 
     perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
@@ -93,8 +159,10 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
         return ret;
     }
 
-    pstate_len = ROUND_UP(DIV_ROUND_UP(nvme_ns_nlbas(ns), 8),
-                          BDRV_SECTOR_SIZE);
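+    /*
+     * The pstate is laid out as the logical block utilization bitmap
+     * followed by the zone descriptors, padded up to the block size.
+     */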
+    util_len = DIV_ROUND_UP(nvme_ns_nlbas(ns), 8);
+    zd_len = nvme_ns_zoned(ns) ?
+        ns->zns.num_zones * sizeof(NvmeZoneDescriptor) : 0;
+    pstate_len = ROUND_UP(util_len + zd_len, BDRV_SECTOR_SIZE);
 
     len = blk_getlength(blk);
     if (len < 0) {
@@ -105,6 +173,8 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
     unsigned long *map = bitmap_new(nvme_ns_nlbas(ns));
     ns->pstate.utilization.offset = 0;
 
+    ns->pstate.zns.offset = util_len;
+
     if (!len) {
         ret = nvme_blk_truncate(blk, pstate_len, errp);
         if (ret < 0) {
@@ -112,6 +182,10 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
         }
 
         ns->pstate.utilization.map = map;
+
+        if (nvme_ns_zoned(ns)) {
+            nvme_ns_zns_init_zones(ns);
+        }
     } else {
         if (len != pstate_len) {
             error_setg(errp, "pstate size mismatch "
@@ -120,7 +194,7 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
             return -1;
         }
 
-        ret = blk_pread(blk, 0, map, pstate_len);
+        ret = blk_pread(blk, 0, map, util_len);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "could not read pstate");
             return ret;
@@ -132,7 +206,48 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
         ns->pstate.utilization.map = map;
 #endif
 
-        return 0;
+        if (nvme_ns_zoned(ns)) {
+            ret = blk_pread(blk, util_len, ns->zns.zd, zd_len);
+            if (ret < 0) {
+                error_setg_errno(errp, -ret, "could not read pstate");
+                return ret;
+            }
+
+            for (int i = 0; i < ns->zns.num_zones; i++) {
+                NvmeZone *zone = &ns->zns.zones[i];
+                zone->zd = &ns->zns.zd[i];
+
+                zone->wp_staging = nvme_wp(zone);
+
+                switch (nvme_zs(zone)) {
+                case NVME_ZS_ZSE:
+                case NVME_ZS_ZSF:
+                case NVME_ZS_ZSRO:
+                case NVME_ZS_ZSO:
+                    continue;
+
+                case NVME_ZS_ZSC:
+                    if (nvme_wp(zone) == nvme_zslba(zone)) {
+                        nvme_zs_set(zone, NVME_ZS_ZSE);
+                    }
+
+                    continue;
+
+                case NVME_ZS_ZSIO:
+                case NVME_ZS_ZSEO:
+                    zone->zd->wp = zone->zd->zslba;
+                    nvme_zs_set(zone, NVME_ZS_ZSF);
+                }
+            }
+        }
+    }
+
+    if (nvme_ns_zoned(ns)) {
+        ret = blk_pwrite(blk, util_len, ns->zns.zd, zd_len, 0);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "could not write pstate");
+            return ret;
+        }
     }
 
     return 0;
@@ -178,6 +293,20 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
     switch (ns->params.iocs) {
     case NVME_IOCS_NVM:
         break;
+
+    case NVME_IOCS_ZONED:
+        if (!ns->params.zns.zcap) {
+            error_setg(errp, "zns.zcap must be specified");
+            return -1;
+        }
+
+        if (ns->params.zns.zsze && ns->params.zns.zsze < ns->params.zns.zcap) {
+            error_setg(errp, "zns.zsze cannot be less than zns.zcap");
+            return -1;
+        }
+
+        break;
+
     default:
         error_setg(errp, "unsupported iocs");
         return -1;
@@ -209,6 +338,8 @@ int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
          */
         NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
         id_ns->nsfeat |= 0x4;
+    } else if (nvme_ns_zoned(ns)) {
+        nvme_ns_zns_init_zones(ns);
     }
 
     if (nvme_register_namespace(n, ns, errp)) {
@@ -256,6 +387,8 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT8("lbads", NvmeNamespace, params.lbads, BDRV_SECTOR_BITS),
     DEFINE_PROP_DRIVE("pstate", NvmeNamespace, pstate.blk),
     DEFINE_PROP_UINT8("iocs", NvmeNamespace, params.iocs, NVME_IOCS_NVM),
+    DEFINE_PROP_UINT64("zns.zcap", NvmeNamespace, params.zns.zcap, 0),
+    DEFINE_PROP_UINT64("zns.zsze", NvmeNamespace, params.zns.zsze, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index f5578ea3bd63..b0179291b966 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -155,6 +155,16 @@ static const NvmeEffectsLog nvme_effects[NVME_IOCS_MAX] = {
             NVME_EFFECTS_NVM_INITIALIZER,
         },
     },
+
+    [NVME_IOCS_ZONED] = {
+        .acs = {
+            NVME_EFFECTS_ADMIN_INITIALIZER,
+        },
+
+        .iocs = {
+            NVME_EFFECTS_NVM_INITIALIZER,
+        },
+    },
 };
 
 static void nvme_process_sq(void *opaque);
@@ -911,6 +921,112 @@ static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
     }
 }
 
+static uint16_t nvme_check_zone_readable(NvmeCtrl *n, NvmeRequest *req,
+                                         NvmeZone *zone)
+{
+    NvmeZoneState zs = nvme_zs(zone);
+    uint64_t zslba = nvme_zslba(zone);
+
+    if (zs == NVME_ZS_ZSO) {
+        trace_pci_nvme_err_invalid_zone_condition(nvme_cid(req), zslba,
+                                                  NVME_ZS_ZSO);
+        return NVME_ZONE_IS_OFFLINE | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_check_zone_read(NvmeCtrl *n, uint64_t slba, uint32_t nlb,
+                                     NvmeRequest *req, NvmeZone *zone)
+{
+    NvmeNamespace *ns = req->ns;
+    uint64_t zslba = nvme_zslba(zone);
+    uint64_t zsze = nvme_ns_zsze(ns);
+    uint16_t status;
+
+    status = nvme_check_zone_readable(n, req, zone);
+    if (status) {
+        return status;
+    }
+
+    if ((slba + nlb) > (zslba + zsze)) {
+        trace_pci_nvme_err_zone_boundary(nvme_cid(req), slba, nlb, zsze);
+        return NVME_ZONE_BOUNDARY_ERROR | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_check_zone_writeable(NvmeCtrl *n, NvmeRequest *req,
+                                          NvmeZone *zone)
+{
+    NvmeZoneState zs = nvme_zs(zone);
+    uint64_t zslba = nvme_zslba(zone);
+
+    if (zs == NVME_ZS_ZSO) {
+        trace_pci_nvme_err_invalid_zone_condition(nvme_cid(req), zslba,
+                                                  NVME_ZS_ZSO);
+        return NVME_ZONE_IS_OFFLINE | NVME_DNR;
+    }
+
+    switch (zs) {
+    case NVME_ZS_ZSE:
+    case NVME_ZS_ZSC:
+    case NVME_ZS_ZSIO:
+    case NVME_ZS_ZSEO:
+        return NVME_SUCCESS;
+    case NVME_ZS_ZSF:
+        trace_pci_nvme_err_zone_is_full(nvme_cid(req), zslba);
+        return NVME_ZONE_IS_FULL | NVME_DNR;
+    case NVME_ZS_ZSRO:
+        trace_pci_nvme_err_zone_is_read_only(nvme_cid(req), zslba);
+        return NVME_ZONE_IS_READ_ONLY | NVME_DNR;
+    default:
+        break;
+    }
+
+    trace_pci_nvme_err_invalid_zone_condition(nvme_cid(req), zslba, zs);
+    return NVME_INTERNAL_DEV_ERROR | NVME_DNR;
+}
+
+static uint16_t nvme_check_zone_write(NvmeCtrl *n, uint64_t slba, uint32_t nlb,
+                                      NvmeRequest *req, NvmeZone *zone)
+{
+    uint64_t zslba, wp, zcap;
+    uint16_t status;
+
+    zslba = nvme_zslba(zone);
+    wp = zone->wp_staging;
+    zcap = nvme_zcap(zone);
+
+    status = nvme_check_zone_writeable(n, req, zone);
+    if (status) {
+        return status;
+    }
+
+    if ((wp - zslba) + nlb > zcap) {
+        trace_pci_nvme_err_zone_boundary(nvme_cid(req), slba, nlb, zcap);
+        return NVME_ZONE_BOUNDARY_ERROR | NVME_DNR;
+    }
+
+    if (slba != wp) {
+        trace_pci_nvme_err_zone_invalid_write(nvme_cid(req), slba, wp);
+        return NVME_ZONE_INVALID_WRITE | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static inline uint16_t nvme_check_zone(NvmeCtrl *n, uint64_t slba,
+                                       uint32_t nlb, NvmeRequest *req,
+                                       NvmeZone *zone)
+{
+    if (nvme_req_is_write(req)) {
+        return nvme_check_zone_write(n, slba, nlb, req, zone);
+    }
+
+    return nvme_check_zone_read(n, slba, nlb, req, zone);
+}
+
 static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len)
 {
     uint8_t mdts = n->params.mdts;
@@ -989,6 +1105,42 @@ static int nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
     return ret;
 }
 
+static int nvme_zns_commit_zone(NvmeNamespace *ns, NvmeZone *zone)
+{
+    uint64_t zslba;
+    int64_t offset;
+
+    if (!ns->pstate.blk) {
+        return 0;
+    }
+
+    trace_pci_nvme_zns_commit_zone(nvme_nsid(ns), nvme_zslba(zone));
+
+    zslba = nvme_zslba(zone);
+    offset = ns->pstate.zns.offset +
+        nvme_ns_zone_idx(ns, zslba) * sizeof(NvmeZoneDescriptor);
+
+    return blk_pwrite(ns->pstate.blk, offset, zone->zd,
+                      sizeof(NvmeZoneDescriptor), 0);
+}
+
+static void nvme_zns_advance_wp(NvmeRequest *req)
+{
+    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+    uint64_t slba = le64_to_cpu(rw->slba);
+    uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
+    NvmeZone *zone = nvme_ns_get_zone(req->ns, slba);
+    uint64_t wp = nvme_wp(zone);
+
+    wp += nlb;
+    zone->zd->wp = cpu_to_le64(wp);
+    if (wp == nvme_zslba(zone) + nvme_zcap(zone)) {
+        nvme_zs_set(zone, NVME_ZS_ZSF);
+        if (nvme_zns_commit_zone(req->ns, zone) < 0) {
+            req->status = NVME_INTERNAL_DEV_ERROR;
+        }
+    }
+}
 
 static void nvme_rw_cb(void *opaque, int ret)
 {
@@ -1005,6 +1157,10 @@ static void nvme_rw_cb(void *opaque, int ret)
 
     if (!ret) {
         block_acct_done(stats, acct);
+
+        if (nvme_ns_zoned(ns) && nvme_req_is_write(req)) {
+            nvme_zns_advance_wp(req);
+        }
     } else {
         uint16_t status;
 
@@ -1030,6 +1186,26 @@ static void nvme_rw_cb(void *opaque, int ret)
         error_report_err(local_err);
 
         req->status = status;
+
+        if (nvme_ns_zoned(ns)) {
+            NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
+            uint64_t slba = le64_to_cpu(rw->slba);
+
+            NvmeZone *zone = nvme_ns_get_zone(ns, slba);
+
+            /*
+             * Transition the zone to read-only on write fault and offline
+             * on unrecovered read.
+             */
+            NvmeZoneState zs = status == NVME_WRITE_FAULT ?
+                NVME_ZS_ZSRO : NVME_ZS_ZSO;
+
+            nvme_zs_set(zone, zs);
+
+            if (nvme_zns_commit_zone(ns, zone) < 0) {
+                req->status = NVME_INTERNAL_DEV_ERROR;
+            }
+        }
     }
 
     nvme_enqueue_req_completion(nvme_cq(req), req);
@@ -1103,6 +1279,7 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
     NvmeNamespace *ns = req->ns;
+    NvmeZone *zone = NULL;
 
     uint64_t slba = le64_to_cpu(rw->slba);
     uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
@@ -1114,6 +1291,20 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
     trace_pci_nvme_rwz(nvme_cid(req), nvme_io_opc_str(rw->opcode),
                        nvme_nsid(ns), nlb, len, slba);
 
+    if (nvme_ns_zoned(ns)) {
+        zone = nvme_ns_get_zone(ns, slba);
+        if (!zone) {
+            trace_pci_nvme_err_invalid_zone(nvme_cid(req), slba);
+            status = NVME_INVALID_FIELD | NVME_DNR;
+            goto invalid;
+        }
+
+        status = nvme_check_zone(n, slba, nlb, req, zone);
+        if (status) {
+            goto invalid;
+        }
+    }
+
     status = nvme_check_bounds(n, ns, slba, nlb);
     if (status) {
         NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
@@ -1145,6 +1336,31 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
     }
 
     if (is_write) {
+        if (zone) {
+            if (zone->wp_staging != nvme_wp(zone)) {
+                trace_pci_nvme_err_zone_pending_writes(nvme_cid(req),
+                                                       nvme_zslba(zone),
+                                                       nvme_wp(zone),
+                                                       zone->wp_staging);
+            }
+
+            switch (nvme_zs(zone)) {
+            case NVME_ZS_ZSE:
+            case NVME_ZS_ZSC:
+                nvme_zs_set(zone, NVME_ZS_ZSIO);
+
+                if (nvme_zns_commit_zone(req->ns, zone) < 0) {
+                    status = NVME_INTERNAL_DEV_ERROR;
+                    goto invalid;
+                }
+
+                break;
+
+            default:
+                break;
+            }
+
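+            /*
+             * Advance the staging write pointer at submission; the write
+             * pointer in the zone descriptor (zd->wp) is only advanced
+             * when the write completes (see nvme_zns_advance_wp).
+             */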
+            zone->wp_staging += nlb;
+        }
+
         if (nvme_allocate(ns, slba, nlb) < 0) {
             status = NVME_INTERNAL_DEV_ERROR;
             goto invalid;
@@ -2881,7 +3097,7 @@ static void nvme_init_state(NvmeCtrl *n)
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
-    n->iocscs[0] = 1 << NVME_IOCS_NVM;
+    n->iocscs[0] = (1 << NVME_IOCS_NVM) | (1 << NVME_IOCS_ZONED);
     n->features.iocsci = 0;
 }
 
@@ -3032,6 +3248,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     uint8_t *pci_conf = pci_dev->config;
     char *subnqn;
 
+    n->id_ctrl_iocss[NVME_IOCS_NVM] = g_new0(NvmeIdCtrl, 1);
+    n->id_ctrl_iocss[NVME_IOCS_ZONED] = g_new0(NvmeIdCtrl, 1);
+
     id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
     id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
     strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
diff --git a/hw/block/trace-events b/hw/block/trace-events
index b002eb7c8a5c..d46a7a4942bb 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -72,6 +72,7 @@ pci_nvme_enqueue_event_noqueue(int queued) "queued %d"
 pci_nvme_enqueue_event_masked(uint8_t typ) "type 0x%"PRIx8""
 pci_nvme_no_outstanding_aers(void) "ignoring event; no outstanding AERs"
 pci_nvme_enqueue_req_completion(uint16_t cid, uint16_t cqid, uint16_t status) "cid %"PRIu16" cqid %"PRIu16" status 0x%"PRIx16""
+pci_nvme_zns_commit_zone(uint32_t nsid, uint64_t zslba) "nsid 0x%"PRIx32" zslba 0x%"PRIx64""
 pci_nvme_mmio_read(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_mmio_write(uint64_t addr, uint64_t data) "addr 0x%"PRIx64" data 0x%"PRIx64""
 pci_nvme_mmio_doorbell_cq(uint16_t cqid, uint16_t new_head) "cqid %"PRIu16" new_head %"PRIu16""
@@ -97,6 +98,11 @@ pci_nvme_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_err_cfs(void) "controller fatal status"
 pci_nvme_err_aio(uint16_t cid, const char *errname, uint16_t status) "cid %"PRIu16" err '%s' status 0x%"PRIx16""
+pci_nvme_err_zone_is_full(uint16_t cid, uint64_t slba) "cid %"PRIu16" lba 0x%"PRIx64""
+pci_nvme_err_zone_is_read_only(uint16_t cid, uint64_t slba) "cid %"PRIu16" lba 0x%"PRIx64""
+pci_nvme_err_zone_invalid_write(uint16_t cid, uint64_t slba, uint64_t wp) "cid %"PRIu16" lba 0x%"PRIx64" wp 0x%"PRIx64""
+pci_nvme_err_zone_boundary(uint16_t cid, uint64_t slba, uint32_t nlb, uint64_t zcap) "cid %"PRIu16" lba 0x%"PRIx64" nlb %"PRIu32" zcap 0x%"PRIx64""
+pci_nvme_err_zone_pending_writes(uint16_t cid, uint64_t zslba, uint64_t wp, uint64_t wp_staging) "cid %"PRIu16" zslba 0x%"PRIx64" wp 0x%"PRIx64" wp_staging 0x%"PRIx64""
 pci_nvme_err_invalid_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
 pci_nvme_err_invalid_num_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
 pci_nvme_err_invalid_sgl_excess_length(uint16_t cid) "cid %"PRIu16""
@@ -125,6 +131,8 @@ pci_nvme_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx1
 pci_nvme_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
 pci_nvme_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
 pci_nvme_err_invalid_log_page(uint16_t cid, uint16_t lid) "cid %"PRIu16" lid 0x%"PRIx16""
+pci_nvme_err_invalid_zone(uint16_t cid, uint64_t lba) "cid %"PRIu16" lba 0x%"PRIx64""
+pci_nvme_err_invalid_zone_condition(uint16_t cid, uint64_t zslba, uint8_t condition) "cid %"PRIu16" zslba 0x%"PRIx64" condition 0x%"PRIx8""
 pci_nvme_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
 pci_nvme_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
 pci_nvme_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 10/16] hw/block/nvme: add the zone management receive command
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (8 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 09/16] hw/block/nvme: add basic read/write for zoned namespaces Klaus Jensen
@ 2020-09-24 20:45 ` Klaus Jensen
  2020-09-24 20:45 ` [PATCH 11/16] hw/block/nvme: add the zone management send command Klaus Jensen
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Add the Zone Management Receive command. This allows the host to
request a report of the zones on a namespace through the Report Zones
and Extended Report Zones actions, optionally filtered on zone state.
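
The report returned by these actions consists of a 64 byte report
header (NvmeZoneReportHeader) followed by a number of zone descriptors
(NvmeZoneDescriptor). As a sketch, a host might walk a transferred
report buffer like below (illustrative helper, not part of this patch;
assumes a little-endian host and no zone descriptor extensions):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* mirrors NvmeZoneReportHeader */
    struct zone_report_header {
        uint64_t num_zones;
        uint8_t  rsvd[56];
    };

    /* mirrors NvmeZoneDescriptor */
    struct zone_descriptor {
        uint8_t  zt;
        uint8_t  zs;
        uint8_t  za;
        uint8_t  rsvd3[5];
        uint64_t zcap;
        uint64_t zslba;
        uint64_t wp;
        uint8_t  rsvd32[32];
    };

    static void walk_report(const uint8_t *buf)
    {
        const struct zone_report_header *hdr =
            (const struct zone_report_header *)buf;
        const struct zone_descriptor *zd =
            (const struct zone_descriptor *)(buf + sizeof(*hdr));

        /* the zone state is kept in the upper nibble of the ZS field */
        for (uint64_t i = 0; i < hdr->num_zones; i++) {
            printf("zslba 0x%" PRIx64 " zs 0x%x wp 0x%" PRIx64 "\n",
                   zd[i].zslba, (unsigned)(zd[i].zs >> 4), zd[i].wp);
        }
    }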

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme-ns.h    |   8 +++
 hw/block/nvme.h       |   1 +
 include/block/nvme.h  |  46 ++++++++++++++
 hw/block/nvme-ns.c    |  35 ++++++++++-
 hw/block/nvme.c       | 135 ++++++++++++++++++++++++++++++++++++++++++
 hw/block/trace-events |   1 +
 6 files changed, 223 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index c15bfcfc5a08..5a695334a052 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -27,11 +27,13 @@ typedef struct NvmeNamespaceParams {
     struct {
         uint64_t zcap;
         uint64_t zsze;
+        uint8_t  zdes;
     } zns;
 } NvmeNamespaceParams;
 
 typedef struct NvmeZone {
     NvmeZoneDescriptor *zd;
+    uint8_t            *zde;
 
     uint64_t wp_staging;
 } NvmeZone;
@@ -68,6 +70,7 @@ typedef struct NvmeNamespace {
 
         NvmeZone           *zones;
         NvmeZoneDescriptor *zd;
+        uint8_t            *zde;
     } zns;
 } NvmeNamespace;
 
@@ -160,6 +163,11 @@ static inline void nvme_zs_set(NvmeZone *zone, NvmeZoneState zs)
     zone->zd->zs = zs << 4;
 }
 
+static inline size_t nvme_ns_zdes_bytes(NvmeNamespace *ns)
+{
+    return ns->params.zns.zdes << 6;
+}
+
 static inline bool nvme_ns_zone_wp_valid(NvmeZone *zone)
 {
     switch (nvme_zs(zone)) {
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index f66ed9ab7eff..523eef0bcad8 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -71,6 +71,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
     case NVME_CMD_WRITE:            return "NVME_NVM_CMD_WRITE";
     case NVME_CMD_READ:             return "NVME_NVM_CMD_READ";
     case NVME_CMD_WRITE_ZEROES:     return "NVME_NVM_CMD_WRITE_ZEROES";
+    case NVME_CMD_ZONE_MGMT_RECV:   return "NVME_ZONED_CMD_ZONE_MGMT_RECV";
     default:                        return "NVME_NVM_CMD_UNKNOWN";
     }
 }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 2e523c9d97b4..9bacf48ee9e9 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -481,6 +481,7 @@ enum NvmeIoCommands {
     NVME_CMD_COMPARE            = 0x05,
     NVME_CMD_WRITE_ZEROES       = 0x08,
     NVME_CMD_DSM                = 0x09,
+    NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
 };
 
 typedef struct QEMU_PACKED NvmeDeleteQ {
@@ -593,6 +594,44 @@ enum {
     NVME_RW_PRINFO_PRCHK_REF    = 1 << 10,
 };
 
+typedef struct QEMU_PACKED NvmeZoneManagementRecvCmd {
+    uint8_t     opcode;
+    uint8_t     flags;
+    uint16_t    cid;
+    uint32_t    nsid;
+    uint8_t     rsvd8[16];
+    NvmeCmdDptr dptr;
+    uint64_t    slba;
+    uint32_t    numdw;
+    uint8_t     zra;
+    uint8_t     zrasp;
+    uint8_t     zrasf;
+    uint8_t     rsvd55[9];
+} NvmeZoneManagementRecvCmd;
+
+typedef enum NvmeZoneManagementRecvAction {
+    NVME_CMD_ZONE_MGMT_RECV_REPORT_ZONES          = 0x0,
+    NVME_CMD_ZONE_MGMT_RECV_EXTENDED_REPORT_ZONES = 0x1,
+} NvmeZoneManagementRecvAction;
+
+typedef enum NvmeZoneManagementRecvActionSpecificField {
+    NVME_CMD_ZONE_MGMT_RECV_LIST_ALL  = 0x0,
+    NVME_CMD_ZONE_MGMT_RECV_LIST_ZSE  = 0x1,
+    NVME_CMD_ZONE_MGMT_RECV_LIST_ZSIO = 0x2,
+    NVME_CMD_ZONE_MGMT_RECV_LIST_ZSEO = 0x3,
+    NVME_CMD_ZONE_MGMT_RECV_LIST_ZSC  = 0x4,
+    NVME_CMD_ZONE_MGMT_RECV_LIST_ZSF  = 0x5,
+    NVME_CMD_ZONE_MGMT_RECV_LIST_ZSRO = 0x6,
+    NVME_CMD_ZONE_MGMT_RECV_LIST_ZSO  = 0x7,
+} NvmeZoneManagementRecvActionSpecificField;
+
+#define NVME_CMD_ZONE_MGMT_RECEIVE_PARTIAL 0x1
+
+typedef struct QEMU_PACKED NvmeZoneReportHeader {
+    uint64_t num_zones;
+    uint8_t  rsvd[56];
+} NvmeZoneReportHeader;
+
 typedef struct QEMU_PACKED NvmeDsmCmd {
     uint8_t     opcode;
     uint8_t     flags;
@@ -812,6 +851,12 @@ typedef struct QEMU_PACKED NvmeZoneDescriptor {
     uint8_t  rsvd32[32];
 } NvmeZoneDescriptor;
 
+#define NVME_ZA_ZDEV (1 << 7)
+
+#define NVME_ZA_SET(za, attrs)   ((za) |= (attrs))
+#define NVME_ZA_CLEAR(za, attrs) ((za) &= ~(attrs))
+#define NVME_ZA_CLEAR_ALL(za)    ((za) = 0x0)
+
 enum NvmeSmartWarn {
     NVME_SMART_SPARE                  = 1 << 0,
     NVME_SMART_TEMPERATURE            = 1 << 1,
@@ -1162,6 +1207,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) != 64);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneManagementRecvCmd) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 872c46f2f2f0..0fb196c7103e 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -85,6 +85,9 @@ static void nvme_ns_zns_init_zones(NvmeNamespace *ns)
         zslba = i * zsze;
         zone = nvme_ns_get_zone(ns, zslba);
         zone->zd = &ns->zns.zd[i];
+        if (ns->params.zns.zdes) {
+            zone->zde = &ns->zns.zde[i * nvme_ns_zdes_bytes(ns)];
+        }
 
         zd = zone->zd;
 
@@ -104,11 +107,15 @@ static void nvme_ns_init_zoned(NvmeNamespace *ns)
     for (int i = 0; i <= id_ns->nlbaf; i++) {
         id_ns_zns->lbafe[i].zsze = ns->params.zns.zsze ?
             cpu_to_le64(ns->params.zns.zsze) :
             cpu_to_le64(pow2ceil(ns->params.zns.zcap));
+        id_ns_zns->lbafe[i].zdes = ns->params.zns.zdes;
     }
 
     ns->zns.num_zones = nvme_ns_nlbas(ns) / nvme_ns_zsze(ns);
     ns->zns.zones = g_malloc0_n(ns->zns.num_zones, sizeof(NvmeZone));
     ns->zns.zd = g_malloc0_n(ns->zns.num_zones, sizeof(NvmeZoneDescriptor));
+    if (ns->params.zns.zdes) {
+        ns->zns.zde = g_malloc0_n(ns->zns.num_zones, nvme_ns_zdes_bytes(ns));
+    }
 
     id_ns->ncap = ns->zns.num_zones * ns->params.zns.zcap;
 
@@ -148,7 +155,7 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
     BlockBackend *blk = ns->pstate.blk;
     uint64_t perm, shared_perm;
     ssize_t len;
-    size_t util_len, zd_len, pstate_len;
+    size_t util_len, zd_len, zde_len, pstate_len;
     int ret;
 
     perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
@@ -162,7 +169,9 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
     util_len = DIV_ROUND_UP(nvme_ns_nlbas(ns), 8);
     zd_len = nvme_ns_zoned(ns) ?
         ns->zns.num_zones * sizeof(NvmeZoneDescriptor) : 0;
-    pstate_len = ROUND_UP(util_len + zd_len, BDRV_SECTOR_SIZE);
+    zde_len = nvme_ns_zoned(ns) ?
+        ns->zns.num_zones * nvme_ns_zdes_bytes(ns) : 0;
+    pstate_len = ROUND_UP(util_len + zd_len + zde_len, BDRV_SECTOR_SIZE);
 
     len = blk_getlength(blk);
     if (len < 0) {
@@ -213,9 +222,19 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
                 return ret;
             }
 
+            if (zde_len) {
+                ret = blk_pread(blk, util_len + zd_len, ns->zns.zde,
+                                zde_len);
+                if (ret < 0) {
+                    error_setg_errno(errp, -ret, "could not read pstate");
+                    return ret;
+                }
+            }
+
             for (int i = 0; i < ns->zns.num_zones; i++) {
                 NvmeZone *zone = &ns->zns.zones[i];
                 zone->zd = &ns->zns.zd[i];
+                if (ns->params.zns.zdes) {
+                    zone->zde = &ns->zns.zde[i * nvme_ns_zdes_bytes(ns)];
+                }
 
                 zone->wp_staging = nvme_wp(zone);
 
@@ -227,7 +246,8 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
                     continue;
 
                 case NVME_ZS_ZSC:
-                    if (nvme_wp(zone) == nvme_zslba(zone)) {
+                    if (nvme_wp(zone) == nvme_zslba(zone) &&
+                        !(zone->zd->za & NVME_ZA_ZDEV)) {
                         nvme_zs_set(zone, NVME_ZS_ZSE);
                     }
 
@@ -248,6 +268,14 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
             error_setg_errno(errp, -ret, "could not write pstate");
             return ret;
         }
+
+        if (zde_len) {
+            ret = blk_pwrite(blk, util_len + zd_len, ns->zns.zde, zde_len, 0);
+            if (ret < 0) {
+                error_setg_errno(errp, -ret, "could not write pstate");
+                return ret;
+            }
+        }
     }
 
     return 0;
@@ -389,6 +417,7 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT8("iocs", NvmeNamespace, params.iocs, NVME_IOCS_NVM),
     DEFINE_PROP_UINT64("zns.zcap", NvmeNamespace, params.zns.zcap, 0),
     DEFINE_PROP_UINT64("zns.zsze", NvmeNamespace, params.zns.zsze, 0),
+    DEFINE_PROP_UINT8("zns.zdes", NvmeNamespace, params.zns.zdes, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index b0179291b966..43ae89a0a6cb 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -163,6 +163,7 @@ static const NvmeEffectsLog nvme_effects[NVME_IOCS_MAX] = {
 
         .iocs = {
             NVME_EFFECTS_NVM_INITIALIZER,
+            [NVME_CMD_ZONE_MGMT_RECV] = NVME_EFFECTS_CSUPP,
         },
     },
 };
@@ -1201,6 +1202,9 @@ static void nvme_rw_cb(void *opaque, int ret)
                 NVME_ZS_ZSRO : NVME_ZS_ZSO;
 
             nvme_zs_set(zone, zs);
+            if (zs == NVME_ZS_ZSO) {
+                NVME_ZA_CLEAR_ALL(zone->zd->za);
+            }
 
             if (nvme_zns_commit_zone(ns, zone) < 0) {
                 req->status = NVME_INTERNAL_DEV_ERROR;
@@ -1269,6 +1273,135 @@ static uint16_t nvme_do_aio(BlockBackend *blk, int64_t offset, size_t len,
     return NVME_NO_COMPLETE;
 }
 
+static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeZoneManagementRecvCmd *recv;
+    NvmeZoneManagementRecvAction zra;
+    NvmeZoneManagementRecvActionSpecificField zrasp;
+    NvmeNamespace *ns = req->ns;
+    NvmeZone *zone;
+
+    uint8_t *buf, *bufp, zs_list;
+    uint64_t slba;
+    int num_zones = 0, zidx = 0, zidx_begin;
+    uint16_t zes, status;
+    size_t len;
+
+    recv = (NvmeZoneManagementRecvCmd *) &req->cmd;
+
+    zra = recv->zra;
+    zrasp = recv->zrasp;
+    slba = le64_to_cpu(recv->slba);
+    len = (le32_to_cpu(recv->numdw) + 1) << 2;
+
+    if (!nvme_ns_zoned(ns)) {
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
+    trace_pci_nvme_zone_mgmt_recv(nvme_cid(req), nvme_nsid(ns), slba, len,
+                                  zra, zrasp, recv->zrasf);
+
+    if (!len) {
+        return NVME_SUCCESS;
+    }
+
+    switch (zrasp) {
+    case NVME_CMD_ZONE_MGMT_RECV_LIST_ALL:
+        zs_list = 0;
+        break;
+
+    case NVME_CMD_ZONE_MGMT_RECV_LIST_ZSE:
+        zs_list = NVME_ZS_ZSE;
+        break;
+
+    case NVME_CMD_ZONE_MGMT_RECV_LIST_ZSIO:
+        zs_list = NVME_ZS_ZSIO;
+        break;
+
+    case NVME_CMD_ZONE_MGMT_RECV_LIST_ZSEO:
+        zs_list = NVME_ZS_ZSEO;
+        break;
+
+    case NVME_CMD_ZONE_MGMT_RECV_LIST_ZSC:
+        zs_list = NVME_ZS_ZSC;
+        break;
+
+    case NVME_CMD_ZONE_MGMT_RECV_LIST_ZSF:
+        zs_list = NVME_ZS_ZSF;
+        break;
+
+    case NVME_CMD_ZONE_MGMT_RECV_LIST_ZSRO:
+        zs_list = NVME_ZS_ZSRO;
+        break;
+
+    case NVME_CMD_ZONE_MGMT_RECV_LIST_ZSO:
+        zs_list = NVME_ZS_ZSO;
+        break;
+    default:
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    status = nvme_check_mdts(n, len);
+    if (status) {
+        return status;
+    }
+
+    if (!nvme_ns_get_zone(ns, slba)) {
+        trace_pci_nvme_err_invalid_zone(nvme_cid(req), slba);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zidx_begin = zidx = nvme_ns_zone_idx(ns, slba);
+    zes = sizeof(NvmeZoneDescriptor);
+    if (zra == NVME_CMD_ZONE_MGMT_RECV_EXTENDED_REPORT_ZONES) {
+        zes += nvme_ns_zdes_bytes(ns);
+    }
+
+    buf = bufp = g_malloc0(len);
+    bufp += sizeof(NvmeZoneReportHeader);
+
+    while ((bufp + zes) - buf <= len && zidx < ns->zns.num_zones) {
+        zone = &ns->zns.zones[zidx++];
+
+        if (zs_list && zs_list != nvme_zs(zone)) {
+            continue;
+        }
+
+        num_zones++;
+
+        memcpy(bufp, zone->zd, sizeof(NvmeZoneDescriptor));
+
+        if (zra == NVME_CMD_ZONE_MGMT_RECV_EXTENDED_REPORT_ZONES) {
+            memcpy(bufp + sizeof(NvmeZoneDescriptor), zone->zde,
+                   nvme_ns_zdes_bytes(ns));
+        }
+
+        bufp += zes;
+    }
+
+    if (!(recv->zrasf & NVME_CMD_ZONE_MGMT_RECEIVE_PARTIAL)) {
+        if (!zs_list) {
+            num_zones = ns->zns.num_zones - zidx_begin;
+        } else {
+            num_zones = 0;
+            for (int i = zidx_begin; i < ns->zns.num_zones; i++) {
+                zone = &ns->zns.zones[i];
+
+                if (zs_list == nvme_zs(zone)) {
+                    num_zones++;
+                }
+            }
+        }
+    }
+
+    stq_le_p(buf, (uint64_t)num_zones);
+
+    status = nvme_dma(n, buf, len, DMA_DIRECTION_FROM_DEVICE, req);
+    g_free(buf);
+
+    return status;
+}
+
 static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeNamespace *ns = req->ns;
@@ -1408,6 +1541,8 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
         return nvme_rwz(n, req);
+    case NVME_CMD_ZONE_MGMT_RECV:
+        return nvme_zone_mgmt_recv(n, req);
     default:
         trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index d46a7a4942bb..a2671dadb1e8 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -42,6 +42,7 @@ pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode, cons
 pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
 pci_nvme_rwz(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t len, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" len %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
+pci_nvme_zone_mgmt_recv(uint16_t cid, uint32_t nsid, uint64_t slba, uint64_t len, uint8_t zra, uint8_t zrasp, uint8_t zrasf) "cid %"PRIu16" nsid %"PRIu32" slba 0x%"PRIx64" len %"PRIu64" zra 0x%"PRIx8" zrasp 0x%"PRIx8" zrasf 0x%"PRIx8""
 pci_nvme_allocate(uint32_t ns, uint64_t slba, uint32_t nlb) "nsid %"PRIu32" slba 0x%"PRIx64" nlb %"PRIu32""
 pci_nvme_do_aio(uint16_t cid, uint8_t opc, const char *opname, const char *blkname, int64_t offset, size_t len) "cid %"PRIu16" opc 0x%"PRIx8" opname '%s' blk '%s' offset %"PRId64" len %zu"
 pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
-- 
2.28.0




* [PATCH 11/16] hw/block/nvme: add the zone management send command
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (9 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 10/16] hw/block/nvme: add the zone management receive command Klaus Jensen
@ 2020-09-24 20:45 ` Klaus Jensen
  2020-09-24 20:45 ` [PATCH 12/16] hw/block/nvme: add the zone append command Klaus Jensen
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Add the Zone Management Send command.

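The Zone Send Action selects one of Close, Finish, Open, Reset, Offline
or Set Zone Descriptor Extension; when the Select All bit in the zone
send flags is set, the action is applied to all zones in an eligible
state (Select All is invalid for Set Zone Descriptor Extension). As a
rough sketch of how a guest might exercise this (assuming a ZNS-aware
nvme-cli and that the zoned namespace shows up as /dev/nvme0n2; neither
is part of this patch):

    # transition the zone at zslba 0x0 to explicitly opened
    nvme zns open-zone /dev/nvme0n2 --start-lba=0x0

    # reset all zones (select all)
    nvme zns reset-zone /dev/nvme0n2 --select-all
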
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.h       |   1 +
 include/block/nvme.h  |  29 +++
 hw/block/nvme.c       | 552 ++++++++++++++++++++++++++++++++++++++++--
 hw/block/trace-events |  11 +
 4 files changed, 574 insertions(+), 19 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 523eef0bcad8..c704663e0a3e 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -71,6 +71,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
     case NVME_CMD_WRITE:            return "NVME_NVM_CMD_WRITE";
     case NVME_CMD_READ:             return "NVME_NVM_CMD_READ";
     case NVME_CMD_WRITE_ZEROES:     return "NVME_NVM_CMD_WRITE_ZEROES";
+    case NVME_CMD_ZONE_MGMT_SEND:   return "NVME_ZONED_CMD_ZONE_MGMT_SEND";
     case NVME_CMD_ZONE_MGMT_RECV:   return "NVME_ZONED_CMD_ZONE_MGMT_RECV";
     default:                        return "NVME_NVM_CMD_UNKNOWN";
     }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 9bacf48ee9e9..967b42eb5da7 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -481,6 +481,7 @@ enum NvmeIoCommands {
     NVME_CMD_COMPARE            = 0x05,
     NVME_CMD_WRITE_ZEROES       = 0x08,
     NVME_CMD_DSM                = 0x09,
+    NVME_CMD_ZONE_MGMT_SEND     = 0x79,
     NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
 };
 
@@ -594,6 +595,32 @@ enum {
     NVME_RW_PRINFO_PRCHK_REF    = 1 << 10,
 };
 
+typedef struct QEMU_PACKED NvmeZoneManagementSendCmd {
+    uint8_t     opcode;
+    uint8_t     flags;
+    uint16_t    cid;
+    uint32_t    nsid;
+    uint32_t    rsvd8[4];
+    NvmeCmdDptr dptr;
+    uint64_t    slba;
+    uint32_t    rsvd48;
+    uint8_t     zsa;
+    uint8_t     zsflags;
+    uint16_t    rsvd54;
+    uint32_t    rsvd56[2];
+} NvmeZoneManagementSendCmd;
+
+#define NVME_CMD_ZONE_MGMT_SEND_SELECT_ALL(zsflags) ((zsflags) & 0x1)
+
+typedef enum NvmeZoneManagementSendAction {
+    NVME_CMD_ZONE_MGMT_SEND_CLOSE   = 0x1,
+    NVME_CMD_ZONE_MGMT_SEND_FINISH  = 0x2,
+    NVME_CMD_ZONE_MGMT_SEND_OPEN    = 0x3,
+    NVME_CMD_ZONE_MGMT_SEND_RESET   = 0x4,
+    NVME_CMD_ZONE_MGMT_SEND_OFFLINE = 0x5,
+    NVME_CMD_ZONE_MGMT_SEND_SET_ZDE = 0x10,
+} NvmeZoneManagementSendAction;
+
 typedef struct QEMU_PACKED NvmeZoneManagementRecvCmd {
     uint8_t     opcode;
     uint8_t     flags;
@@ -748,6 +775,7 @@ enum NvmeStatusCodes {
     NVME_ZONE_IS_READ_ONLY      = 0x01ba,
     NVME_ZONE_IS_OFFLINE        = 0x01bb,
     NVME_ZONE_INVALID_WRITE     = 0x01bc,
+    NVME_INVALID_ZONE_STATE_TRANSITION = 0x01bf,
     NVME_WRITE_FAULT            = 0x0280,
     NVME_UNRECOVERED_READ       = 0x0281,
     NVME_E2E_GUARD_ERROR        = 0x0282,
@@ -1207,6 +1235,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdentify) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeRwCmd) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDsmCmd) != 64);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeZoneManagementSendCmd) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeZoneManagementRecvCmd) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeRangeType) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 43ae89a0a6cb..51611ed32600 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -164,6 +164,8 @@ static const NvmeEffectsLog nvme_effects[NVME_IOCS_MAX] = {
         .iocs = {
             NVME_EFFECTS_NVM_INITIALIZER,
             [NVME_CMD_ZONE_MGMT_RECV] = NVME_EFFECTS_CSUPP,
+            [NVME_CMD_ZONE_MGMT_SEND] = NVME_EFFECTS_CSUPP |
+                                        NVME_EFFECTS_LBCC,
         },
     },
 };
@@ -1064,21 +1066,20 @@ static inline uint16_t nvme_check_dulbe(NvmeNamespace *ns, uint64_t slba,
     return NVME_SUCCESS;
 }
 
-static int nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
+static int __nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb,
+                           bool deallocate)
 {
     int nlongs, first;
     int64_t offset;
     unsigned long *map, *src;
     int ret;
 
-    if (!(ns->pstate.blk && nvme_check_dulbe(ns, slba, nlb))) {
-        return 0;
+    if (deallocate) {
+        bitmap_clear(ns->pstate.utilization.map, slba, nlb);
+    } else {
+        bitmap_set(ns->pstate.utilization.map, slba, nlb);
     }
 
-    trace_pci_nvme_allocate(nvme_nsid(ns), slba, nlb);
-
-    bitmap_set(ns->pstate.utilization.map, slba, nlb);
-
     nlongs = BITS_TO_LONGS(nlb) + 1;
     first = slba / BITS_PER_LONG;
     offset = ns->pstate.utilization.offset + first * sizeof(unsigned long);
@@ -1106,6 +1107,28 @@ static int nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
     return ret;
 }
 
+static int nvme_allocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
+{
+    if (!(ns->pstate.blk && nvme_check_dulbe(ns, slba, nlb))) {
+        return 0;
+    }
+
+    trace_pci_nvme_allocate(nvme_nsid(ns), slba, nlb);
+
+    return __nvme_allocate(ns, slba, nlb, false /* deallocate */);
+}
+
+static int nvme_deallocate(NvmeNamespace *ns, uint64_t slba, uint32_t nlb)
+{
+    if (!ns->pstate.blk) {
+        return 0;
+    }
+
+    trace_pci_nvme_deallocate(nvme_nsid(ns), slba, nlb);
+
+    return __nvme_allocate(ns, slba, nlb, true /* deallocate */);
+}
+
 static int nvme_zns_commit_zone(NvmeNamespace *ns, NvmeZone *zone)
 {
     uint64_t zslba;
@@ -1125,6 +1148,139 @@ static int nvme_zns_commit_zone(NvmeNamespace *ns, NvmeZone *zone)
                       sizeof(NvmeZoneDescriptor), 0);
 }
 
+static int nvme_zns_commit_zde(NvmeNamespace *ns, NvmeZone *zone)
+{
+    uint64_t zslba;
+    int zidx;
+    size_t zdes_bytes;
+    int64_t offset;
+
+    if (!ns->pstate.blk) {
+        return 0;
+    }
+
+    zslba = nvme_zslba(zone);
+    zidx = nvme_ns_zone_idx(ns, zslba);
+    zdes_bytes = nvme_ns_zdes_bytes(ns);
+    offset = ns->pstate.zns.offset +
+        ns->zns.num_zones * sizeof(NvmeZoneDescriptor) +
+        zidx * zdes_bytes;
+
+    return blk_pwrite(ns->pstate.blk, offset, zone->zde, zdes_bytes, 0);
+}
+
+static inline void nvme_zone_reset_wp(NvmeZone *zone)
+{
+    zone->zd->wp = zone->zd->zslba;
+    zone->wp_staging = nvme_zslba(zone);
+}
+
+static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
+                                    NvmeZoneState to)
+{
+    NvmeZoneState from = nvme_zs(zone);
+
+    /* fast path */
+    if (from == to) {
+        return NVME_SUCCESS;
+    }
+
+    switch (from) {
+    case NVME_ZS_ZSE:
+        break;
+
+    case NVME_ZS_ZSIO:
+    case NVME_ZS_ZSEO:
+        switch (to) {
+        case NVME_ZS_ZSEO:
+            break;
+
+        case NVME_ZS_ZSE:
+            nvme_zone_reset_wp(zone);
+
+            /* fallthrough */
+
+        case NVME_ZS_ZSO:
+            NVME_ZA_CLEAR_ALL(zone->zd->za);
+
+            /* fallthrough */
+
+        case NVME_ZS_ZSF:
+        case NVME_ZS_ZSRO:
+        case NVME_ZS_ZSC:
+            break;
+
+        default:
+            return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+        }
+
+        break;
+
+    case NVME_ZS_ZSC:
+        switch (to) {
+        case NVME_ZS_ZSE:
+            nvme_zone_reset_wp(zone);
+
+            /* fallthrough */
+
+        case NVME_ZS_ZSO:
+            NVME_ZA_CLEAR_ALL(zone->zd->za);
+
+            /* fallthrough */
+
+        case NVME_ZS_ZSF:
+        case NVME_ZS_ZSRO:
+        case NVME_ZS_ZSIO:
+        case NVME_ZS_ZSEO:
+            break;
+
+        default:
+            return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+        }
+
+        break;
+
+    case NVME_ZS_ZSRO:
+        switch (to) {
+        case NVME_ZS_ZSO:
+            NVME_ZA_CLEAR_ALL(zone->zd->za);
+            break;
+
+        default:
+            return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+        }
+
+        break;
+
+    case NVME_ZS_ZSF:
+        switch (to) {
+        case NVME_ZS_ZSE:
+            nvme_zone_reset_wp(zone);
+
+            /* fallthrough */
+
+        case NVME_ZS_ZSO:
+            NVME_ZA_CLEAR_ALL(zone->zd->za);
+
+            /* fallthrough */
+
+        case NVME_ZS_ZSRO:
+            break;
+
+        default:
+            return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+        }
+
+        break;
+
+    default:
+        return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+    }
+
+    nvme_zs_set(zone, to);
+    return NVME_SUCCESS;
+}
+
 static void nvme_zns_advance_wp(NvmeRequest *req)
 {
     NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
@@ -1136,7 +1292,7 @@ static void nvme_zns_advance_wp(NvmeRequest *req)
     wp += nlb;
     zone->zd->wp = cpu_to_le64(wp);
     if (wp == nvme_zslba(zone) + nvme_zcap(zone)) {
-        nvme_zs_set(zone, NVME_ZS_ZSF);
+        nvme_zrm_transition(req->ns, zone, NVME_ZS_ZSF);
         if (nvme_zns_commit_zone(req->ns, zone) < 0) {
             req->status = NVME_INTERNAL_DEV_ERROR;
         }
@@ -1201,11 +1357,7 @@ static void nvme_rw_cb(void *opaque, int ret)
             NvmeZoneState zs = status == NVME_WRITE_FAULT ?
                 NVME_ZS_ZSRO : NVME_ZS_ZSO;
 
-            nvme_zs_set(zone, zs);
-            if (zs == NVME_ZS_ZSO) {
-                NVME_ZA_CLEAR_ALL(zone->zd->za);
-            }
-
+            nvme_zrm_transition(ns, zone, zs);
             if (nvme_zns_commit_zone(ns, zone) < 0) {
                 req->status = NVME_INTERNAL_DEV_ERROR;
             }
@@ -1273,6 +1425,364 @@ static uint16_t nvme_do_aio(BlockBackend *blk, int64_t offset, size_t len,
     return NVME_NO_COMPLETE;
 }
 
+static uint16_t nvme_zone_mgmt_send_close(NvmeCtrl *n, NvmeRequest *req,
+                                          NvmeZone *zone)
+{
+    NvmeNamespace *ns = req->ns;
+    uint16_t status;
+
+    trace_pci_nvme_zone_mgmt_send_close(nvme_cid(req), nvme_nsid(ns),
+                                        nvme_zslba(zone), nvme_zs_str(zone));
+
+    switch (nvme_zs(zone)) {
+    case NVME_ZS_ZSC:
+        return NVME_SUCCESS;
+
+    case NVME_ZS_ZSE:
+        /*
+         * The state machine in nvme_zrm_transition allows zones to transition
+         * from ZSE to ZSC. That transition is only valid as part of the Set
+         * Zone Descriptor Extension action, so do an early check here.
+         */
+        return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+
+    default:
+        break;
+    }
+
+    status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSC);
+    if (status) {
+        return status;
+    }
+
+    if (nvme_zns_commit_zone(ns, zone) < 0) {
+        return NVME_INTERNAL_DEV_ERROR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_zone_mgmt_send_finish(NvmeCtrl *n, NvmeRequest *req,
+                                           NvmeZone *zone)
+{
+    NvmeNamespace *ns = req->ns;
+    uint16_t status;
+
+    trace_pci_nvme_zone_mgmt_send_finish(nvme_cid(req), nvme_nsid(ns),
+                                         nvme_zslba(zone), nvme_zs_str(zone));
+
+    if (nvme_zs(zone) == NVME_ZS_ZSF) {
+        return NVME_SUCCESS;
+    }
+
+    status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSF);
+    if (status) {
+        return status;
+    }
+
+    if (nvme_zns_commit_zone(ns, zone) < 0) {
+        return NVME_INTERNAL_DEV_ERROR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_zone_mgmt_send_open(NvmeCtrl *n, NvmeRequest *req,
+                                         NvmeZone *zone)
+{
+    NvmeNamespace *ns = req->ns;
+    uint16_t status;
+
+    trace_pci_nvme_zone_mgmt_send_open(nvme_cid(req), nvme_nsid(ns),
+                                       nvme_zslba(zone), nvme_zs_str(zone));
+
+    if (nvme_zs(zone) == NVME_ZS_ZSEO) {
+        return NVME_SUCCESS;
+    }
+
+    status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSEO);
+    if (status) {
+        return status;
+    }
+
+    if (nvme_zns_commit_zone(ns, zone) < 0) {
+        return NVME_INTERNAL_DEV_ERROR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_zone_mgmt_send_reset(NvmeCtrl *n, NvmeRequest *req,
+                                          NvmeZone *zone)
+{
+    NvmeNamespace *ns = req->ns;
+    uint64_t zslba = nvme_zslba(zone);
+    uint64_t zcap = nvme_zcap(zone);
+
+    trace_pci_nvme_zone_mgmt_send_reset(nvme_cid(req), nvme_nsid(ns),
+                                        nvme_zslba(zone), nvme_zs_str(zone));
+
+    switch (nvme_zs(zone)) {
+    case NVME_ZS_ZSIO:
+    case NVME_ZS_ZSEO:
+    case NVME_ZS_ZSC:
+    case NVME_ZS_ZSF:
+        if (blk_pdiscard(ns->blkconf.blk, nvme_l2b(ns, zslba),
+                         nvme_l2b(ns, zcap)) < 0) {
+            return NVME_INTERNAL_DEV_ERROR;
+        }
+
+        if (nvme_deallocate(ns, zslba, zcap) < 0) {
+            return NVME_INTERNAL_DEV_ERROR;
+        }
+
+        nvme_zrm_transition(ns, zone, NVME_ZS_ZSE);
+        if (nvme_zns_commit_zone(ns, zone) < 0) {
+            return NVME_INTERNAL_DEV_ERROR;
+        }
+
+        /* fallthrough */
+
+    case NVME_ZS_ZSE:
+        return NVME_SUCCESS;
+
+    default:
+        break;
+    }
+
+    return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+}
+
+static uint16_t nvme_zone_mgmt_send_offline(NvmeCtrl *n, NvmeRequest *req,
+                                            NvmeZone *zone)
+{
+    NvmeNamespace *ns = req->ns;
+
+    trace_pci_nvme_zone_mgmt_send_offline(nvme_cid(req), nvme_nsid(ns),
+                                          nvme_zslba(zone), nvme_zs_str(zone));
+
+    switch (nvme_zs(zone)) {
+    case NVME_ZS_ZSRO:
+        if (nvme_deallocate(ns, nvme_zslba(zone), nvme_zcap(zone)) < 0) {
+            return NVME_INTERNAL_DEV_ERROR;
+        }
+
+        nvme_zrm_transition(ns, zone, NVME_ZS_ZSO);
+        if (nvme_zns_commit_zone(ns, zone) < 0) {
+            return NVME_INTERNAL_DEV_ERROR;
+        }
+
+        /* fallthrough */
+
+    case NVME_ZS_ZSO:
+        return NVME_SUCCESS;
+
+    default:
+        break;
+    }
+
+    return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+}
+
+static uint16_t nvme_zone_mgmt_send_set_zde(NvmeCtrl *n, NvmeRequest *req,
+                                            NvmeZone *zone)
+{
+    NvmeNamespace *ns = req->ns;
+    uint16_t status;
+
+    trace_pci_nvme_zone_mgmt_send_set_zde(nvme_cid(req), nvme_nsid(ns),
+                                          nvme_zslba(zone), nvme_zs_str(zone));
+
+    if (nvme_zs(zone) != NVME_ZS_ZSE) {
+        trace_pci_nvme_err_invalid_zone_condition(nvme_cid(req),
+                                                  nvme_zslba(zone),
+                                                  nvme_zs(zone));
+        return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+    }
+
+    status = nvme_check_mdts(n, nvme_ns_zdes_bytes(ns));
+    if (status) {
+        return status;
+    }
+
+    status = nvme_dma(n, zone->zde, nvme_ns_zdes_bytes(ns),
+                      DMA_DIRECTION_TO_DEVICE, req);
+    if (status) {
+        return status;
+    }
+
+    status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSC);
+    if (status) {
+        return status;
+    }
+
+    if (nvme_zns_commit_zde(ns, zone) < 0) {
+        return NVME_INTERNAL_DEV_ERROR;
+    }
+
+    NVME_ZA_SET(zone->zd->za, NVME_ZA_ZDEV);
+
+    if (nvme_zns_commit_zone(ns, zone) < 0) {
+        return NVME_INTERNAL_DEV_ERROR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_zone_mgmt_send_all(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeZoneManagementSendCmd *send = (NvmeZoneManagementSendCmd *) &req->cmd;
+    NvmeNamespace *ns = req->ns;
+    NvmeZone *zone;
+
+    uint16_t status = NVME_SUCCESS;
+
+    trace_pci_nvme_zone_mgmt_send_all(nvme_cid(req), nvme_nsid(ns), send->zsa);
+
+    switch (send->zsa) {
+    case NVME_CMD_ZONE_MGMT_SEND_SET_ZDE:
+        return NVME_INVALID_FIELD | NVME_DNR;
+
+    case NVME_CMD_ZONE_MGMT_SEND_CLOSE:
+        for (int i = 0; i < ns->zns.num_zones; i++) {
+            zone = &ns->zns.zones[i];
+
+            switch (nvme_zs(zone)) {
+            case NVME_ZS_ZSIO:
+            case NVME_ZS_ZSEO:
+                status = nvme_zone_mgmt_send_close(n, req, zone);
+                if (status) {
+                    return status;
+                }
+
+            default:
+                continue;
+            }
+        }
+
+        break;
+
+    case NVME_CMD_ZONE_MGMT_SEND_FINISH:
+        for (int i = 0; i < ns->zns.num_zones; i++) {
+            zone = &ns->zns.zones[i];
+
+            switch (nvme_zs(zone)) {
+            case NVME_ZS_ZSIO:
+            case NVME_ZS_ZSEO:
+            case NVME_ZS_ZSC:
+                status = nvme_zone_mgmt_send_finish(n, req, zone);
+                if (status) {
+                    return status;
+                }
+
+            default:
+                continue;
+            }
+        }
+
+        break;
+
+    case NVME_CMD_ZONE_MGMT_SEND_OPEN:
+        for (int i = 0; i < ns->zns.num_zones; i++) {
+            zone = &ns->zns.zones[i];
+
+            if (nvme_zs(zone) == NVME_ZS_ZSC) {
+                status = nvme_zone_mgmt_send_open(n, req, zone);
+                if (status) {
+                    return status;
+                }
+            }
+        }
+
+        break;
+
+    case NVME_CMD_ZONE_MGMT_SEND_RESET:
+        for (int i = 0; i < ns->zns.num_zones; i++) {
+            zone = &ns->zns.zones[i];
+
+            switch (nvme_zs(zone)) {
+            case NVME_ZS_ZSIO:
+            case NVME_ZS_ZSEO:
+            case NVME_ZS_ZSC:
+            case NVME_ZS_ZSF:
+                status = nvme_zone_mgmt_send_reset(n, req, zone);
+                if (status) {
+                    return status;
+                }
+
+            default:
+                continue;
+            }
+        }
+
+        break;
+
+    case NVME_CMD_ZONE_MGMT_SEND_OFFLINE:
+        for (int i = 0; i < ns->zns.num_zones; i++) {
+            zone = &ns->zns.zones[i];
+
+            if (nvme_zs(zone) == NVME_ZS_ZSRO) {
+                status = nvme_zone_mgmt_send_offline(n, req, zone);
+                if (status) {
+                    return status;
+                }
+            }
+        }
+
+        break;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_zone_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeZoneManagementSendCmd *send = (NvmeZoneManagementSendCmd *) &req->cmd;
+    NvmeZoneManagementSendAction zsa = send->zsa;
+    NvmeNamespace *ns = req->ns;
+    NvmeZone *zone;
+    uint64_t zslba = le64_to_cpu(send->slba);
+
+    if (!nvme_ns_zoned(ns)) {
+        return NVME_INVALID_OPCODE | NVME_DNR;
+    }
+
+    trace_pci_nvme_zone_mgmt_send(nvme_cid(req), ns->params.nsid, zslba, zsa,
+                                  send->zsflags);
+
+    if (NVME_CMD_ZONE_MGMT_SEND_SELECT_ALL(send->zsflags)) {
+        return nvme_zone_mgmt_send_all(n, req);
+    }
+
+    if (zslba & (nvme_ns_zsze(ns) - 1)) {
+        trace_pci_nvme_err_invalid_zslba(nvme_cid(req), zslba);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    zone = nvme_ns_get_zone(ns, zslba);
+    if (!zone) {
+        trace_pci_nvme_err_invalid_zone(nvme_cid(req), zslba);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    switch (zsa) {
+    case NVME_CMD_ZONE_MGMT_SEND_CLOSE:
+        return nvme_zone_mgmt_send_close(n, req, zone);
+    case NVME_CMD_ZONE_MGMT_SEND_FINISH:
+        return nvme_zone_mgmt_send_finish(n, req, zone);
+    case NVME_CMD_ZONE_MGMT_SEND_OPEN:
+        return nvme_zone_mgmt_send_open(n, req, zone);
+    case NVME_CMD_ZONE_MGMT_SEND_RESET:
+        return nvme_zone_mgmt_send_reset(n, req, zone);
+    case NVME_CMD_ZONE_MGMT_SEND_OFFLINE:
+        return nvme_zone_mgmt_send_offline(n, req, zone);
+    case NVME_CMD_ZONE_MGMT_SEND_SET_ZDE:
+        return nvme_zone_mgmt_send_set_zde(n, req, zone);
+    }
+
+    return NVME_INVALID_FIELD | NVME_DNR;
+}
+
 static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeZoneManagementRecvCmd *recv;
@@ -1478,17 +1988,19 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
             }
 
             switch (nvme_zs(zone)) {
-            case NVME_ZS_ZSE:
-            case NVME_ZS_ZSC:
-                nvme_zs_set(zone, NVME_ZS_ZSIO);
+            case NVME_ZS_ZSIO:
+            case NVME_ZS_ZSEO:
+                break;
+            default:
+                status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSIO);
+                if (status) {
+                    goto invalid;
+                }
 
                 if (nvme_zns_commit_zone(req->ns, zone) < 0) {
                     status = NVME_INTERNAL_DEV_ERROR;
                     goto invalid;
                 }
-
-            default:
-                break;
             }
 
             zone->wp_staging += nlb;
@@ -1541,6 +2053,8 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
         return nvme_rwz(n, req);
+    case NVME_CMD_ZONE_MGMT_SEND:
+        return nvme_zone_mgmt_send(n, req);
     case NVME_CMD_ZONE_MGMT_RECV:
         return nvme_zone_mgmt_recv(n, req);
     default:
diff --git a/hw/block/trace-events b/hw/block/trace-events
index a2671dadb1e8..d6342f5c555d 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -42,8 +42,18 @@ pci_nvme_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode, cons
 pci_nvme_admin_cmd(uint16_t cid, uint16_t sqid, uint8_t opcode, const char *opname) "cid %"PRIu16" sqid %"PRIu16" opc 0x%"PRIx8" opname '%s'"
 pci_nvme_rwz(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t len, uint64_t lba) "cid %"PRIu16" opname '%s' nsid %"PRIu32" nlb %"PRIu32" len %"PRIu64" lba 0x%"PRIx64""
 pci_nvme_rw_cb(uint16_t cid, const char *blkname) "cid %"PRIu16" blk '%s'"
+pci_nvme_zone_mgmt_send(uint16_t cid, uint32_t nsid, uint64_t zslba, uint8_t zsa, uint8_t zsflags) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" zsa 0x%"PRIx8" zsflags 0x%"PRIx8""
+pci_nvme_zone_mgmt_send_all(uint16_t cid, uint32_t nsid, uint8_t za) "cid %"PRIu16" nsid %"PRIu32" za 0x%"PRIx8""
+pci_nvme_zone_mgmt_send_close(uint16_t cid, uint32_t nsid, uint64_t zslba, const char *zc) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" zc \"%s\""
+pci_nvme_zone_mgmt_send_finish(uint16_t cid, uint32_t nsid, uint64_t zslba, const char *zc) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" zc \"%s\""
+pci_nvme_zone_mgmt_send_open(uint16_t cid, uint32_t nsid, uint64_t zslba, const char *zc) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" zc \"%s\""
+pci_nvme_zone_mgmt_send_reset(uint16_t cid, uint32_t nsid, uint64_t zslba, const char *zc) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" zc \"%s\""
+pci_nvme_zone_mgmt_send_reset_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
+pci_nvme_zone_mgmt_send_offline(uint16_t cid, uint32_t nsid, uint64_t zslba, const char *zc) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" zc \"%s\""
+pci_nvme_zone_mgmt_send_set_zde(uint16_t cid, uint32_t nsid, uint64_t zslba, const char *zc) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" zc \"%s\""
 pci_nvme_zone_mgmt_recv(uint16_t cid, uint32_t nsid, uint64_t slba, uint64_t len, uint8_t zra, uint8_t zrasp, uint8_t zrasf) "cid %"PRIu16" nsid %"PRIu32" slba 0x%"PRIx64" len %"PRIu64" zra 0x%"PRIx8" zrasp 0x%"PRIx8" zrasf 0x%"PRIx8""
 pci_nvme_allocate(uint32_t ns, uint64_t slba, uint32_t nlb) "nsid %"PRIu32" slba 0x%"PRIx64" nlb %"PRIu32""
+pci_nvme_deallocate(uint32_t ns, uint64_t slba, uint32_t nlb) "nsid %"PRIu32" slba 0x%"PRIx64" nlb %"PRIu32""
 pci_nvme_do_aio(uint16_t cid, uint8_t opc, const char *opname, const char *blkname, int64_t offset, size_t len) "cid %"PRIu16" opc 0x%"PRIx8" opname '%s' blk '%s' offset %"PRId64" len %zu"
 pci_nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
 pci_nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
@@ -134,6 +144,7 @@ pci_nvme_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx
 pci_nvme_err_invalid_log_page(uint16_t cid, uint16_t lid) "cid %"PRIu16" lid 0x%"PRIx16""
 pci_nvme_err_invalid_zone(uint16_t cid, uint64_t lba) "cid %"PRIu16" lba 0x%"PRIx64""
 pci_nvme_err_invalid_zone_condition(uint16_t cid, uint64_t zslba, uint8_t condition) "cid %"PRIu16" zslba 0x%"PRIx64" condition 0x%"PRIx8""
+pci_nvme_err_invalid_zslba(uint16_t cid, uint64_t zslba) "cid %"PRIu16" zslba 0x%"PRIx64""
 pci_nvme_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
 pci_nvme_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
 pci_nvme_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
-- 
2.28.0




* [PATCH 12/16] hw/block/nvme: add the zone append command
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (10 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 11/16] hw/block/nvme: add the zone management send command Klaus Jensen
@ 2020-09-24 20:45 ` Klaus Jensen
  2020-09-24 20:45 ` [PATCH 13/16] hw/block/nvme: track and enforce zone resources Klaus Jensen
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Add the Zone Append command.

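An append is executed as a write at the zone's current (staged) write
pointer, and the LBA actually written is returned to the host in the
first eight bytes of the completion queue entry. The new zns.zasl
controller parameter expresses the Zone Append Size Limit as a power of
two in units of the memory page size; if zero, MDTS applies instead.
For example, with the default 4 KiB page size, zns.zasl=3 caps a single
append at 4 KiB << 3 = 32 KiB. A hypothetical invocation (other
controller parameters elided):

    -device nvme,serial=deadbeef,zns.zasl=3
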
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.h       |  6 +++++
 include/block/nvme.h  |  7 ++++++
 hw/block/nvme.c       | 52 +++++++++++++++++++++++++++++++++++++++++--
 hw/block/trace-events |  1 +
 4 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index c704663e0a3e..8cd2d936548e 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -16,6 +16,10 @@ typedef struct NvmeParams {
     uint32_t aer_max_queued;
     uint8_t  mdts;
     bool     use_intel_id;
+
+    struct {
+        uint8_t zasl;
+    } zns;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -41,6 +45,7 @@ static inline bool nvme_req_is_write(NvmeRequest *req)
     switch (req->cmd.opcode) {
     case NVME_CMD_WRITE:
     case NVME_CMD_WRITE_ZEROES:
+    case NVME_CMD_ZONE_APPEND:
         return true;
     default:
         return false;
@@ -73,6 +78,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
     case NVME_CMD_WRITE_ZEROES:     return "NVME_NVM_CMD_WRITE_ZEROES";
     case NVME_CMD_ZONE_MGMT_SEND:   return "NVME_ZONED_CMD_ZONE_MGMT_SEND";
     case NVME_CMD_ZONE_MGMT_RECV:   return "NVME_ZONED_CMD_ZONE_MGMT_RECV";
+    case NVME_CMD_ZONE_APPEND:      return "NVME_ZONED_CMD_ZONE_APPEND";
     default:                        return "NVME_NVM_CMD_UNKNOWN";
     }
 }
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 967b42eb5da7..5f8914f594f4 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -483,6 +483,7 @@ enum NvmeIoCommands {
     NVME_CMD_DSM                = 0x09,
     NVME_CMD_ZONE_MGMT_SEND     = 0x79,
     NVME_CMD_ZONE_MGMT_RECV     = 0x7a,
+    NVME_CMD_ZONE_APPEND        = 0x7d,
 };
 
 typedef struct QEMU_PACKED NvmeDeleteQ {
@@ -1018,6 +1019,11 @@ enum NvmeIdCtrlLpa {
     NVME_LPA_EXTENDED = 1 << 2,
 };
 
+typedef struct QEMU_PACKED NvmeIdCtrlZns {
+    uint8_t zasl;
+    uint8_t rsvd1[4095];
+} NvmeIdCtrlZns;
+
 #define NVME_CTRL_SQES_MIN(sqes) ((sqes) & 0xf)
 #define NVME_CTRL_SQES_MAX(sqes) (((sqes) >> 4) & 0xf)
 #define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
@@ -1242,6 +1248,7 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrlZns) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsNvm) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeIdNsZns) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 51611ed32600..294bc2fb719d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -166,6 +166,8 @@ static const NvmeEffectsLog nvme_effects[NVME_IOCS_MAX] = {
             [NVME_CMD_ZONE_MGMT_RECV] = NVME_EFFECTS_CSUPP,
             [NVME_CMD_ZONE_MGMT_SEND] = NVME_EFFECTS_CSUPP |
                                         NVME_EFFECTS_LBCC,
+            [NVME_CMD_ZONE_APPEND]    = NVME_EFFECTS_CSUPP |
+                                        NVME_EFFECTS_LBCC,
         },
     },
 };
@@ -1041,6 +1043,21 @@ static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len)
     return NVME_SUCCESS;
 }
 
+static inline uint16_t nvme_check_zasl(NvmeCtrl *n, size_t len)
+{
+    uint8_t zasl = n->params.zns.zasl;
+
+    if (!zasl) {
+        return nvme_check_mdts(n, len);
+    }
+
+    if (len > n->page_size << zasl) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
 static inline uint16_t nvme_check_bounds(NvmeCtrl *n, NvmeNamespace *ns,
                                          uint64_t slba, uint32_t nlb)
 {
@@ -1393,6 +1410,7 @@ static uint16_t nvme_do_aio(BlockBackend *blk, int64_t offset, size_t len,
         break;
 
     case NVME_CMD_WRITE:
+    case NVME_CMD_ZONE_APPEND:
         is_write = true;
 
         /* fallthrough */
@@ -1928,7 +1946,7 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
     uint32_t nlb = (uint32_t)le16_to_cpu(rw->nlb) + 1;
     size_t len = nvme_l2b(ns, nlb);
 
-    bool is_write = nvme_req_is_write(req);
+    bool is_append = false, is_write = nvme_req_is_write(req);
     uint16_t status;
 
     trace_pci_nvme_rwz(nvme_cid(req), nvme_io_opc_str(rw->opcode),
@@ -1942,6 +1960,25 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
             goto invalid;
         }
 
+        if (req->cmd.opcode == NVME_CMD_ZONE_APPEND) {
+            is_append = true;
+            uint64_t wp = zone->wp_staging;
+
+            if (slba != nvme_zslba(zone)) {
+                trace_pci_nvme_err_invalid_zslba(nvme_cid(req), slba);
+                return NVME_INVALID_FIELD | NVME_DNR;
+            }
+
+            status = nvme_check_zasl(n, len);
+            if (status) {
+                trace_pci_nvme_err_zasl(nvme_cid(req), len);
+                goto invalid;
+            }
+
+            slba = wp;
+            rw->slba = req->cqe.qw0 = cpu_to_le64(wp);
+        }
+
         status = nvme_check_zone(n, slba, nlb, req, zone);
         if (status) {
             goto invalid;
@@ -1980,7 +2017,7 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
 
     if (is_write) {
         if (zone) {
-            if (zone->wp_staging != nvme_wp(zone)) {
+            if (!is_append && (zone->wp_staging != nvme_wp(zone))) {
                 trace_pci_nvme_err_zone_pending_writes(nvme_cid(req),
                                                        nvme_zslba(zone),
                                                        nvme_wp(zone),
@@ -2052,6 +2089,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
     case NVME_CMD_WRITE_ZEROES:
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
+    case NVME_CMD_ZONE_APPEND:
         return nvme_rwz(n, req);
     case NVME_CMD_ZONE_MGMT_SEND:
         return nvme_zone_mgmt_send(n, req);
@@ -3718,6 +3756,11 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
         return;
     }
 
+    if (params->zns.zasl && params->zns.zasl > params->mdts) {
+        error_setg(errp, "zns.zasl must be less than or equal to mdts");
+        return;
+    }
+
     if (!n->params.cmb_size_mb && n->pmrdev) {
         if (host_memory_backend_is_mapped(n->pmrdev)) {
             error_setg(errp, "can't use already busy memdev: %s",
@@ -3894,12 +3937,16 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 {
     NvmeIdCtrl *id = &n->id_ctrl;
+    NvmeIdCtrlZns *id_zns;
     uint8_t *pci_conf = pci_dev->config;
     char *subnqn;
 
     n->id_ctrl_iocss[NVME_IOCS_NVM] = g_new0(NvmeIdCtrl, 1);
     n->id_ctrl_iocss[NVME_IOCS_ZONED] = g_new0(NvmeIdCtrl, 1);
 
+    id_zns = n->id_ctrl_iocss[NVME_IOCS_ZONED];
+    id_zns->zasl = n->params.zns.zasl;
+
     id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
     id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
     strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
@@ -4036,6 +4083,7 @@ static Property nvme_props[] = {
     DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64),
     DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
     DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
+    DEFINE_PROP_UINT8("zns.zasl", NvmeCtrl, params.zns.zasl, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/trace-events b/hw/block/trace-events
index d6342f5c555d..929409b79b41 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -103,6 +103,7 @@ pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
 
 # nvme traces for error conditions
 pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
+pci_nvme_err_zasl(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
 pci_nvme_err_req_status(uint16_t cid, uint32_t nsid, uint16_t status, uint8_t opc) "cid %"PRIu16" nsid %"PRIu32" status 0x%"PRIx16" opc 0x%"PRIx8""
 pci_nvme_err_dulbe(uint16_t cid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" slba 0x%"PRIx64" nlb %"PRIu32""
 pci_nvme_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
-- 
2.28.0




* [PATCH 13/16] hw/block/nvme: track and enforce zone resources
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (11 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 12/16] hw/block/nvme: add the zone append command Klaus Jensen
@ 2020-09-24 20:45 ` Klaus Jensen
  2020-09-24 20:45 ` [PATCH 14/16] hw/block/nvme: allow open to close transitions by controller Klaus Jensen
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Track and enforce the number of open and active zone resources.

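Both limits are 0's based, so zns.mar=3 and zns.mor=1 permit at most 4
active and 2 open zones; the default of 0xffffffff imposes no limit. A
hypothetical invocation (other nvme-ns parameters elided):

    -device nvme-ns,drive=zns0,zns.zcap=4096,zns.mar=3,zns.mor=1

Commands that would exceed a limit fail with Too Many Active Zones or
Too Many Open Zones respectively.
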
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 docs/specs/nvme.txt  |  7 +++++
 hw/block/nvme-ns.h   |  7 +++++
 include/block/nvme.h |  2 ++
 hw/block/nvme-ns.c   | 25 +++++++++++++++--
 hw/block/nvme.c      | 65 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 82def65d4c78..921dbce2dc01 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -30,6 +30,13 @@ nvme-ns Options
      zns.zcap; if the zone capacity is a power of two, the zone size will be
      set to that, otherwise it will default to the next power of two.
 
+  `zns.mar`; Specifies the maximum number of active resources available.
+     This is a 0's based value.
+
+  `zns.mor`; Specifies the maximum number of open resources available.
+     This is a 0's based value.
+
+
 Reference Specifications
 ------------------------
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 5a695334a052..f520ffa89c98 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -28,6 +28,8 @@ typedef struct NvmeNamespaceParams {
         uint64_t zcap;
         uint64_t zsze;
         uint8_t  zdes;
+        uint32_t mar;
+        uint32_t mor;
     } zns;
 } NvmeNamespaceParams;
 
@@ -71,6 +73,11 @@ typedef struct NvmeNamespace {
         NvmeZone           *zones;
         NvmeZoneDescriptor *zd;
         uint8_t            *zde;
+
+        struct {
+            uint32_t open;
+            uint32_t active;
+        } resources;
     } zns;
 } NvmeNamespace;
 
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 5f8914f594f4..d51f397e7ff1 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -776,6 +776,8 @@ enum NvmeStatusCodes {
     NVME_ZONE_IS_READ_ONLY      = 0x01ba,
     NVME_ZONE_IS_OFFLINE        = 0x01bb,
     NVME_ZONE_INVALID_WRITE     = 0x01bc,
+    NVME_TOO_MANY_ACTIVE_ZONES  = 0x01bd,
+    NVME_TOO_MANY_OPEN_ZONES    = 0x01be,
     NVME_INVALID_ZONE_STATE_TRANSITION = 0x01bf,
     NVME_WRITE_FAULT            = 0x0280,
     NVME_UNRECOVERED_READ       = 0x0281,
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 0fb196c7103e..588fe7a1f018 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -119,8 +119,13 @@ static void nvme_ns_init_zoned(NvmeNamespace *ns)
 
     id_ns->ncap = ns->zns.num_zones * ns->params.zns.zcap;
 
-    id_ns_zns->mar = 0xffffffff;
-    id_ns_zns->mor = 0xffffffff;
+    id_ns_zns->mar = cpu_to_le32(ns->params.zns.mar);
+    id_ns_zns->mor = cpu_to_le32(ns->params.zns.mor);
+
+    ns->zns.resources.active = ns->params.zns.mar != 0xffffffff ?
+        ns->params.zns.mar + 1 : ns->zns.num_zones;
+    ns->zns.resources.open = ns->params.zns.mor != 0xffffffff ?
+        ns->params.zns.mor + 1 : ns->zns.num_zones;
 }
 
 static void nvme_ns_init(NvmeNamespace *ns)
@@ -249,9 +254,15 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
                     if (nvme_wp(zone) == nvme_zslba(zone) &&
                         !(zone->zd->za & NVME_ZA_ZDEV)) {
                         nvme_zs_set(zone, NVME_ZS_ZSE);
+                        continue;
                     }
 
-                    continue;
+                    if (ns->zns.resources.active) {
+                        ns->zns.resources.active--;
+                        continue;
+                    }
+
+                    /* fallthrough */
 
                 case NVME_ZS_ZSIO:
                 case NVME_ZS_ZSEO:
@@ -333,6 +344,12 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
             return -1;
         }
 
+        if (ns->params.zns.mor > ns->params.zns.mar) {
+            error_setg(errp, "maximum open resources (zns.mor) must be less "
+                       "than or equal to maximum active resources (zns.mar)");
+            return -1;
+        }
+
         break;
 
     default:
@@ -418,6 +435,8 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT64("zns.zcap", NvmeNamespace, params.zns.zcap, 0),
     DEFINE_PROP_UINT64("zns.zsze", NvmeNamespace, params.zns.zsze, 0),
     DEFINE_PROP_UINT8("zns.zdes", NvmeNamespace, params.zns.zdes, 0),
+    DEFINE_PROP_UINT32("zns.mar", NvmeNamespace, params.zns.mar, 0xffffffff),
+    DEFINE_PROP_UINT32("zns.mor", NvmeNamespace, params.zns.mor, 0xffffffff),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 294bc2fb719d..79732b8a8574 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1204,6 +1204,40 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
 
     switch (from) {
     case NVME_ZS_ZSE:
+        switch (to) {
+        case NVME_ZS_ZSF:
+        case NVME_ZS_ZSRO:
+        case NVME_ZS_ZSO:
+            break;
+
+        case NVME_ZS_ZSC:
+            if (!ns->zns.resources.active) {
+                return NVME_TOO_MANY_ACTIVE_ZONES;
+            }
+
+            ns->zns.resources.active--;
+
+            break;
+
+        case NVME_ZS_ZSIO:
+        case NVME_ZS_ZSEO:
+            if (!ns->zns.resources.active) {
+                return NVME_TOO_MANY_ACTIVE_ZONES;
+            }
+
+            if (!ns->zns.resources.open) {
+                return NVME_TOO_MANY_OPEN_ZONES;
+            }
+
+            ns->zns.resources.active--;
+            ns->zns.resources.open--;
+
+            break;
+
+        default:
+            return NVME_INVALID_ZONE_STATE_TRANSITION | NVME_DNR;
+        }
+
         break;
 
     case NVME_ZS_ZSIO:
@@ -1224,7 +1258,13 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
 
         case NVME_ZS_ZSF:
         case NVME_ZS_ZSRO:
+            ns->zns.resources.active++;
+
+            /* fallthrough */
+
         case NVME_ZS_ZSC:
+            ns->zns.resources.open++;
+
             break;
 
         default:
@@ -1247,8 +1287,18 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
 
         case NVME_ZS_ZSF:
         case NVME_ZS_ZSRO:
+            ns->zns.resources.active++;
+
+            break;
+
         case NVME_ZS_ZSIO:
         case NVME_ZS_ZSEO:
+            if (!ns->zns.resources.open) {
+                return NVME_TOO_MANY_OPEN_ZONES;
+            }
+
+            ns->zns.resources.open--;
+
             break;
 
         default:
@@ -1652,6 +1702,7 @@ static uint16_t nvme_zone_mgmt_send_all(NvmeCtrl *n, NvmeRequest *req)
     NvmeZoneManagementSendCmd *send = (NvmeZoneManagementSendCmd *) &req->cmd;
     NvmeNamespace *ns = req->ns;
     NvmeZone *zone;
+    int count;
 
     uint16_t status = NVME_SUCCESS;
 
@@ -1701,6 +1752,20 @@ static uint16_t nvme_zone_mgmt_send_all(NvmeCtrl *n, NvmeRequest *req)
         break;
 
     case NVME_CMD_ZONE_MGMT_SEND_OPEN:
+        count = 0;
+
+        for (int i = 0; i < ns->zns.num_zones; i++) {
+            zone = &ns->zns.zones[i];
+
+            if (nvme_zs(zone) == NVME_ZS_ZSC) {
+                count++;
+            }
+        }
+
+        if (count > ns->zns.resources.open) {
+            return NVME_TOO_MANY_OPEN_ZONES;
+        }
+
         for (int i = 0; i < ns->zns.num_zones; i++) {
             zone = &ns->zns.zones[i];
 
-- 
2.28.0




* [PATCH 14/16] hw/block/nvme: allow open to close transitions by controller
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (12 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 13/16] hw/block/nvme: track and enforce zone resources Klaus Jensen
@ 2020-09-24 20:45 ` Klaus Jensen
  2020-09-24 20:45 ` [PATCH 15/16] hw/block/nvme: support zone active excursions Klaus Jensen
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Allow the controller to release open resources by transitioning
implicitly and explicitly opened zones to closed. This is done using a
naive "least recently opened" strategy.

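For example, with zns.mor=0 (a single open resource), a write that
implicitly opens zone A followed by a write to zone B no longer fails
with Too Many Open Zones: provided zone A has no writes in flight (its
staged and committed write pointers match), the controller closes A,
being the least recently opened implicitly opened zone, and opens B in
its place. Explicitly opened zones are never picked as victims, so a
host can keep a zone open with an explicit Zone Management Send Open.
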
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme-ns.h    |   5 +++
 hw/block/nvme-ns.c    |   5 +++
 hw/block/nvme.c       | 102 +++++++++++++++++++++++++++++++++++-------
 hw/block/trace-events |   5 +++
 4 files changed, 102 insertions(+), 15 deletions(-)

diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index f520ffa89c98..1fdcdf706ff6 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -38,6 +38,8 @@ typedef struct NvmeZone {
     uint8_t            *zde;
 
     uint64_t wp_staging;
+
+    QTAILQ_ENTRY(NvmeZone) lru_entry;
 } NvmeZone;
 
 typedef struct NvmeNamespace {
@@ -77,6 +79,9 @@ typedef struct NvmeNamespace {
         struct {
             uint32_t open;
             uint32_t active;
+
+            QTAILQ_HEAD(, NvmeZone) lru_open;
+            QTAILQ_HEAD(, NvmeZone) lru_active;
         } resources;
     } zns;
 } NvmeNamespace;
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 588fe7a1f018..547090282660 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -126,6 +126,9 @@ static void nvme_ns_init_zoned(NvmeNamespace *ns)
         ns->params.zns.mar + 1 : ns->zns.num_zones;
     ns->zns.resources.open = ns->params.zns.mor != 0xffffffff ?
         ns->params.zns.mor + 1 : ns->zns.num_zones;
+
+    QTAILQ_INIT(&ns->zns.resources.lru_open);
+    QTAILQ_INIT(&ns->zns.resources.lru_active);
 }
 
 static void nvme_ns_init(NvmeNamespace *ns)
@@ -259,6 +262,8 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
 
                     if (ns->zns.resources.active) {
                         ns->zns.resources.active--;
+                        QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_active, zone,
+                                           lru_entry);
                         continue;
                     }
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 79732b8a8574..a43a593ab89e 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1192,12 +1192,61 @@ static inline void nvme_zone_reset_wp(NvmeZone *zone)
     zone->wp_staging = nvme_zslba(zone);
 }
 
-static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
-                                    NvmeZoneState to)
+static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
+                                    NvmeZone *zone, NvmeZoneState to,
+                                    NvmeRequest *req);
+
+static uint16_t nvme_zrm_release_open(NvmeCtrl *n, NvmeNamespace *ns,
+                                      NvmeRequest *req)
+{
+    NvmeZone *candidate;
+    NvmeZoneState zs;
+    uint16_t status;
+
+    trace_pci_nvme_zone_zrm_release_open(nvme_cid(req), ns->params.nsid);
+
+    QTAILQ_FOREACH(candidate, &ns->zns.resources.lru_open, lru_entry) {
+        zs = nvme_zs(candidate);
+
+        trace_pci_nvme_zone_zrm_candidate(nvme_cid(req), ns->params.nsid,
+                                          nvme_zslba(candidate),
+                                          nvme_wp(candidate), zs);
+
+        /* skip explicitly opened zones */
+        if (zs == NVME_ZS_ZSEO) {
+            continue;
+        }
+
+        /* the zone cannot be closed if it is currently writing */
+        if (candidate->wp_staging != nvme_wp(candidate)) {
+            continue;
+        }
+
+        status = nvme_zrm_transition(n, ns, candidate, NVME_ZS_ZSC, req);
+        if (status) {
+            return status;
+        }
+
+        if (nvme_zns_commit_zone(ns, candidate) < 0) {
+            return NVME_INTERNAL_DEV_ERROR;
+        }
+
+        return NVME_SUCCESS;
+    }
+
+    return NVME_TOO_MANY_OPEN_ZONES;
+}
+
+static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
+                                    NvmeZone *zone, NvmeZoneState to,
+                                    NvmeRequest *req)
 {
     NvmeZoneState from = nvme_zs(zone);
+    uint16_t status;
+
+    trace_pci_nvme_zone_zrm_transition(nvme_cid(req), ns->params.nsid,
+                                       nvme_zslba(zone), nvme_zs(zone), to);
 
-    /* fast path */
     if (from == to) {
         return NVME_SUCCESS;
     }
@@ -1212,25 +1261,32 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
 
         case NVME_ZS_ZSC:
             if (!ns->zns.resources.active) {
+                trace_pci_nvme_err_too_many_active_zones(nvme_cid(req));
                 return NVME_TOO_MANY_ACTIVE_ZONES;
             }
 
             ns->zns.resources.active--;
+            QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_active, zone, lru_entry);
 
             break;
 
         case NVME_ZS_ZSIO:
         case NVME_ZS_ZSEO:
             if (!ns->zns.resources.active) {
+                trace_pci_nvme_err_too_many_active_zones(nvme_cid(req));
                 return NVME_TOO_MANY_ACTIVE_ZONES;
             }
 
             if (!ns->zns.resources.open) {
-                return NVME_TOO_MANY_OPEN_ZONES;
+                status = nvme_zrm_release_open(n, ns, req);
+                if (status) {
+                    return status;
+                }
             }
 
             ns->zns.resources.active--;
             ns->zns.resources.open--;
+            QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_open, zone, lru_entry);
 
             break;
 
@@ -1259,11 +1315,15 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
         case NVME_ZS_ZSF:
         case NVME_ZS_ZSRO:
             ns->zns.resources.active++;
+            ns->zns.resources.open++;
+            QTAILQ_REMOVE(&ns->zns.resources.lru_open, zone, lru_entry);
 
-            /* fallthrough */
+            break;
 
         case NVME_ZS_ZSC:
             ns->zns.resources.open++;
+            QTAILQ_REMOVE(&ns->zns.resources.lru_open, zone, lru_entry);
+            QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_active, zone, lru_entry);
 
             break;
 
@@ -1288,16 +1348,22 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
         case NVME_ZS_ZSF:
         case NVME_ZS_ZSRO:
             ns->zns.resources.active++;
+            QTAILQ_REMOVE(&ns->zns.resources.lru_active, zone, lru_entry);
 
             break;
 
         case NVME_ZS_ZSIO:
         case NVME_ZS_ZSEO:
             if (!ns->zns.resources.open) {
-                return NVME_TOO_MANY_OPEN_ZONES;
+                status = nvme_zrm_release_open(n, ns, req);
+                if (status) {
+                    return status;
+                }
             }
 
             ns->zns.resources.open--;
+            QTAILQ_REMOVE(&ns->zns.resources.lru_active, zone, lru_entry);
+            QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_open, zone, lru_entry);
 
             break;
 
@@ -1321,6 +1387,9 @@ static uint16_t nvme_zrm_transition(NvmeNamespace *ns, NvmeZone *zone,
 
     case NVME_ZS_ZSF:
         switch (to) {
+        case NVME_ZS_ZSF:
+            return NVME_SUCCESS;
+
         case NVME_ZS_ZSE:
             nvme_zone_reset_wp(zone);
 
@@ -1359,7 +1428,9 @@ static void nvme_zns_advance_wp(NvmeRequest *req)
     wp += nlb;
     zone->zd->wp = cpu_to_le64(wp);
     if (wp == nvme_zslba(zone) + nvme_zcap(zone)) {
-        nvme_zrm_transition(req->ns, zone, NVME_ZS_ZSF);
+        NvmeCtrl *n = nvme_ctrl(req);
+
+        nvme_zrm_transition(n, req->ns, zone, NVME_ZS_ZSF, req);
         if (nvme_zns_commit_zone(req->ns, zone) < 0) {
             req->status = NVME_INTERNAL_DEV_ERROR;
         }
@@ -1416,6 +1487,7 @@ static void nvme_rw_cb(void *opaque, int ret)
             uint64_t slba = le64_to_cpu(rw->slba);
 
             NvmeZone *zone = nvme_ns_get_zone(ns, slba);
+            NvmeCtrl *n = nvme_ctrl(req);
 
             /*
              * Transition the zone to read-only on write fault and offline
@@ -1424,7 +1496,7 @@ static void nvme_rw_cb(void *opaque, int ret)
             NvmeZoneState zs = status == NVME_WRITE_FAULT ?
                 NVME_ZS_ZSRO : NVME_ZS_ZSO;
 
-            nvme_zrm_transition(ns, zone, zs);
+            nvme_zrm_transition(n, ns, zone, zs, req);
             if (nvme_zns_commit_zone(ns, zone) < 0) {
                 req->status = NVME_INTERNAL_DEV_ERROR;
             }
@@ -1518,7 +1590,7 @@ static uint16_t nvme_zone_mgmt_send_close(NvmeCtrl *n, NvmeRequest *req,
         break;
     }
 
-    status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSC);
+    status = nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSC, req);
     if (status) {
         return status;
     }
@@ -1543,7 +1615,7 @@ static uint16_t nvme_zone_mgmt_send_finish(NvmeCtrl *n, NvmeRequest *req,
         return NVME_SUCCESS;
     }
 
-    status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSF);
+    status = nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSF, req);
     if (status) {
         return status;
     }
@@ -1568,7 +1640,7 @@ static uint16_t nvme_zone_mgmt_send_open(NvmeCtrl *n, NvmeRequest *req,
         return NVME_SUCCESS;
     }
 
-    status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSEO);
+    status = nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSEO, req);
     if (status) {
         return status;
     }
@@ -1604,7 +1676,7 @@ static uint16_t nvme_zone_mgmt_send_reset(NvmeCtrl *n, NvmeRequest *req,
             return NVME_INTERNAL_DEV_ERROR;
         }
 
-        nvme_zrm_transition(ns, zone, NVME_ZS_ZSE);
+        nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSE, req);
         if (nvme_zns_commit_zone(ns, zone) < 0) {
             return NVME_INTERNAL_DEV_ERROR;
         }
@@ -1635,7 +1707,7 @@ static uint16_t nvme_zone_mgmt_send_offline(NvmeCtrl *n, NvmeRequest *req,
             return NVME_INTERNAL_DEV_ERROR;
         }
 
-        nvme_zrm_transition(ns, zone, NVME_ZS_ZSO);
+        nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSO, req);
         if (nvme_zns_commit_zone(ns, zone) < 0) {
             return NVME_INTERNAL_DEV_ERROR;
         }
@@ -1679,7 +1751,7 @@ static uint16_t nvme_zone_mgmt_send_set_zde(NvmeCtrl *n, NvmeRequest *req,
         return status;
     }
 
-    status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSC);
+    status = nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSC, req);
     if (status) {
         return status;
     }
@@ -2094,7 +2166,7 @@ static uint16_t nvme_rwz(NvmeCtrl *n, NvmeRequest *req)
             case NVME_ZS_ZSEO:
                 break;
             default:
-                status = nvme_zrm_transition(ns, zone, NVME_ZS_ZSIO);
+                status = nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSIO, req);
                 if (status) {
                     goto invalid;
                 }
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 929409b79b41..18f7b24ef5e9 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -88,6 +88,9 @@ pci_nvme_mmio_read(uint64_t addr) "addr 0x%"PRIx64""
 pci_nvme_mmio_write(uint64_t addr, uint64_t data) "addr 0x%"PRIx64" data 0x%"PRIx64""
 pci_nvme_mmio_doorbell_cq(uint16_t cqid, uint16_t new_head) "cqid %"PRIu16" new_head %"PRIu16""
 pci_nvme_mmio_doorbell_sq(uint16_t sqid, uint16_t new_tail) "sqid %"PRIu16" new_tail %"PRIu16""
+pci_nvme_zone_zrm_transition(uint16_t cid, uint32_t nsid, uint64_t zslba, uint8_t from, uint8_t to) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" from 0x%"PRIx8" to 0x%"PRIx8""
+pci_nvme_zone_zrm_candidate(uint16_t cid, uint32_t nsid, uint64_t zslba, uint64_t wp, uint8_t zc) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" wp 0x%"PRIx64" zc 0x%"PRIx8""
+pci_nvme_zone_zrm_release_open(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
 pci_nvme_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 pci_nvme_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 pci_nvme_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
@@ -115,6 +118,8 @@ pci_nvme_err_zone_is_read_only(uint16_t cid, uint64_t slba) "cid %"PRIu16" lba 0
 pci_nvme_err_zone_invalid_write(uint16_t cid, uint64_t slba, uint64_t wp) "cid %"PRIu16" lba 0x%"PRIx64" wp 0x%"PRIx64""
 pci_nvme_err_zone_boundary(uint16_t cid, uint64_t slba, uint32_t nlb, uint64_t zcap) "cid %"PRIu16" lba 0x%"PRIx64" nlb %"PRIu32" zcap 0x%"PRIx64""
 pci_nvme_err_zone_pending_writes(uint16_t cid, uint64_t zslba, uint64_t wp, uint64_t wp_staging) "cid %"PRIu16" zslba 0x%"PRIx64" wp 0x%"PRIx64" wp_staging 0x%"PRIx64""
+pci_nvme_err_too_many_active_zones(uint16_t cid) "cid %"PRIu16""
+pci_nvme_err_too_many_open_zones(uint16_t cid) "cid %"PRIu16""
 pci_nvme_err_invalid_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
 pci_nvme_err_invalid_num_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
 pci_nvme_err_invalid_sgl_excess_length(uint16_t cid) "cid %"PRIu16""
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 15/16] hw/block/nvme: support zone active excursions
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (13 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 14/16] hw/block/nvme: allow open to close transitions by controller Klaus Jensen
@ 2020-09-24 20:45 ` Klaus Jensen
  2020-09-24 20:45 ` [PATCH 16/16] hw/block/nvme: support reset/finish recommended limits Klaus Jensen
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Allow the controller to release active resources by transitioning zones
to the Full state when Zone Active Excursions are enabled.
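
As a minimal usage sketch (assuming the iocs namespace-type selector and
the zns.* parameters introduced earlier in this series), zone active
excursions are opted into with zns.zoc=0x2:

    -device nvme-ns,drive=nvm0,iocs=0x2,zns.zcap=4096,zns.zoc=0x2

With this set, running out of active resources causes the controller to
finish a least recently used zone (setting the Zone Finished by
Controller attribute and reporting it through the Changed Zone List log
page) instead of failing the command with Too Many Active Zones.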

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 docs/specs/nvme.txt   |   5 ++
 hw/block/nvme-ns.h    |   3 +
 include/block/nvme.h  |  30 ++++++--
 hw/block/nvme-ns.c    |   3 +
 hw/block/nvme.c       | 170 ++++++++++++++++++++++++++++++++++++++----
 hw/block/trace-events |   5 +-
 6 files changed, 197 insertions(+), 19 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index 921dbce2dc01..a286d9c07025 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -36,6 +36,11 @@ nvme-ns Options
   `zns.mor`; Specifies the number of open resources available. This is a 0s
      based value.
 
+  `zns.zoc`; Specifies Zone Operation Characteristics (ZOC). Only 0x2 is
+     supported, which allows Zone Active Excursions. If enabled, the controller
+     will release active resources when needed by transitioning zones to the
+     Full state.
+
 
 Reference Specifications
 ------------------------
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 1fdcdf706ff6..d9093fdad984 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -30,6 +30,7 @@ typedef struct NvmeNamespaceParams {
         uint8_t  zdes;
         uint32_t mar;
         uint32_t mor;
+        uint16_t zoc;
     } zns;
 } NvmeNamespaceParams;
 
@@ -83,6 +84,8 @@ typedef struct NvmeNamespace {
             QTAILQ_HEAD(, NvmeZone) lru_open;
             QTAILQ_HEAD(, NvmeZone) lru_active;
         } resources;
+
+        NvmeChangedZoneList changed_list;
     } zns;
 } NvmeNamespace;
 
diff --git a/include/block/nvme.h b/include/block/nvme.h
index d51f397e7ff1..0948a262b2df 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -687,6 +687,7 @@ typedef struct QEMU_PACKED NvmeDsmRange {
 enum NvmeAsyncEventRequest {
     NVME_AER_TYPE_ERROR                     = 0,
     NVME_AER_TYPE_SMART                     = 1,
+    NVME_AER_TYPE_NOTICE                    = 2,
     NVME_AER_TYPE_IO_SPECIFIC               = 6,
     NVME_AER_TYPE_VENDOR_SPECIFIC           = 7,
     NVME_AER_INFO_ERR_INVALID_DB_REGISTER   = 0,
@@ -698,13 +699,15 @@ enum NvmeAsyncEventRequest {
     NVME_AER_INFO_SMART_RELIABILITY         = 0,
     NVME_AER_INFO_SMART_TEMP_THRESH         = 1,
     NVME_AER_INFO_SMART_SPARE_THRESH        = 2,
+    NVME_AER_INFO_NOTICE_ZONE_DESCR_CHANGED = 0xef,
 };
 
 typedef struct QEMU_PACKED NvmeAerResult {
-    uint8_t event_type;
-    uint8_t event_info;
-    uint8_t log_page;
-    uint8_t resv;
+    uint8_t  event_type;
+    uint8_t  event_info;
+    uint8_t  log_page;
+    uint8_t  resv;
+    uint32_t nsid;
 } NvmeAerResult;
 
 typedef struct QEMU_PACKED NvmeCqe {
@@ -882,6 +885,15 @@ typedef struct QEMU_PACKED NvmeZoneDescriptor {
     uint8_t  rsvd32[32];
 } NvmeZoneDescriptor;
 
+#define NVME_CHANGED_ZONE_LIST_MAX_IDS 511
+
+typedef struct QEMU_PACKED NvmeChangedZoneList {
+    uint16_t num_ids;
+    uint8_t  rsvd2[6];
+    uint64_t ids[NVME_CHANGED_ZONE_LIST_MAX_IDS];
+} NvmeChangedZoneList;
+
+#define NVME_ZA_ZFC  (1 << 0)
 #define NVME_ZA_ZDEV (1 << 7)
 
 #define NVME_ZA_SET(za, attrs)   ((za) |= (attrs))
@@ -996,6 +1008,10 @@ typedef struct QEMU_PACKED NvmeIdCtrl {
     uint8_t     vs[1024];
 } NvmeIdCtrl;
 
+enum NvmeIdCtrlOaes {
+    NVME_OAES_ZDCN = 1 << 27,
+};
+
 enum NvmeIdCtrlOacs {
     NVME_OACS_SECURITY  = 1 << 0,
     NVME_OACS_FORMAT    = 1 << 1,
@@ -1064,6 +1080,7 @@ typedef struct QEMU_PACKED NvmeIdCtrlZns {
 #define NVME_AEC_SMART(aec)         (aec & 0xff)
 #define NVME_AEC_NS_ATTR(aec)       ((aec >> 8) & 0x1)
 #define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
+#define NVME_AEC_ZDCN(aec)          ((aec >> 27) & 0x1)
 
 #define NVME_ERR_REC_TLER(err_rec)  (err_rec & 0xffff)
 #define NVME_ERR_REC_DULBE(err_rec) (err_rec & 0x10000)
@@ -1231,9 +1248,11 @@ typedef struct QEMU_PACKED NvmeIdNsZns {
     uint8_t     vs[256];
 } NvmeIdNsZns;
 
+#define NVME_ID_NS_ZNS_ZOC_ZAE (1 << 1)
+
 static inline void _nvme_check_size(void)
 {
-    QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 4);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeAerResult) != 8);
     QEMU_BUILD_BUG_ON(sizeof(NvmeCqe) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeDsmRange) != 16);
     QEMU_BUILD_BUG_ON(sizeof(NvmeCmd) != 64);
@@ -1258,5 +1277,6 @@ static inline void _nvme_check_size(void)
     QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
     QEMU_BUILD_BUG_ON(sizeof(NvmeZoneDescriptor) != 64);
     QEMU_BUILD_BUG_ON(sizeof(NvmeLBAFE) != 16);
+    QEMU_BUILD_BUG_ON(sizeof(NvmeChangedZoneList) != 4096);
 }
 #endif
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 547090282660..9a9f728d791f 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -104,6 +104,8 @@ static void nvme_ns_init_zoned(NvmeNamespace *ns)
     NvmeIdNsNvm *id_ns = nvme_ns_id_nvm(ns);
     NvmeIdNsZns *id_ns_zns = nvme_ns_id_zoned(ns);
 
+    id_ns_zns->zoc = cpu_to_le16(ns->params.zns.zoc);
+
     for (int i = 0; i <= id_ns->nlbaf; i++) {
         id_ns_zns->lbafe[i].zsze = ns->params.zns.zsze ?
             ns->params.zns.zsze : cpu_to_le64(pow2ceil(ns->params.zns.zcap));
@@ -442,6 +444,7 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT8("zns.zdes", NvmeNamespace, params.zns.zdes, 0),
     DEFINE_PROP_UINT32("zns.mar", NvmeNamespace, params.zns.mar, 0xffffffff),
     DEFINE_PROP_UINT32("zns.mor", NvmeNamespace, params.zns.mor, 0xffffffff),
+    DEFINE_PROP_UINT16("zns.zoc", NvmeNamespace, params.zns.zoc, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a43a593ab89e..ecc88e858eee 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -880,10 +880,11 @@ static void nvme_process_aers(void *opaque)
 
         req = n->aer_reqs[n->outstanding_aers];
 
-        result = (NvmeAerResult *) &req->cqe.dw0;
+        result = (NvmeAerResult *) &req->cqe.qw0;
         result->event_type = event->result.event_type;
         result->event_info = event->result.event_info;
         result->log_page = event->result.log_page;
+        result->nsid = event->result.nsid;
         g_free(event);
 
         trace_pci_nvme_aer_post_cqe(result->event_type, result->event_info,
@@ -893,12 +894,14 @@ static void nvme_process_aers(void *opaque)
     }
 }
 
-static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
-                               uint8_t event_info, uint8_t log_page)
+static void nvme_enqueue_event(NvmeCtrl *n, NvmeNamespace *ns,
+                               uint8_t event_type, uint8_t event_info,
+                               uint8_t log_page)
 {
     NvmeAsyncEvent *event;
 
-    trace_pci_nvme_enqueue_event(event_type, event_info, log_page);
+    trace_pci_nvme_enqueue_event(ns ? ns->params.nsid : -1, event_type,
+                                 event_info, log_page);
 
     if (n->aer_queued == n->params.aer_max_queued) {
         trace_pci_nvme_enqueue_event_noqueue(n->aer_queued);
@@ -912,6 +915,11 @@ static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
         .log_page   = log_page,
     };
 
+    if (event_info == NVME_AER_INFO_NOTICE_ZONE_DESCR_CHANGED) {
+        assert(ns);
+        event->result.nsid = ns->params.nsid;
+    }
+
     QTAILQ_INSERT_TAIL(&n->aer_queue, event, entry);
     n->aer_queued++;
 
@@ -1192,10 +1200,54 @@ static inline void nvme_zone_reset_wp(NvmeZone *zone)
     zone->wp_staging = nvme_zslba(zone);
 }
 
+static void nvme_zone_changed(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone,
+                              NvmeRequest *req)
+{
+    uint16_t num_ids = le16_to_cpu(ns->zns.changed_list.num_ids);
+
+    trace_pci_nvme_zone_changed(nvme_cid(req), ns->params.nsid,
+                                nvme_zslba(zone));
+
+    if (num_ids < NVME_CHANGED_ZONE_LIST_MAX_IDS) {
+        for (int i = 0; i < num_ids; i++) {
+            if (ns->zns.changed_list.ids[i] == zone->zd->zslba) {
+                goto out;
+            }
+        }
+
+        ns->zns.changed_list.ids[num_ids] = zone->zd->zslba;
+        ns->zns.changed_list.num_ids = cpu_to_le16(num_ids + 1);
+    } else {
+        memset(&ns->zns.changed_list, 0x0, sizeof(NvmeChangedZoneList));
+        ns->zns.changed_list.num_ids = cpu_to_le16(0xffff);
+    }
+
+out:
+    nvme_zns_commit_zone(ns, zone);
+
+    if (NVME_AEC_ZDCN(n->features.async_config)) {
+        nvme_enqueue_event(n, ns, NVME_AER_TYPE_NOTICE,
+                           NVME_AER_INFO_NOTICE_ZONE_DESCR_CHANGED,
+                           NVME_LOG_CHANGED_ZONE_LIST);
+    }
+}
+
 static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
                                     NvmeZone *zone, NvmeZoneState to,
                                     NvmeRequest *req);
 
+static void nvme_zone_active_excursion(NvmeCtrl *n, NvmeNamespace *ns,
+                                       NvmeZone *zone, NvmeRequest *req)
+{
+    trace_pci_nvme_zone_active_excursion(ns->params.nsid, nvme_zslba(zone),
+                                         nvme_zs_str(zone));
+
+    nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSF, req);
+    NVME_ZA_SET(zone->zd->za, NVME_ZA_ZFC);
+
+    nvme_zone_changed(n, ns, zone, req);
+}
+
 static uint16_t nvme_zrm_release_open(NvmeCtrl *n, NvmeNamespace *ns,
                                       NvmeRequest *req)
 {
@@ -1237,6 +1289,57 @@ static uint16_t nvme_zrm_release_open(NvmeCtrl *n, NvmeNamespace *ns,
     return NVME_TOO_MANY_OPEN_ZONES;
 }
 
+static uint16_t nvme_zrm_release_active(NvmeCtrl *n, NvmeNamespace *ns,
+                                        NvmeRequest *req)
+{
+    NvmeZone *candidate = NULL;
+    NvmeZoneDescriptor *zd;
+    NvmeZoneState zs;
+
+    trace_pci_nvme_zone_zrm_release_active(nvme_cid(req), ns->params.nsid);
+
+    if (!(ns->params.zns.zoc & NVME_ID_NS_ZNS_ZOC_ZAE)) {
+        return NVME_TOO_MANY_ACTIVE_ZONES;
+    }
+
+    QTAILQ_FOREACH(candidate, &ns->zns.resources.lru_active, lru_entry) {
+        zd = candidate->zd;
+        zs = nvme_zs(candidate);
+
+        trace_pci_nvme_zone_zrm_candidate(nvme_cid(req), ns->params.nsid,
+                                          nvme_zslba(candidate),
+                                          nvme_wp(candidate), zs);
+
+        goto out;
+    }
+
+    /*
+     * If all zone resources are tied up on open zones we have to transition
+     * one of those to full.
+     */
+    QTAILQ_FOREACH(candidate, &ns->zns.resources.lru_open, lru_entry) {
+        zd = candidate->zd;
+        zs = nvme_zs(candidate);
+
+        trace_pci_nvme_zone_zrm_candidate(nvme_cid(req), ns->params.nsid,
+                                          nvme_zslba(candidate),
+                                          nvme_wp(candidate), zs);
+
+        /* the zone cannot be finished if it is currently writing */
+        if (candidate->wp_staging != le64_to_cpu(zd->wp)) {
+            continue;
+        }
+
+        break;
+    }
+
+    assert(candidate);
+
+out:
+    nvme_zone_active_excursion(n, ns, candidate, req);
+    return NVME_SUCCESS;
+}
+
 static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
                                     NvmeZone *zone, NvmeZoneState to,
                                     NvmeRequest *req)
@@ -1261,8 +1364,10 @@ static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
 
         case NVME_ZS_ZSC:
             if (!ns->zns.resources.active) {
-                trace_pci_nvme_err_too_many_active_zones(nvme_cid(req));
-                return NVME_TOO_MANY_ACTIVE_ZONES;
+                status = nvme_zrm_release_active(n, ns, req);
+                if (status) {
+                    return status;
+                }
             }
 
             ns->zns.resources.active--;
@@ -1273,8 +1378,10 @@ static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
         case NVME_ZS_ZSIO:
         case NVME_ZS_ZSEO:
             if (!ns->zns.resources.active) {
-                trace_pci_nvme_err_too_many_active_zones(nvme_cid(req));
-                return NVME_TOO_MANY_ACTIVE_ZONES;
+                status = nvme_zrm_release_active(n, ns, req);
+                if (status) {
+                    return status;
+                }
             }
 
             if (!ns->zns.resources.open) {
@@ -2498,6 +2605,40 @@ static uint16_t nvme_effects_log(NvmeCtrl *n, uint32_t buf_len, uint64_t off,
                     DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_changed_zone_info(NvmeCtrl *n, uint32_t buf_len,
+                                       uint64_t off, NvmeRequest *req)
+{
+    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
+    NvmeNamespace *ns = nvme_ns(n, nsid);
+    uint32_t trans_len;
+    uint16_t status;
+
+    if (unlikely(!ns)) {
+        return NVME_INVALID_NSID | NVME_DNR;
+    }
+
+    if (!nvme_ns_zoned(ns)) {
+        return NVME_INVALID_LOG_ID | NVME_DNR;
+    }
+
+    if (off > sizeof(NvmeChangedZoneList)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    trans_len = MIN(sizeof(NvmeChangedZoneList) - off, buf_len);
+
+    status = nvme_dma(n, (uint8_t *)&ns->zns.changed_list + off, trans_len,
+                      DMA_DIRECTION_FROM_DEVICE, req);
+    if (status) {
+        return status;
+    }
+
+    memset(&ns->zns.changed_list, 0x0, sizeof(NvmeChangedZoneList));
+    nvme_clear_events(n, NVME_AER_TYPE_NOTICE);
+
+    return NVME_SUCCESS;
+}
+
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCmd *cmd = &req->cmd;
@@ -2543,6 +2684,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
         return nvme_fw_log_info(n, len, off, req);
     case NVME_LOG_EFFECTS:
         return nvme_effects_log(n, len, off, req);
+    case NVME_LOG_CHANGED_ZONE_LIST:
+        return nvme_changed_zone_info(n, len, off, req);
     default:
         trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -3161,7 +3304,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
         if (((n->temperature >= n->features.temp_thresh_hi) ||
              (n->temperature <= n->features.temp_thresh_low)) &&
             NVME_AEC_SMART(n->features.async_config) & NVME_SMART_TEMPERATURE) {
-            nvme_enqueue_event(n, NVME_AER_TYPE_SMART,
+            nvme_enqueue_event(n, NULL, NVME_AER_TYPE_SMART,
                                NVME_AER_INFO_SMART_TEMP_THRESH,
                                NVME_LOG_SMART_INFO);
         }
@@ -3725,7 +3868,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
              * fashion.
              */
             if (n->outstanding_aers) {
-                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
+                nvme_enqueue_event(n, NULL, NVME_AER_TYPE_ERROR,
                                    NVME_AER_INFO_ERR_INVALID_DB_REGISTER,
                                    NVME_LOG_ERROR_INFO);
             }
@@ -3742,7 +3885,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
                            qid, new_head);
 
             if (n->outstanding_aers) {
-                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
+                nvme_enqueue_event(n, NULL, NVME_AER_TYPE_ERROR,
                                    NVME_AER_INFO_ERR_INVALID_DB_VALUE,
                                    NVME_LOG_ERROR_INFO);
             }
@@ -3779,7 +3922,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
                            " sqid=%"PRIu32", ignoring", qid);
 
             if (n->outstanding_aers) {
-                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
+                nvme_enqueue_event(n, NULL, NVME_AER_TYPE_ERROR,
                                    NVME_AER_INFO_ERR_INVALID_DB_REGISTER,
                                    NVME_LOG_ERROR_INFO);
             }
@@ -3796,7 +3939,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
                            qid, new_tail);
 
             if (n->outstanding_aers) {
-                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
+                nvme_enqueue_event(n, NULL, NVME_AER_TYPE_ERROR,
                                    NVME_AER_INFO_ERR_INVALID_DB_VALUE,
                                    NVME_LOG_ERROR_INFO);
             }
@@ -4095,6 +4238,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->ieee[2] = 0xb3;
     id->mdts = n->params.mdts;
     id->ver = cpu_to_le32(NVME_SPEC_VER);
+    id->oaes = cpu_to_le32(NVME_OAES_ZDCN);
     id->oacs = cpu_to_le16(0);
 
     /*
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 18f7b24ef5e9..ebd786b77868 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -78,7 +78,7 @@ pci_nvme_aer(uint16_t cid) "cid %"PRIu16""
 pci_nvme_aer_aerl_exceeded(void) "aerl exceeded"
 pci_nvme_aer_masked(uint8_t type, uint8_t mask) "type 0x%"PRIx8" mask 0x%"PRIx8""
 pci_nvme_aer_post_cqe(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
-pci_nvme_enqueue_event(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
+pci_nvme_enqueue_event(uint32_t nsid, uint8_t typ, uint8_t info, uint8_t log_page) "nsid 0x%"PRIx32" type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
 pci_nvme_enqueue_event_noqueue(int queued) "queued %d"
 pci_nvme_enqueue_event_masked(uint8_t typ) "type 0x%"PRIx8""
 pci_nvme_no_outstanding_aers(void) "ignoring event; no outstanding AERs"
@@ -91,6 +91,9 @@ pci_nvme_mmio_doorbell_sq(uint16_t sqid, uint16_t new_tail) "sqid %"PRIu16" new_
 pci_nvme_zone_zrm_transition(uint16_t cid, uint32_t nsid, uint64_t zslba, uint8_t from, uint8_t to) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" from 0x%"PRIx8" to 0x%"PRIx8""
 pci_nvme_zone_zrm_candidate(uint16_t cid, uint32_t nsid, uint64_t zslba, uint64_t wp, uint8_t zc) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64" wp 0x%"PRIx64" zc 0x%"PRIx8""
 pci_nvme_zone_zrm_release_open(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
+pci_nvme_zone_zrm_release_active(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
+pci_nvme_zone_changed(uint16_t cid, uint32_t nsid, uint64_t zslba) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64""
+pci_nvme_zone_active_excursion(uint32_t nsid, uint64_t zslba, const char *zc) "nsid %"PRIu32" zslba 0x%"PRIx64" zc \"%s\""
 pci_nvme_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 pci_nvme_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 pci_nvme_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 16/16] hw/block/nvme: support reset/finish recommended limits
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (14 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 15/16] hw/block/nvme: support zone active excursions Klaus Jensen
@ 2020-09-24 20:45 ` Klaus Jensen
  2020-09-24 22:43 ` [PATCH 00/16] hw/block/nvme: zoned namespace command set no-reply
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 20:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Klaus Jensen

From: Klaus Jensen <k.jensen@samsung.com>

Add the rrl, rrld, frl and frld device parameters. The frld and rrld
parameters specify the delay in seconds until the device sets the Finish
and Reset Zone Recommended attributes on zones when they are
transitioned to an Opened state or to the Full state, respectively. The
rrl and frl parameters specify the number of seconds the attributes
remain set before the device may perform an internal operation to
"clear" them again.
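
A sketch of the resulting timeline (parameter values are illustrative):

    -device nvme-ns,drive=nvm0,iocs=0x2,zns.zcap=4096,zns.frld=10,zns.frl=5

Here, a zone activated at time t gets the Finish Zone Recommended
attribute set at t+10s and cleared again at t+15s, after which the cycle
may restart. rrld and rrl work the same way for the Reset Zone
Recommended attribute on full (and read only) zones.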

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 docs/specs/nvme.txt   |  9 +++++
 hw/block/nvme-ns.h    | 18 ++++++++--
 hw/block/nvme.h       |  2 ++
 include/block/nvme.h  |  2 ++
 hw/block/nvme-ns.c    | 79 +++++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme.c       | 56 ++++++++++++++++++++++++++++--
 hw/block/trace-events |  3 ++
 7 files changed, 165 insertions(+), 4 deletions(-)

diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
index a286d9c07025..c2f30cfaaea6 100644
--- a/docs/specs/nvme.txt
+++ b/docs/specs/nvme.txt
@@ -41,6 +41,15 @@ nvme-ns Options
      will release active resources when needed by transitioning zones to the
      Full state.
 
+  `zns.{rrld,frld}`; Specifies the Reset and Finish Recommended Limit Delays,
+     that is, the number of seconds until the controller sets the Reset and
+     Finish Zone Recommended attributes on finished and opened zones
+     respectively.
+
+  `zns.{rrl,frl}`; Specifies the Reset and Finish Recommended Limits, that is,
+     the number of seconds until the controller clears the Reset and Finish
+     Zone Recommended attributes on finished and opened zones respectively.
+
 
 Reference Specifications
 ------------------------
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index d9093fdad984..8daf3a90d7f0 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -31,6 +31,10 @@ typedef struct NvmeNamespaceParams {
         uint32_t mar;
         uint32_t mor;
         uint16_t zoc;
+        uint32_t rrl;
+        uint32_t frl;
+        uint32_t rrld;
+        uint32_t frld;
     } zns;
 } NvmeNamespaceParams;
 
@@ -40,9 +44,15 @@ typedef struct NvmeZone {
 
     uint64_t wp_staging;
 
+    struct {
+        int64_t timestamp;
+    } stats;
+
     QTAILQ_ENTRY(NvmeZone) lru_entry;
 } NvmeZone;
 
+typedef QTAILQ_HEAD(, NvmeZone) NvmeZoneList;
+
 typedef struct NvmeNamespace {
     DeviceState  parent_obj;
     BlockConf    blkconf;
@@ -81,11 +91,15 @@ typedef struct NvmeNamespace {
             uint32_t open;
             uint32_t active;
 
-            QTAILQ_HEAD(, NvmeZone) lru_open;
-            QTAILQ_HEAD(, NvmeZone) lru_active;
+            NvmeZoneList lru_open;
+            NvmeZoneList lru_active;
         } resources;
 
         NvmeChangedZoneList changed_list;
+
+        NvmeZoneList lru_finished;
+        QEMUTimer *timer;
+        int64_t rrl_ns, rrld_ns, frl_ns, frld_ns;
     } zns;
 } NvmeNamespace;
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 8cd2d936548e..d840ff2cb97c 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -203,5 +203,7 @@ static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
 }
 
 int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
+void nvme_zone_changed(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone,
+                       NvmeRequest *req);
 
 #endif /* HW_NVME_H */
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 0948a262b2df..a66da8c46f9e 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -894,6 +894,8 @@ typedef struct QEMU_PACKED NvmeChangedZoneList {
 } NvmeChangedZoneList;
 
 #define NVME_ZA_ZFC  (1 << 0)
+#define NVME_ZA_FZR  (1 << 1)
+#define NVME_ZA_RZR  (1 << 2)
 #define NVME_ZA_ZDEV (1 << 7)
 
 #define NVME_ZA_SET(za, attrs)   ((za) |= (attrs))
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 9a9f728d791f..6243458a436c 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -50,6 +50,64 @@ const char *nvme_zs_to_str(NvmeZoneState zs)
     return NULL;
 }
 
+static void nvme_ns_set_zone_attrs(NvmeCtrl *n, NvmeNamespace *ns,
+                                   NvmeZoneList *zone_list, int64_t delay,
+                                   int64_t limit, int64_t *next_timer,
+                                   uint8_t attr)
+{
+    NvmeZone *zone, *next;
+    int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+    int64_t timestamp;
+
+    QTAILQ_FOREACH_SAFE(zone, zone_list, lru_entry, next) {
+        timestamp = zone->stats.timestamp;
+        if (now - timestamp < delay) {
+            *next_timer = MIN(*next_timer, timestamp + delay);
+            return;
+        }
+
+        if (now - timestamp < delay + limit) {
+            trace_pci_nvme_ns_set_zone_attr(nvme_zslba(zone), attr);
+            zone->zd->za |= attr;
+            *next_timer = MIN(*next_timer, timestamp + delay + limit);
+        } else {
+            trace_pci_nvme_ns_clear_zone_attr(nvme_zslba(zone), attr);
+            zone->zd->za &= ~attr;
+            QTAILQ_REMOVE(zone_list, zone, lru_entry);
+            QTAILQ_INSERT_TAIL(zone_list, zone, lru_entry);
+
+            zone->stats.timestamp = now;
+            *next_timer = MIN(*next_timer, now + delay);
+        }
+
+        nvme_zone_changed(n, ns, zone, NULL);
+    }
+}
+
+static void nvme_ns_process_timer(void *opaque)
+{
+    NvmeNamespace *ns = opaque;
+    BusState *s = qdev_get_parent_bus(&ns->parent_obj);
+    NvmeCtrl *n = NVME(s->parent);
+    int64_t next_timer = INT64_MAX;
+
+    trace_pci_nvme_ns_process_timer(ns->params.nsid);
+
+    nvme_ns_set_zone_attrs(n, ns, &ns->zns.resources.lru_open, ns->zns.frld_ns,
+                           ns->zns.frl_ns, &next_timer, NVME_ZA_FZR);
+
+    nvme_ns_set_zone_attrs(n, ns, &ns->zns.resources.lru_active,
+                           ns->zns.frld_ns, ns->zns.frl_ns, &next_timer,
+                           NVME_ZA_FZR);
+
+    nvme_ns_set_zone_attrs(n, ns, &ns->zns.lru_finished, ns->zns.rrld_ns,
+                           ns->zns.rrl_ns, &next_timer, NVME_ZA_RZR);
+
+    if (next_timer != INT64_MAX) {
+        timer_mod(ns->zns.timer, next_timer);
+    }
+}
+
 static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp)
 {
     int ret;
@@ -121,6 +179,21 @@ static void nvme_ns_init_zoned(NvmeNamespace *ns)
 
     id_ns->ncap = ns->zns.num_zones * ns->params.zns.zcap;
 
+    id_ns_zns->rrl = cpu_to_le32(ns->params.zns.rrl);
+    id_ns_zns->frl = cpu_to_le32(ns->params.zns.frl);
+
+    if (ns->params.zns.rrld || ns->params.zns.frld) {
+        ns->zns.rrl_ns = ns->params.zns.rrl * NANOSECONDS_PER_SECOND;
+        ns->zns.rrld_ns = ns->params.zns.rrld * NANOSECONDS_PER_SECOND;
+        ns->zns.frl_ns = ns->params.zns.frl * NANOSECONDS_PER_SECOND;
+        ns->zns.frld_ns = ns->params.zns.frld * NANOSECONDS_PER_SECOND;
+
+        ns->zns.timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
+                                     nvme_ns_process_timer, ns);
+
+        QTAILQ_INIT(&ns->zns.lru_finished);
+    }
+
     id_ns_zns->mar = cpu_to_le32(ns->params.zns.mar);
     id_ns_zns->mor = cpu_to_le32(ns->params.zns.mor);
 
@@ -266,6 +339,8 @@ static int nvme_ns_setup_blk_pstate(NvmeNamespace *ns, Error **errp)
                         ns->zns.resources.active--;
                         QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_active, zone,
                                            lru_entry);
+                        zone->stats.timestamp =
+                            qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
                         continue;
                     }
 
@@ -445,6 +520,10 @@ static Property nvme_ns_props[] = {
     DEFINE_PROP_UINT32("zns.mar", NvmeNamespace, params.zns.mar, 0xffffffff),
     DEFINE_PROP_UINT32("zns.mor", NvmeNamespace, params.zns.mor, 0xffffffff),
     DEFINE_PROP_UINT16("zns.zoc", NvmeNamespace, params.zns.zoc, 0),
+    DEFINE_PROP_UINT32("zns.rrl", NvmeNamespace, params.zns.rrl, 0),
+    DEFINE_PROP_UINT32("zns.frl", NvmeNamespace, params.zns.frl, 0),
+    DEFINE_PROP_UINT32("zns.rrld", NvmeNamespace, params.zns.rrld, 0),
+    DEFINE_PROP_UINT32("zns.frld", NvmeNamespace, params.zns.frld, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ecc88e858eee..d472ed240059 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1200,8 +1200,8 @@ static inline void nvme_zone_reset_wp(NvmeZone *zone)
     zone->wp_staging = nvme_zslba(zone);
 }
 
-static void nvme_zone_changed(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone,
-                              NvmeRequest *req)
+void nvme_zone_changed(NvmeCtrl *n, NvmeNamespace *ns, NvmeZone *zone,
+                       NvmeRequest *req)
 {
     uint16_t num_ids = le16_to_cpu(ns->zns.changed_list.num_ids);
 
@@ -1244,6 +1244,7 @@ static void nvme_zone_active_excursion(NvmeCtrl *n, NvmeNamespace *ns,
 
     nvme_zrm_transition(n, ns, zone, NVME_ZS_ZSF, req);
     NVME_ZA_SET(zone->zd->za, NVME_ZA_ZFC);
+    NVME_ZA_CLEAR(zone->zd->za, NVME_ZA_FZR);
 
     nvme_zone_changed(n, ns, zone, req);
 }
@@ -1340,6 +1341,16 @@ out:
     return NVME_SUCCESS;
 }
 
+static void nvme_zone_activate(NvmeNamespace *ns, NvmeZone *zone)
+{
+    zone->stats.timestamp = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+
+    if (ns->params.zns.frld && !timer_pending(ns->zns.timer)) {
+        int64_t next_timer = zone->stats.timestamp + ns->zns.frld_ns;
+        timer_mod(ns->zns.timer, next_timer);
+    }
+}
+
 static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
                                     NvmeZone *zone, NvmeZoneState to,
                                     NvmeRequest *req)
@@ -1373,6 +1384,8 @@ static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
             ns->zns.resources.active--;
             QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_active, zone, lru_entry);
 
+            nvme_zone_activate(ns, zone);
+
             break;
 
         case NVME_ZS_ZSIO:
@@ -1395,6 +1408,7 @@ static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
             ns->zns.resources.open--;
             QTAILQ_INSERT_TAIL(&ns->zns.resources.lru_open, zone, lru_entry);
 
+            nvme_zone_activate(ns, zone);
             break;
 
         default:
@@ -1521,6 +1535,43 @@ static uint16_t nvme_zrm_transition(NvmeCtrl *n, NvmeNamespace *ns,
     }
 
     nvme_zs_set(zone, to);
+
+    if (ns->params.zns.rrld) {
+        switch (to) {
+        case NVME_ZS_ZSRO:
+            /* clock is already ticking if the zone was already full */
+            if (from == NVME_ZS_ZSF) {
+                break;
+            }
+
+            /* fallthrough */
+
+        case NVME_ZS_ZSF:
+            QTAILQ_INSERT_TAIL(&ns->zns.lru_finished, zone, lru_entry);
+
+            zone->stats.timestamp = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+
+            if (!timer_pending(ns->zns.timer)) {
+                int64_t next_timer = zone->stats.timestamp + ns->zns.rrld_ns;
+                timer_mod(ns->zns.timer, next_timer);
+            }
+
+            break;
+
+        case NVME_ZS_ZSE:
+        case NVME_ZS_ZSO:
+            if (from == NVME_ZS_ZSF || from == NVME_ZS_ZSRO) {
+                QTAILQ_REMOVE(&ns->zns.lru_finished, zone, lru_entry);
+                zone->stats.timestamp = 0;
+            }
+
+            break;
+
+        default:
+            break;
+        }
+    }
+
     return NVME_SUCCESS;
 }
 
@@ -1727,6 +1778,7 @@ static uint16_t nvme_zone_mgmt_send_finish(NvmeCtrl *n, NvmeRequest *req,
         return status;
     }
 
+    NVME_ZA_CLEAR(zone->zd->za, NVME_ZA_FZR);
     if (nvme_zns_commit_zone(ns, zone) < 0) {
         return NVME_INTERNAL_DEV_ERROR;
     }
diff --git a/hw/block/trace-events b/hw/block/trace-events
index ebd786b77868..917520b6f4d4 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -94,6 +94,9 @@ pci_nvme_zone_zrm_release_open(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid
 pci_nvme_zone_zrm_release_active(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
 pci_nvme_zone_changed(uint16_t cid, uint32_t nsid, uint64_t zslba) "cid %"PRIu16" nsid %"PRIu32" zslba 0x%"PRIx64""
 pci_nvme_zone_active_excursion(uint32_t nsid, uint64_t zslba, const char *zc) "nsid %"PRIu32" zslba 0x%"PRIx64" zc \"%s\""
+pci_nvme_ns_process_timer(uint32_t nsid) "nsid %"PRIu32""
+pci_nvme_ns_set_zone_attr(uint64_t zslba, uint8_t attr) "zslba 0x%"PRIx64" attr 0x%"PRIx8""
+pci_nvme_ns_clear_zone_attr(uint64_t zslba, uint8_t attr) "zslba 0x%"PRIx64" attr 0x%"PRIx8""
 pci_nvme_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 pci_nvme_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 pci_nvme_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 03/16] hw/block/nvme: make lba data size configurable
  2020-09-24 20:45 ` [PATCH 03/16] hw/block/nvme: make lba data size configurable Klaus Jensen
@ 2020-09-24 21:13   ` Klaus Jensen
  0 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-24 21:13 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, Max Reitz,
	Keith Busch, Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 270 bytes --]

On Sep 24 22:45, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> Allows the LBA data size (lbads) to be set between 9 and 12.
> 

The patch from Dmitry that uses the common logical_block_size BlockConf
parameter is probably better than this.
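
That way it would just be the standard BlockConf property, e.g.

    -device nvme-ns,drive=nvm0,logical_block_size=4096

instead of a device-specific lbads parameter.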

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/16] hw/block/nvme: zoned namespace command set
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (15 preceding siblings ...)
  2020-09-24 20:45 ` [PATCH 16/16] hw/block/nvme: support reset/finish recommended limits Klaus Jensen
@ 2020-09-24 22:43 ` no-reply
  2020-09-25  7:55   ` Klaus Jensen
  2020-09-25  1:24 ` Keith Busch
  2020-09-25 17:06 ` Dmitry Fomichev
  18 siblings, 1 reply; 24+ messages in thread
From: no-reply @ 2020-09-24 22:43 UTC (permalink / raw)
  To: its; +Cc: fam, kwolf, qemu-block, k.jensen, qemu-devel, mreitz, its, kbusch

Patchew URL: https://patchew.org/QEMU/20200924204516.1881843-1-its@irrelevant.dk/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20200924204516.1881843-1-its@irrelevant.dk
Subject: [PATCH 00/16] hw/block/nvme: zoned namespace command set

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
a59b4bb hw/block/nvme: support reset/finish recommended limits
83d78bf hw/block/nvme: support zone active excursions
eb9852e hw/block/nvme: allow open to close transitions by controller
4018228 hw/block/nvme: track and enforce zone resources
3f934e8 hw/block/nvme: add the zone append command
6343d89 hw/block/nvme: add the zone management send command
a527eef hw/block/nvme: add the zone management receive command
fd032f9 hw/block/nvme: add basic read/write for zoned namespaces
aef6511 hw/block/nvme: support namespace types
ab4c119 hw/block/nvme: add commands supported and effects log page
3eb56a0 hw/block/nvme: add support for dulbe and block utilization tracking
b532fe0 hw/block/nvme: consolidate read, write and write zeroes
e992082 hw/block/nvme: reject io commands if only admin command set selected
8a19e08 hw/block/nvme: make lba data size configurable
3edfb11 hw/block/nvme: add trace event for requests with non-zero status code
8de0031 hw/block/nvme: add nsid to get/setfeat trace events

=== OUTPUT BEGIN ===
1/16 Checking commit 8de00318317d (hw/block/nvme: add nsid to get/setfeat trace events)
2/16 Checking commit 3edfb1110713 (hw/block/nvme: add trace event for requests with non-zero status code)
3/16 Checking commit 8a19e08afeb7 (hw/block/nvme: make lba data size configurable)
4/16 Checking commit e992082bb9bf (hw/block/nvme: reject io commands if only admin command set selected)
5/16 Checking commit b532fe07a157 (hw/block/nvme: consolidate read, write and write zeroes)
6/16 Checking commit 3eb56a0748fb (hw/block/nvme: add support for dulbe and block utilization tracking)
7/16 Checking commit ab4c119d9d68 (hw/block/nvme: add commands supported and effects log page)
ERROR: Macros with complex values should be enclosed in parenthesis
#46: FILE: hw/block/nvme.c:131:
+#define NVME_EFFECTS_NVM_INITIALIZER                   \
+    [NVME_CMD_FLUSH]            = NVME_EFFECTS_CSUPP | \
+                                  NVME_EFFECTS_LBCC,   \
+    [NVME_CMD_WRITE]            = NVME_EFFECTS_CSUPP | \
+                                  NVME_EFFECTS_LBCC,   \
+    [NVME_CMD_READ]             = NVME_EFFECTS_CSUPP,  \
+    [NVME_CMD_WRITE_ZEROES]     = NVME_EFFECTS_CSUPP | \
+                                  NVME_EFFECTS_LBCC

total: 1 errors, 0 warnings, 149 lines checked

Patch 7/16 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

8/16 Checking commit aef6511be82b (hw/block/nvme: support namespace types)
9/16 Checking commit fd032f918b37 (hw/block/nvme: add basic read/write for zoned namespaces)
10/16 Checking commit a527eef9b9fe (hw/block/nvme: add the zone management receive command)
11/16 Checking commit 6343d89bf734 (hw/block/nvme: add the zone management send command)
WARNING: Block comments use a leading /* on a separate line
#66: FILE: hw/block/nvme.c:1118:
+    return __nvme_allocate(ns, slba, nlb, false /* deallocate */);

WARNING: Block comments use a leading /* on a separate line
#77: FILE: hw/block/nvme.c:1129:
+    return __nvme_allocate(ns, slba, nlb, true /* deallocate */);

total: 0 errors, 2 warnings, 704 lines checked

Patch 11/16 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
12/16 Checking commit 3f934e89564a (hw/block/nvme: add the zone append command)
13/16 Checking commit 40182287d15e (hw/block/nvme: track and enforce zone resources)
14/16 Checking commit eb9852ee9c0f (hw/block/nvme: allow open to close transitions by controller)
15/16 Checking commit 83d78bf53392 (hw/block/nvme: support zone active excursions)
16/16 Checking commit a59b4bb2c855 (hw/block/nvme: support reset/finish recommended limits)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20200924204516.1881843-1-its@irrelevant.dk/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/16] hw/block/nvme: zoned namespace command set
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (16 preceding siblings ...)
  2020-09-24 22:43 ` [PATCH 00/16] hw/block/nvme: zoned namespace command set no-reply
@ 2020-09-25  1:24 ` Keith Busch
  2020-09-25  5:27   ` Klaus Jensen
  2020-09-25 17:06 ` Dmitry Fomichev
  18 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2020-09-25  1:24 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz

On Thu, Sep 24, 2020 at 10:45:00PM +0200, Klaus Jensen wrote:
> Finally, I wrote this. I am *NOT* saying that this somehow makes it
> better, but as a maintainer, is a big deal to me since both series are
> arguably a lot of code to maintain and support (both series are about
> the same size). But - I am not the only maintainer, so if Keith (now
> suddenly placed in the grim role as some sort of arbiter) signs off on
> Dmitry's series, then so be it, I will rest my case.

I think it's neat there's enough interest in ZNS that we have multiple
solutions to consider.

I'm still catching up from virtual conferencing, but I should be able to have a
look over the weekend. I know everyone's put a lot of work into the development
of this capability, so maybe there's something to be taken from both? Not sure
yet if that's feasible, but I'll have a better idea on that later.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/16] hw/block/nvme: zoned namespace command set
  2020-09-25  1:24 ` Keith Busch
@ 2020-09-25  5:27   ` Klaus Jensen
  0 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-25  5:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 1027 bytes --]

On Sep 24 19:24, Keith Busch wrote:
> On Thu, Sep 24, 2020 at 10:45:00PM +0200, Klaus Jensen wrote:
> > Finally, I wrote this. I am *NOT* saying that this somehow makes it
> > better, but as a maintainer, is a big deal to me since both series are
> > arguably a lot of code to maintain and support (both series are about
> > the same size). But - I am not the only maintainer, so if Keith (now
> > suddenly placed in the grim role as some sort of arbiter) signs off on
> > Dmitry's series, then so be it, I will rest my case.
> 
> I think it's neat there's enough interest in ZNS that we have multiple
> solutions to consider.
> 

Yes - it is a luxury problem :)

> I'm still catching up from virtual conferencing, but I should be able to have a
> look over the weekend. I know everyone's put a lot of work into the development
> of this capability, so maybe there's something to be taken from both? Not sure
> yet if that's feasible, but I'll have a better idea on that later.

Thanks Keith, sounds good.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/16] hw/block/nvme: zoned namespace command set
  2020-09-24 22:43 ` [PATCH 00/16] hw/block/nvme: zoned namespace command set no-reply
@ 2020-09-25  7:55   ` Klaus Jensen
  0 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-25  7:55 UTC (permalink / raw)
  To: qemu-devel; +Cc: fam, kwolf, qemu-block, k.jensen, mreitz, kbusch

[-- Attachment #1: Type: text/plain, Size: 1989 bytes --]

On Sep 24 15:43, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/20200924204516.1881843-1-its@irrelevant.dk/
> 
> 
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 
> Type: series
> Message-id: 20200924204516.1881843-1-its@irrelevant.dk
> Subject: [PATCH 00/16] hw/block/nvme: zoned namespace command set
> 
> 7/16 Checking commit ab4c119d9d68 (hw/block/nvme: add commands supported and effects log page)
> ERROR: Macros with complex values should be enclosed in parenthesis
> #46: FILE: hw/block/nvme.c:131:
> +#define NVME_EFFECTS_NVM_INITIALIZER                   \
> +    [NVME_CMD_FLUSH]            = NVME_EFFECTS_CSUPP | \
> +                                  NVME_EFFECTS_LBCC,   \
> +    [NVME_CMD_WRITE]            = NVME_EFFECTS_CSUPP | \
> +                                  NVME_EFFECTS_LBCC,   \
> +    [NVME_CMD_READ]             = NVME_EFFECTS_CSUPP,  \
> +    [NVME_CMD_WRITE_ZEROES]     = NVME_EFFECTS_CSUPP | \
> +                                  NVME_EFFECTS_LBCC
> 

Pffft. If anyone has a better idea how to make this nice and not too
repetitive in the code, please let me know ;)
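
(The initializer is meant to be spliced directly into a
designated-initializer array, something like

    /* illustrative usage; the actual array name and type may differ */
    static const uint32_t nvme_effects_nvm[256] = {
        NVME_EFFECTS_NVM_INITIALIZER,
    };

which is exactly why it cannot be wrapped in parentheses, so this looks
like a checkpatch false positive to me.)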

> 11/16 Checking commit 6343d89bf734 (hw/block/nvme: add the zone management send command)
> WARNING: Block comments use a leading /* on a separate line
> #66: FILE: hw/block/nvme.c:1118:
> +    return __nvme_allocate(ns, slba, nlb, false /* deallocate */);
> 
> WARNING: Block comments use a leading /* on a separate line
> #77: FILE: hw/block/nvme.c:1129:
> +    return __nvme_allocate(ns, slba, nlb, true /* deallocate */);
> 
> total: 0 errors, 2 warnings, 704 lines checked
> 
> Patch 11/16 has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.

I seem to remember that this has come up before, but it doesn't look
like a fix for that has gone into checkpatch.pl.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 00/16] hw/block/nvme: zoned namespace command set
  2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
                   ` (17 preceding siblings ...)
  2020-09-25  1:24 ` Keith Busch
@ 2020-09-25 17:06 ` Dmitry Fomichev
  2020-09-25 17:27   ` Klaus Jensen
  18 siblings, 1 reply; 24+ messages in thread
From: Dmitry Fomichev @ 2020-09-25 17:06 UTC (permalink / raw)
  To: Klaus Jensen, Keith Busch, qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Max Reitz, qemu-block, Klaus Jensen



> -----Original Message-----
> From: Qemu-block <qemu-block-
> bounces+dmitry.fomichev=wdc.com@nongnu.org> On Behalf Of Klaus
> Jensen
> Sent: Thursday, September 24, 2020 4:45 PM
> To: qemu-devel@nongnu.org
> Cc: Fam Zheng <fam@euphon.net>; Kevin Wolf <kwolf@redhat.com>; qemu-
> block@nongnu.org; Klaus Jensen <k.jensen@samsung.com>; Max Reitz
> <mreitz@redhat.com>; Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <its@irrelevant.dk>
> Subject: [PATCH 00/16] hw/block/nvme: zoned namespace command set
> 
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> [...]
> 
>   * Standard blockdev-based approach to persistent state. The
>     implementation uses a plain blockdev associated with the nvme-ns
>     device for storing state persistently. This same 'pstate' blockdev
>     is also used for logical block allocation tracking.
> 

Is persistent state mandatory or optional? Sorry for asking, but I am
still catching up with your other patches. I think having it optional is
a big benefit for performance testing.

> 
>   * Relies on automatic configuration of DLFEAT according to what the
>     underlying blockdev provides (i.e. BDRV_O_UNMAP for guaranteeing
>     zeroes on discarded blocks) for handling reads in the gaps between
>     write pointer, ZCAP and ZSZE. Issues discards for zone resets. This
>     removes the zero filling.
> 

Doesn't this make non-zero fill patterns impossible? In many storage
environments, vendors and admins are adamant about having varying
fill patterns to see who caused the data corruption if there is one.
Not sure how important this is for the particular application, but the
WDC series provides the functionality to specify the fill pattern.

> 
> [...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/16] hw/block/nvme: zoned namespace command set
  2020-09-25 17:06 ` Dmitry Fomichev
@ 2020-09-25 17:27   ` Klaus Jensen
  0 siblings, 0 replies; 24+ messages in thread
From: Klaus Jensen @ 2020-09-25 17:27 UTC (permalink / raw)
  To: Dmitry Fomichev
  Cc: Fam Zheng, Kevin Wolf, qemu-block, Klaus Jensen, qemu-devel,
	Max Reitz, Keith Busch


On Sep 25 17:06, Dmitry Fomichev wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> > 
> >   * Standard blockdev-based approach to persistent state. The
> >     implementation uses a plain blockdev associated with the nvme-ns
> >     device for storing state persistently. This same 'pstate' blockdev
> >     is also used for logical block allocation tracking.
> > 
> 
> Is persistent state mandatory or optional? Sorry for asking, but I am
> still catching up with your other patches. I think having it optional is
> a big benefit for performance testing.
> 

Yes, the 'pstate' blockdev is optional.
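
For context, a hypothetical invocation could look like this (a sketch
only: the 'pstate' property name and its placement are assumptions
based on the cover letter, not verified against the patches):

  # persistent: zone state / allocation tracking survives restarts
  -drive id=nvm,file=nvm.img,format=raw,if=none
  -drive id=pst,file=pstate.img,format=raw,if=none
  -device nvme,id=nvme0,serial=deadbeef
  -device nvme-ns,bus=nvme0,drive=nvm,pstate=pst

  # volatile: omit the pstate drive, e.g. for performance testing
  -device nvme-ns,bus=nvme0,drive=nvm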

> > 
> >   * Relies on automatic configuration of DLFEAT according to what the
> >     underlying blockdev provides (i.e. BDRV_O_UNMAP for guaranteeing
> >     zeroes on discarded blocks) for handling reads in the gaps between
> >     write pointer, ZCAP and ZSZE. Issues discards for zone resets. This
> >     removes the zero filling.
> > 
> 
> Doesn't this make non-zero fill patterns impossible? In many storage
> environments, vendors and admins are adamant about using distinct fill
> patterns so that, if data corruption occurs, it is possible to tell who
> caused it. I am not sure how important this is for this particular
> application, but the WDC series provides the functionality to specify
> the fill pattern.
> 

That *is* a good point.

I think it should default to the 0x00 fill pattern when the underlying
blockdev can guarantee it, and to "no fill pattern reported" when
zeroes cannot be guaranteed. But an option to enable filling with, say,
0xff, like you do in your series, would be nice.
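
To make that concrete, here is a minimal C sketch of how such a
default could be derived. The helper and the fill_ff option are
hypothetical (not from either series); the bit values follow the
DLFEAT field definition in the NVMe specification:

/* Hypothetical sketch against QEMU's block layer. */
#include "qemu/osdep.h"
#include "sysemu/block-backend.h"

static uint8_t nvme_ns_pick_dlfeat(BlockBackend *blk, bool fill_ff)
{
    /*
     * DLFEAT bits 2:0 (Identify Namespace): 000b = read behavior of
     * deallocated blocks not reported, 001b = reads as all 0x00,
     * 010b = reads as all 0xFF.
     */
    if (!(blk_get_flags(blk) & BDRV_O_UNMAP)) {
        /* cannot guarantee the contents of discarded blocks */
        return 0x0;
    }

    return fill_ff ? 0x2 : 0x1;
}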



Thread overview: 24+ messages
2020-09-24 20:45 [PATCH 00/16] hw/block/nvme: zoned namespace command set Klaus Jensen
2020-09-24 20:45 ` [PATCH 01/16] hw/block/nvme: add nsid to get/setfeat trace events Klaus Jensen
2020-09-24 20:45 ` [PATCH 02/16] hw/block/nvme: add trace event for requests with non-zero status code Klaus Jensen
2020-09-24 20:45 ` [PATCH 03/16] hw/block/nvme: make lba data size configurable Klaus Jensen
2020-09-24 21:13   ` Klaus Jensen
2020-09-24 20:45 ` [PATCH 04/16] hw/block/nvme: reject io commands if only admin command set selected Klaus Jensen
2020-09-24 20:45 ` [PATCH 05/16] hw/block/nvme: consolidate read, write and write zeroes Klaus Jensen
2020-09-24 20:45 ` [PATCH 06/16] hw/block/nvme: add support for dulbe and block utilization tracking Klaus Jensen
2020-09-24 20:45 ` [PATCH 07/16] hw/block/nvme: add commands supported and effects log page Klaus Jensen
2020-09-24 20:45 ` [PATCH 08/16] hw/block/nvme: support namespace types Klaus Jensen
2020-09-24 20:45 ` [PATCH 09/16] hw/block/nvme: add basic read/write for zoned namespaces Klaus Jensen
2020-09-24 20:45 ` [PATCH 10/16] hw/block/nvme: add the zone management receive command Klaus Jensen
2020-09-24 20:45 ` [PATCH 11/16] hw/block/nvme: add the zone management send command Klaus Jensen
2020-09-24 20:45 ` [PATCH 12/16] hw/block/nvme: add the zone append command Klaus Jensen
2020-09-24 20:45 ` [PATCH 13/16] hw/block/nvme: track and enforce zone resources Klaus Jensen
2020-09-24 20:45 ` [PATCH 14/16] hw/block/nvme: allow open to close transitions by controller Klaus Jensen
2020-09-24 20:45 ` [PATCH 15/16] hw/block/nvme: support zone active excursions Klaus Jensen
2020-09-24 20:45 ` [PATCH 16/16] hw/block/nvme: support reset/finish recommended limits Klaus Jensen
2020-09-24 22:43 ` [PATCH 00/16] hw/block/nvme: zoned namespace command set no-reply
2020-09-25  7:55   ` Klaus Jensen
2020-09-25  1:24 ` Keith Busch
2020-09-25  5:27   ` Klaus Jensen
2020-09-25 17:06 ` Dmitry Fomichev
2020-09-25 17:27   ` Klaus Jensen
